Rory Byrne February 2016

What is the nature of the Key, Value, and InputFormat types passed to Spark's StreamingContext.fileStream[K, V, F]("directory")?

From what I understand, streaming text files from a directory requires a key of type LongWritable, a value of Text, and a format of TextInputFormat. These are passed automatically in the textFileStream() method.

Is the key in that case the line number, with the value being the text on that line?

What should the key and value types be for ParquetInputFormat - and more generally, how can I figure this out for myself regarding other file types?

Also, how do these types relate to the DStream that is returned by the method? If I pass a Parquet file which has rows of, say, 100 columns, how will this be parsed into RDDs and DStreams by Spark?

Answers


sbtcpf March 2016

For ParquetInputFormat I think the key type must be Void, and the value type an object representing your data.

ssc.fileStream[Void, YourObject, ParquetInputFormat[YourObject]]("hdfs:...")
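
On the more general question of how to work out the types: as far as I can tell, K and V have to match the type parameters of the InputFormat you pass as F, so the place to look is the class declaration of that format. TextInputFormat extends FileInputFormat<LongWritable, Text>, and ParquetInputFormat<T> extends FileInputFormat<Void, T>, which is where the Void above comes from. The returned DStream is a stream of (K, V) pairs. For the text case the key is the byte offset of the line within its file rather than a line number. Here is a minimal sketch of the text case written out by hand; the directory "/tmp/stream-input", the app name, and the 10 second batch interval are placeholders I made up, not anything from the question.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object FileStreamTypes {
  def main(args: Array[String]): Unit = {
    // Placeholder configuration: local mode with a 10 second batch interval.
    val conf = new SparkConf().setAppName("FileStreamTypes").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))

    // K and V mirror the InputFormat's own type parameters:
    // TextInputFormat extends FileInputFormat[LongWritable, Text].
    // "/tmp/stream-input" is a hypothetical directory used for illustration.
    val pairs: DStream[(LongWritable, Text)] =
      ssc.fileStream[LongWritable, Text, TextInputFormat]("/tmp/stream-input")

    // The key is the byte offset of each line in its file; drop it and
    // copy the value into an immutable String, since Hadoop record readers
    // may reuse the same Text instance between records.
    val lines: DStream[String] = pairs.map { case (_, text) => text.toString }

    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The last map step is essentially what textFileStream() gives you directly. For Parquet the same pattern should apply with DStream[(Void, YourObject)], so you would map to the value and ignore the key, which is always null.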

Post Status

Asked in February 2016
Viewed 3,294 times
Voted 9
Answered 1 times
