Explorer
Posts: 9
Registered: ‎10-29-2014
Accepted Solution

Best way to dump all data from Spark Streaming job into HDFS

I have a job that calculates some statistics over a short rolling time window, and I would like to be able to dump all the data into HDFS. I have come to learn that HDFS does not support appends. Setting my Spark app to make a new directory and write a new file for every RDD is not viable. After searching around I found an Avro object, DataFileWriter, which looks like it would work, but according to the Spark user list message referenced below the object won't serialize, so it won't make it out to the worker nodes. I have read that SparkSQL can consume from Kafka and then write to a Parquet file, which seems like it would solve my problem, but Cloudera does not include SparkSQL.

Would it be out of the question to try to get SparkSQL and have it write to my CDH HDFS?

I don't think I would be able to hook those two up.

 

Does anyone know of possible solutions to the problem I have?

 

http://apache-spark-user-list.1001560.n3.nabble.com/Persisting-Avro-files-from-Spark-streaming-td109...

Cloudera Employee
Posts: 366
Registered: ‎07-29-2013

Re: Best way to dump all data from Spark Streaming job into HDFS

Yeah, because it makes lots of small files? One option is to have a post-processing job that merges the small files together, which is what hadoop fs -getmerge does.
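
To make the merge idea concrete, here is a rough sketch of a compaction step. It is only an illustration: it runs against a local directory standing in for an HDFS output directory, and the function name merge_small_files and the part-file naming are made up for the example. It assumes the small files are plain text or some other format that can simply be concatenated.

```python
import os
import shutil


def merge_small_files(src_dir, dest_path):
    """Concatenate every regular file in src_dir into dest_path, in name order.

    Hypothetical helper for illustration; src_dir stands in for the directory
    of small per-batch files, and dest_path should live outside src_dir.
    """
    # List the inputs before creating the output so the output is never
    # picked up as one of its own inputs.
    names = sorted(
        name for name in os.listdir(src_dir)
        if os.path.isfile(os.path.join(src_dir, name))
    )
    with open(dest_path, "wb") as dest:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as src:
                # Stream the bytes across without loading a whole file in memory.
                shutil.copyfileobj(src, dest)
    return len(names)
```

Byte-level concatenation like this would not be safe for container formats such as Avro or Parquet, which carry their own headers or footers; those need a format-aware merge.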

 

The general answer to getting an unserializable object to the workers is to create it on the workers instead. You would make your writer or connection object once per partition, inside the function that processes that partition, and use it to write that partition's records.
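
A minimal sketch of that pattern, under some assumptions: the plain file handle below stands in for a non-serializable writer such as Avro's DataFileWriter, the local out_dir stands in for an HDFS path, and the name save_partition is hypothetical. The point is only that the writer is opened inside the function that runs on the worker, so it never has to be serialized on the driver.

```python
import os
import uuid


def save_partition(records, out_dir):
    """Write one partition's records to a single new file under out_dir.

    The writer is created here, on the worker, once per partition,
    instead of being built on the driver and shipped inside a closure.
    """
    # A unique name per call so concurrent tasks never write to the same file.
    path = os.path.join(out_dir, "part-" + uuid.uuid4().hex + ".txt")
    count = 0
    with open(path, "w") as writer:  # stand-in for the real writer object
        for record in records:
            writer.write(str(record) + "\n")
            count += 1
    return path, count
```

In a PySpark Streaming job this would presumably be wired up as dstream.foreachRDD(lambda rdd: rdd.foreachPartition(lambda it: save_partition(it, out_dir))), so each batch writes one file per partition rather than one file per record.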

 

Spark SQL is distributed as part of CDH. Lots of stuff can consume from Kafka. You don't need it to write to Parquet files.

Explorer
Posts: 9
Registered: ‎10-29-2014

Re: Best way to dump all data from Spark Streaming job into HDFS

Thanks!