Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Best way to dump all data from Spark Streaming job into HDFS

Explorer

I have a Spark Streaming job that calculates some statistics over a short rolling window, and I would like to dump all of the data into HDFS. I have come to learn that HDFS does not support appends. Having my Spark app make a new directory and write a new file for every RDD is not viable. After searching around I found Avro's DataFileWriter, which looks like it would work, but according to the Spark user group message referenced below that object won't serialize, so it won't make it out to the worker nodes. I have read that Spark SQL can consume from Kafka and then write to a Parquet file, which seems like it would solve my problem, but as far as I can tell Cloudera does not include Spark SQL.
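
For context, this is roughly what I have been trying (a simplified sketch; the source, window sizes, and paths are just placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair DStream operations on older Spark versions

val conf = new SparkConf().setAppName("rolling-stats")
val ssc = new StreamingContext(conf, Seconds(30))

// Placeholder source; the real job reads from a different stream.
val stats = ssc.socketTextStream("somehost", 9999)
  .map(line => (line.split(",")(0), 1L))
  .reduceByKeyAndWindow(_ + _, Seconds(300), Seconds(30))

// This writes a brand new directory (prefix-TIME_IN_MS) full of small part
// files for every batch interval, which is the part that is not viable for me.
stats.saveAsTextFiles("hdfs:///user/me/stats/window")

ssc.start()
ssc.awaitTermination()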


Would it be out of the question to try to get Spark SQL and have it write to my CDH HDFS? I don't think I would be able to hook those two up.

Does anyone know of possible solutions to this problem?


http://apache-spark-user-list.1001560.n3.nabble.com/Persisting-Avro-files-from-Spark-streaming-td109...


1 ACCEPTED SOLUTION

Master Collaborator

Yeah, because it makes lots of small files? One option is to have a post-processing job that merges the files together, for example with hadoop fs -getmerge.
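
If it helps, here is a minimal sketch of that kind of compaction step using Hadoop's FileUtil.copyMerge (the programmatic equivalent of hadoop fs -getmerge, available in Hadoop 2.x). The paths are placeholders, and this is a raw byte-level concatenation, so it only makes sense for line-oriented text output, not Avro or Parquet containers:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)

// Concatenate all the small part files from one batch directory into a single file.
FileUtil.copyMerge(
  fs, new Path("hdfs:///user/me/stats/window-1400000000000"),
  fs, new Path("hdfs:///user/me/stats/merged/window-1400000000000.txt"),
  false,   // do not delete the source directory
  conf,
  null)    // no extra string appended between files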


The general answer to getting an unserializable object onto the workers is to create it on the workers instead. You would make your writer or connection object once per partition and use it there.
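
For example, here is a minimal sketch of that pattern, assuming a simple (String, Double) pair DStream written out as Avro GenericRecords; the schema, field names, and directory layout are made up for illustration:

import java.util.UUID

import org.apache.avro.Schema
import org.apache.avro.file.DataFileWriter
import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.streaming.dstream.DStream

// Made-up schema for a (key, value) statistic.
val schemaJson =
  """{"type":"record","name":"Stat","fields":[
    |  {"name":"key","type":"string"},
    |  {"name":"value","type":"double"}
    |]}""".stripMargin

def writeToHdfs(stats: DStream[(String, Double)], baseDir: String): Unit = {
  stats.foreachRDD { (rdd, time) =>
    rdd.foreachPartition { records =>
      // Everything below runs on the worker, so the non-serializable
      // DataFileWriter never has to be shipped from the driver.
      val schema = new Schema.Parser().parse(schemaJson)
      val fs = FileSystem.get(new Configuration())
      val path = new Path(
        s"$baseDir/batch-${time.milliseconds}/part-${UUID.randomUUID()}.avro")
      val writer = new DataFileWriter[GenericRecord](
        new GenericDatumWriter[GenericRecord](schema))
      writer.create(schema, fs.create(path))
      try {
        records.foreach { case (k, v) =>
          val rec = new GenericData.Record(schema)
          rec.put("key", k)
          rec.put("value", Double.box(v))  // box explicitly for the Java API
          writer.append(rec)
        }
      } finally {
        writer.close()  // flushes and closes the underlying HDFS stream
      }
    }
  }
}

Each partition of each batch still produces its own file, so a compaction step like the one above is still useful if the files stay small.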


Spark SQL is distributed as part of CDH. Lots of things can consume from Kafka, and you don't need Spark SQL just to write Parquet files.

Explorer

Thanks!