
Best way to dump all data from Spark Streaming job into HDFS


Explorer

I have a job that calculates statistics over a short rolling window, and I would also like to dump all of the raw data into HDFS. I have come to learn that HDFS does not support appends. Setting my Spark app to create a new directory and write a new file for every RDD is not viable. After searching around I found Avro's DataFileWriter, which looks like it would work, but according to the Spark user group message referenced below that object won't serialize, so it can't be shipped out to the worker nodes. I have read that Spark SQL can consume from Kafka and then write to a Parquet file, which seems like it would solve my problem, but Cloudera does not include Spark SQL.

 

Would it be out of the question to try to get SparkSQL and have it write to my CDH HDFS?

I don't think I would be able to hook those two up.

 

Does anyone know of possible solutions to the problem I have?

 

http://apache-spark-user-list.1001560.n3.nabble.com/Persisting-Avro-files-from-Spark-streaming-td109...

 

 

1 ACCEPTED SOLUTION


Re: Best way to dump all data from Spark Streaming job into HDFS

Master Collaborator

Yeah, because it makes lots of small files? One option is to have a post-processing job that getmerges the files together.
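For the post-processing step, a minimal sketch of what a getmerge pass might look like (the HDFS and local paths here are hypothetical placeholders, and this assumes a live cluster):

```shell
# Merge the many small part-files Spark wrote under one HDFS output
# directory into a single local file, then push the merged file back.
hdfs dfs -getmerge /user/me/stream-output /tmp/merged-output
hdfs dfs -put /tmp/merged-output /user/me/merged/output
```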

 

The general answer to getting an unserializable object to the workers is to create them on the workers instead. You would make your writer or connection object once per partition and do something with it. 
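To make the per-partition idea concrete, here is a minimal sketch in plain Python, with lists of lists standing in for RDD partitions so it runs without a Spark cluster. In a real Spark Streaming job the same shape would live inside `dstream.foreachRDD { rdd => rdd.foreachPartition { ... } }`; the writer here is a stand-in for an unserializable object like Avro's DataFileWriter.

```python
def write_partition(records):
    """Create the (unserializable) writer HERE, once per partition on the
    worker, then push every record in the partition through it."""
    lines = []                      # stands in for a real file/HDFS writer
    for record in records:
        lines.append(str(record))   # writer.append(record) in the real job
    return lines                    # writer.close() would go here

def foreach_partition(partitions, fn):
    # Mimics rdd.foreachPartition: fn runs once per partition, so any
    # object fn builds never has to be serialized from the driver.
    return [fn(part) for part in partitions]

# Hypothetical micro-batch split into two partitions.
partitions = [[1, 2], [3, 4, 5]]
out = foreach_partition(partitions, write_partition)
```

Because the writer is constructed inside the function that runs on the worker, Spark never tries to serialize it from the driver, which sidesteps the `NotSerializableException` the linked thread describes.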

 

Spark SQL is distributed as part of CDH. Lots of things can consume from Kafka, and you don't need Spark SQL just to write Parquet files.


Re: Best way to dump all data from Spark Streaming job into HDFS

Explorer

Thanks!