Is it possible to write Spark Streaming output to a single file in HDFS, where Spark Streaming gets the logs from Kafka topics?

@Vijay Kumar J, any ideas? Thanks in advance.

@Greg Keys, any ideas? Thanks in advance.

Guru

I suggest looking at the merge and saveAsTextFile functions, as described in the bottom post here: http://stackoverflow.com/questions/31666361/process-spark-streaming-rdd-and-store-to-single-hdfs-fil...
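
As a rough illustration of the saveAsTextFile half of that suggestion, here is a minimal PySpark-style sketch. It assumes `lines` is a DStream of strings that has already been created from the Kafka topics (that setup is not shown), and the function names and the `base_dir` path are hypothetical. Each micro-batch is coalesced to one partition so that it is written as a single part file in its own directory. Merging those per-batch files into one file is a separate step and is not covered here.

```python
from datetime import datetime


def save_batch_to_single_file(time: datetime, rdd, base_dir="hdfs:///tmp/streaming-logs"):
    """Write one micro-batch as a single part file in its own directory.

    `base_dir` is a hypothetical path; adjust it to your cluster.
    Returns the output directory, or None if the batch was empty.
    """
    # Skip empty batches so no empty output directories are created.
    if rdd.isEmpty():
        return None
    # saveAsTextFile fails if the target directory already exists,
    # so each batch gets a directory named after its batch time.
    out_dir = "{}/batch-{}".format(base_dir, time.strftime("%Y%m%d-%H%M%S"))
    # coalesce(1) funnels the whole batch through one task, which yields
    # one part file but also removes write parallelism for that batch.
    rdd.coalesce(1).saveAsTextFile(out_dir)
    return out_dir


def attach_single_file_sink(lines, base_dir="hdfs:///tmp/streaming-logs"):
    """Register the writer on a DStream; PySpark passes (time, rdd) per batch."""
    lines.foreachRDD(lambda time, rdd: save_batch_to_single_file(time, rdd, base_dir))
```

Because `coalesce(1)` pushes every record of a batch through a single task, this pattern is easiest to justify when each batch is small enough for one executor to write comfortably.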

Hi Greg Keys, thanks for the reply. I was using a similar approach, but I am wondering whether it still works when Spark Streaming is processing gigabytes of data?

Guru

That is really a question of scaling (how many nodes you have and how much memory per node) and multitenancy (which other jobs will run at the same time, particularly Spark or other memory-intensive jobs). The more nodes and the less memory contention, the more data you can process in Spark.

I am working on a 12-node cluster: 4 nodes have 126 GB of memory and 8 have 252 GB.
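
For a rough sense of scale, the figures in the post above can be added up as in the sketch below. This counts raw physical memory only. How much of it Spark can actually use depends on the resource manager settings and on other jobs running at the same time, neither of which is given in this thread.

```python
# Node counts and per-node memory as stated in the post above.
node_groups = [(4, 126), (8, 252)]  # (number of nodes, GB of memory per node)

total_nodes = sum(count for count, _ in node_groups)
total_memory_gb = sum(count * gb for count, gb in node_groups)

print(total_nodes, total_memory_gb)
```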

Guru

What is the largest load (MBs or GBs) you have run your use case on?