Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

Is it possible to write Spark Streaming output to a single file in HDFS? Spark Streaming gets the logs from Kafka topics.

@Vijay Kumar J, any ideas? Thanks in advance.

@Greg Keys, any ideas? Thanks in advance.

Guru

I suggest looking at the merge and saveAsTextFile functions, as described in the bottom post here: http://stackoverflow.com/questions/31666361/process-spark-streaming-rdd-and-store-to-single-hdfs-fil...
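
To make the idea concrete, here is a minimal sketch of the "save each batch, then merge" pattern. It is only an illustration under stated assumptions: the local filesystem stands in for HDFS, and `write_batch` and `merge_batch` are hypothetical helper names, not Spark or Hadoop APIs. In a real job, the first step would be saveAsTextFile on each batch's RDD, and the second would be a Hadoop-side merge (for example `FileUtil.copyMerge` in Hadoop 2.x, which as far as I know was removed in Hadoop 3).

```python
import shutil
from pathlib import Path


def write_batch(batch_dir: Path, partitions) -> None:
    """Stand-in for saveAsTextFile: one part-NNNNN file per partition.

    `partitions` is a list of lists of strings (hypothetical input shape).
    """
    batch_dir.mkdir(parents=True, exist_ok=False)
    for i, lines in enumerate(partitions):
        part = batch_dir / f"part-{i:05d}"
        part.write_text("".join(line + "\n" for line in lines))
    (batch_dir / "_SUCCESS").touch()  # marker file, as Hadoop output committers leave


def merge_batch(batch_dir: Path, target: Path) -> None:
    """Append all part files of one batch to a single target file, in order."""
    with target.open("ab") as out:
        for part in sorted(batch_dir.glob("part-*")):
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)  # streams bytes, no full load in memory
    shutil.rmtree(batch_dir)  # remove the per-batch directory once merged
```

Calling `write_batch` and then `merge_batch` once per micro-batch grows one target file over time. Note that the merge step is sequential, so it can become the bottleneck as batch sizes grow.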

Hi Greg Keys, thanks for the reply. I was using a similar approach, but I am wondering whether it still works when Spark Streaming is processing gigabytes of data?

Guru

That is really an issue of scaling (how many nodes you have and how much memory per node) and multitenancy (which other jobs will run at the same time, particularly Spark or other memory-intensive jobs). The more nodes and the less memory contention, the more data you can process in Spark.

I am working on a 12-node cluster: 4 nodes with 126 GB of memory each and 8 nodes with 252 GB each.

Guru

What is the largest load (MBs or GBs) you have run your use case on?