
Control the number of files created from Spark Streaming



Explorer

 

Hi,

 

I am using Spark Streaming to write data into HDFS with the program below.

 

import org.apache.spark.streaming._

// Streaming context with a 60-second batch interval
val ssc8 = new StreamingContext(sc, Seconds(60))

// Replace "IP" and the port with your source host and port
val lines8 = ssc8.socketTextStream("IP", 19900)

// Replace with your target HDFS path prefix
lines8.saveAsTextFiles("hdfs://nameservice1/user/sample4")

lines8.print()
ssc8.start()
ssc8.awaitTermination()

 

 

The output gets written to HDFS as a directory, but inside that directory there are a huge number of part files (my cluster has 30+ Spark nodes).

 

Please let me know about the below. Thanks!

1. I want to control the number of files created in the HDFS directory. Is there a way to do this (for example, based on the number of events or the size of the data), as we can in Flume?

2. The files started appearing in HDFS only after I killed the program. Is there a way to write the files as soon as the streams are read?

 

Below is a sample of the files created.

 

 

drwx--x---   - USER1 USER1          0 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/_temporary
-rw-------   3 USER1 USER1        706 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00000
-rw-------   3 USER1 USER1       1412 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00001
-rw-------   3 USER1 USER1       1412 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00002
-rw-------   3 USER1 USER1       1418 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00003
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00004
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00005
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00006
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00007
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00008
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00009
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00010
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00011
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00012
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00013
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00014
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00015
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00016
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00017
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00018
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00019
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00020
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00021
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00022
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00023
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00024
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00025
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00026
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00027
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00028
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00029
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00030
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00031
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00032
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00033
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00034
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00035

 

 

 

 

1 ACCEPTED SOLUTION

Re: Control the number of files created from Spark Streaming

Master Collaborator

You should probably decrease the number of partitions. Fewer partitions means fewer workers but evidently you need nowhere near 30 workers to keep up. This reduces the number of files per interval. You can use a longer interval too. Finally, you can post-process these files to do something else with them, including combining them and deleting the originals if desired.
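A minimal sketch of the first two suggestions, reusing `lines8`, `sc`, and the output path from the question above (the partition count of 1 and the 300-second interval are only illustrative values, not recommendations for this cluster):

```scala
// Option 1: shrink each batch to a fixed number of partitions before
// saving, so every interval writes that many part files instead of 30+.
lines8.repartition(1).saveAsTextFiles("hdfs://nameservice1/user/sample4")

// Option 2 (more control): coalesce each batch's RDD to one partition
// and write it under a per-batch path via foreachRDD.
lines8.foreachRDD { (rdd, time) =>
  rdd.coalesce(1)
     .saveAsTextFile(s"hdfs://nameservice1/user/sample4-${time.milliseconds}")
}

// A longer batch interval also means fewer (larger) files overall:
// val ssc8 = new StreamingContext(sc, Seconds(300))
```

Note that coalescing to one partition funnels each batch through a single task, so it only makes sense when the per-batch volume is small; with more data, a modest value such as `repartition(4)` is a safer starting point.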

2 REPLIES


Re: Control the number of files created from Spark Streaming

Explorer

Thanks for the solution. I will try the available options and report back.