Explorer
Posts: 62
Registered: ‎01-22-2014
Accepted Solution

Control the number of files created from Spark Streaming

Hi,

I am using Spark Streaming to write data into HDFS with the program below.

import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val ssc8 = new StreamingContext(sc, Seconds(60))
val lines8 = ssc8.socketTextStream("IP", 19900)  // IP and port need to be changed

lines8.saveAsTextFiles("hdfs://nameservice1/user/sample4")  // path needs to be changed

lines8.print()
ssc8.start()
ssc8.awaitTermination()

The output is written to HDFS as one directory per batch interval, but inside each directory there is a huge number of part files (my cluster has 30+ Spark nodes).

Please let me know about the below. Thanks!

1. I want to control the number of files created in the HDFS directory. Is there a way to do it (e.g., based on the number of events or the size of the data), as we do in Flume?

2. The files only started appearing in HDFS after I killed the program. Is there a way to write the files as soon as the streams are read?

Below is a sample of the files created.

drwx--x---   - USER1 USER1          0 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/_temporary
-rw-------   3 USER1 USER1        706 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00000
-rw-------   3 USER1 USER1       1412 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00001
-rw-------   3 USER1 USER1       1412 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00002
-rw-------   3 USER1 USER1       1418 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00003
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00004
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00005
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00006
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00007
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00008
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00009
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00010
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00011
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00012
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00013
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00014
-rw-------   3 USER1 USER1       1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00015
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00016
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00017
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00018
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00019
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00020
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00021
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00022
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00023
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00024
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00025
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00026
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00027
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00028
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00029
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00030
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00031
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00032
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00033
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00034
-rw-------   3 USER1 USER1       1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00035


Cloudera Employee
Posts: 366
Registered: ‎07-29-2013

Re: Control the number of files created from Spark Streaming

You should probably decrease the number of partitions. Fewer partitions means fewer workers, but evidently you need nowhere near 30 workers to keep up, and this directly reduces the number of files written per interval. You can also use a longer batch interval. Finally, you can post-process these files to do something else with them, including combining them and deleting the originals if desired.
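As a rough sketch of the first two suggestions, the stream from the question could be repartitioned before saving. The names and the output path are taken from the question; `repartition(1)` and the 300-second interval are illustrative choices, not tuned recommendations:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkContext `sc`, as in the question.
// A longer batch interval (here 300s instead of 60s) means fewer output directories.
val ssc = new StreamingContext(sc, Seconds(300))
val lines = ssc.socketTextStream("IP", 19900)

// Collapse each batch's RDD to a single partition before writing,
// so each interval directory contains one part file instead of 30+.
lines.repartition(1).saveAsTextFiles("hdfs://nameservice1/user/sample4")

ssc.start()
ssc.awaitTermination()
```

Note that `repartition(1)` funnels each batch through a single task, which is fine for small batches but can become a bottleneck for large ones; a small number greater than 1 may be a better trade-off.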

Explorer
Posts: 62
Registered: ‎01-22-2014

Re: Control the number of files created from Spark Streaming

Thanks for the solution. I will try the available options and report back.