Created on 08-13-2014 08:22 AM - edited 09-16-2022 02:04 AM
Hi,
I am using Spark Streaming to write data into HDFS using the below program.
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

val ssc8 = new StreamingContext(sc, Seconds(60))
val lines8 = ssc8.socketTextStream("IP", 19900)  // host and port need to be changed
lines8.saveAsTextFiles("hdfs://nameservice1/user/sample4")  // path needs to be changed
lines8.print()
ssc8.start()
ssc8.awaitTermination()
The output gets written to HDFS as a directory per batch interval, but inside each directory there are a huge number of part files (mine is a cluster with 30+ Spark nodes).
Please let me know about the below. Thanks!
1. I want to control the number of files created in the HDFS directory. Is there a way to do this (for example, based on the number of events or the size of the data), as we do in Flume?
2. The files started getting created in HDFS only after I killed the program. Is there a way to write the files as soon as the streams are read?
Below is a sample of the files created.
drwx--x---   - USER1 USER1    0 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/_temporary
-rw-------   3 USER1 USER1  706 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00000
-rw-------   3 USER1 USER1 1412 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00001
-rw-------   3 USER1 USER1 1412 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00002
-rw-------   3 USER1 USER1 1418 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00003
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00004
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00005
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00006
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00007
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00008
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00009
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00010
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00011
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00012
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00013
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00014
-rw-------   3 USER1 USER1 1424 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00015
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00016
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00017
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00018
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00019
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00020
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00021
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00022
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00023
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00024
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00025
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00026
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00027
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00028
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00029
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00030
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00031
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00032
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00033
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00034
-rw-------   3 USER1 USER1 1624 2014-08-13 06:54 /user/USER1/firststream1-1407927240000/part-00035
Created 08-13-2014 08:26 AM
You should probably decrease the number of partitions. Fewer partitions means fewer workers but evidently you need nowhere near 30 workers to keep up. This reduces the number of files per interval. You can use a longer interval too. Finally, you can post-process these files to do something else with them, including combining them and deleting the originals if desired.
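A minimal sketch of the suggestions above, assuming the same socket source and output path as the question (the host, port, and path are placeholders to be adapted). `DStream.repartition(1)` coalesces each batch down to a single partition before saving, and a longer batch interval produces fewer output directories. Note this requires a Spark cluster and a live socket source to actually run:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FewerPartFiles")
// Longer batch interval (5 minutes) -> fewer output directories overall.
val ssc = new StreamingContext(conf, Seconds(300))

// Placeholder host and port, as in the original question.
val lines = ssc.socketTextStream("IP", 19900)

// Coalesce each batch RDD to one partition so every interval
// writes a single part file instead of one file per task.
lines.repartition(1).saveAsTextFiles("hdfs://nameservice1/user/sample4")

ssc.start()
ssc.awaitTermination()
```

Using `repartition(1)` serializes the write through a single task, which is fine when one worker can keep up with the stream (as seems to be the case here); for higher volumes, a small number greater than 1 trades file count against write parallelism.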
Created 08-13-2014 11:05 PM
Thanks for the solution. I will try the available options and report back.