Contributor
Posts: 30
Registered: 10-25-2013

Many small files in HDFS using Spark streaming

Hi, we are ingesting HL7 messages into Kafka and HDFS via micro-batches (Spark Streaming). The Spark Streaming jobs create thousands of very small files in HDFS (a few KB each) for every batch interval, which is driving our block count way up.

 

We were using Flume, where we could set the "rollSize" to 256 MB, which matches our block size. With micro-batches, is there a way to keep a small batch interval and still write larger files to HDFS? I guess we could run a "rollup" job periodically, but I was looking for other opinions.
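To make the "rollup" idea concrete, here is a minimal sketch of how a periodic compaction job might plan its work: greedily bin-pack the small files into groups of roughly one HDFS block (256 MB), then merge each group into a single output file. The function name and the (name, size) input format are hypothetical, not from any Spark or HDFS API; listing the files and doing the actual merge would be separate steps.

```python
# Hypothetical planning step for a periodic "rollup"/compaction job.
# Input: (filename, size_in_bytes) pairs for the small files in a directory.
# Output: groups of filenames, each group totaling at most `target` bytes,
# intended to be merged into one large file per group.

TARGET_BYTES = 256 * 1024 * 1024  # the 256 MB block size mentioned above

def plan_compaction(files, target=TARGET_BYTES):
    """Greedily bin-pack files (largest first) into groups no bigger
    than `target` bytes; each group becomes one merged output file."""
    groups = []
    current, current_size = [], 0
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

The same idea also works inside the streaming job itself: calling `coalesce` (or `repartition`) on each micro-batch before writing reduces the number of output files per interval, though with a short batch interval the files may still fall well short of a full block, so a separate compaction pass like the one sketched here is often still needed.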

 

Thank you kindly,

Mike
