
Many small files in HDFS using Spark streaming

Hi, we are ingesting HL7 messages into Kafka and HDFS via micro-batches (Spark Streaming). The Spark Streaming jobs are creating thousands of very small files in HDFS (each only a few KB) every batch interval, which is driving our block count way up.

 

We were previously using Flume, where we could set the "rollSize" to 256 MB, which matches our HDFS block size. With micro-batches, is there a way to keep a small batch interval and still write larger files to HDFS? I guess we could run a periodic "rollup" (compaction) job, but I was looking for other opinions.
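One partial mitigation I've seen sketched is reducing the number of output files per batch by coalescing each micro-batch's RDD to a single partition before writing, so each batch interval produces one file instead of one per task. A rough sketch (the `messages` DStream and output path are illustrative, not from our actual job):

```scala
// Hedged sketch: assumes `messages` is a DStream[String] of HL7 payloads.
// coalesce(1) means one partition -> one output file per batch interval,
// at the cost of funneling the batch's write through a single task.
messages.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {
    rdd.coalesce(1)
       .saveAsTextFile(s"/data/hl7/batch-${time.milliseconds}")
  }
}
```

This caps files-per-interval at one, but with a short batch interval the files are still far smaller than a 256 MB block, so some form of periodic compaction would likely still be needed.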

 

Thank you kindly,

Mike

1 Reply

Re: Many small files in HDFS using Spark streaming
