02-09-2016 08:31 AM
Hi, we are ingesting HL7 messages into Kafka and HDFS in micro-batches using Spark Streaming. The Spark Streaming jobs create thousands of very small files in HDFS (each only kilobytes in size) in every batch interval, which is driving our block count way up.
We were previously using Flume, where we could set "rollSize" to 256 MB, which is our block size. With micro-batches, is there a way to keep a short batch interval and still write larger files to HDFS? I guess we could run a periodic "rollup" job to merge the small files, but I was just looking for other opinions.
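To make the "rollup" idea a bit more concrete, here is a rough sketch of the planning step I have in mind: given the small files and their sizes, group them into batches of at most one block each. This is only an illustration under my own assumptions. The names `plan_rollup` and `target_file_count` and the `part-…` file names are made up for this post, and the actual listing and merging of files in HDFS is left out.

```python
from math import ceil

# 256 MB, matching the HDFS block size mentioned above.
BLOCK_SIZE = 256 * 1024 * 1024


def plan_rollup(file_sizes, target_bytes=BLOCK_SIZE):
    """Greedily group (path, size_in_bytes) pairs into batches of at most target_bytes.

    Files are taken in the order given. A single file larger than
    target_bytes ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in file_sizes:
        # Close the current batch if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches


def target_file_count(total_bytes, target_bytes=BLOCK_SIZE):
    """Number of output files needed so that each is at most target_bytes."""
    return max(1, ceil(total_bytes / target_bytes))


if __name__ == "__main__":
    # Hypothetical example: 5000 small files of 100 KB each.
    files = [(f"part-{i:05d}", 100 * 1024) for i in range(5000)]
    batches = plan_rollup(files)
    print(len(batches), "merged files instead of", len(files))
```

Each batch would then be merged into one larger file by the periodic job. Alternatively, I assume the value from `target_file_count` could be used as the partition count for something like Spark's `coalesce` or `repartition` before writing, but I have not tried that.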
Thank you kindly,