Small Avro Files

We have a use case for ingesting binary files from a mainframe into HDFS in Avro format. These binary files contain different record types of variable length. The first 4 bytes of each record denote the length of the record. I have written a standalone Java program that ingests the data into HDFS using Avro's DataFileWriter. The files coming from the mainframe are much smaller than an HDFS block, so this process creates many small files.
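For reference, the record framing described above can be sketched roughly as follows. This is a minimal sketch, not the actual ingest program: it assumes the 4-byte length is a big-endian signed integer that counts only the payload (not the prefix itself), and the class and method names (`LengthPrefixedRecords`, `split`) are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a buffer of length-prefixed records into payloads.
final class LengthPrefixedRecords {
    private LengthPrefixedRecords() {}

    static List<byte[]> split(ByteBuffer in) {
        // Work on a duplicate so the caller's position is left untouched.
        // ASSUMPTION: the length prefix is big-endian.
        ByteBuffer buf = in.duplicate().order(ByteOrder.BIG_ENDIAN);
        List<byte[]> records = new ArrayList<>();
        while (buf.remaining() >= Integer.BYTES) {
            int offset = buf.position();
            // ASSUMPTION: the length counts the payload only, not the 4-byte prefix.
            int length = buf.getInt();
            if (length < 0 || length > buf.remaining()) {
                throw new IllegalArgumentException(
                        "Bad record length " + length + " at offset " + offset);
            }
            byte[] payload = new byte[length];
            buf.get(payload);
            records.add(payload);
        }
        if (buf.hasRemaining()) {
            // Fewer than 4 bytes left: the file ends in a truncated prefix.
            throw new IllegalArgumentException(
                    "Trailing bytes after last record: " + buf.remaining());
        }
        return records;
    }
}
```

Each returned payload would then be decoded according to its record type and appended to the Avro file.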

Some of the options we came up with to avoid this are:


1. Convert the batch process into more of a service that runs behind the scenes, so the Avro DataFileWriter can keep running and flush the data at a certain interval (time or size). I do not see a default implementation for this right now.

2. Write the data into a temporary HDFS location, merge the files every hour or so, and move the merged files to the final HDFS destination. We can afford a latency of an hour before the data is made available to consumers.


3. Make use of Avro's append functionality.
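For option 1, the time/size decision itself is small enough to write by hand. Below is a minimal sketch of such a policy; `RollPolicy` and its methods are hypothetical names, not part of Avro, and the thresholds are whatever the service chooses. The idea is that the service would close the current DataFileWriter and open a new file whenever `shouldRoll` returns true.

```java
// Hypothetical roll policy: decides when the long-running writer should
// close the current Avro file and start a new one.
final class RollPolicy {
    private final long maxBytes;      // roll once this many bytes were written
    private final long maxAgeMillis;  // or once the file has been open this long
    private long openedAtMillis;
    private long bytesWritten;

    RollPolicy(long maxBytes, long maxAgeMillis, long nowMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
        this.openedAtMillis = nowMillis;
    }

    // Call after each record is appended to the current file.
    void recordWritten(int sizeInBytes) {
        bytesWritten += sizeInBytes;
    }

    // True when the size limit is reached, or when the age limit is reached
    // and the file is not empty (an empty file is never worth rolling).
    boolean shouldRoll(long nowMillis) {
        boolean tooBig = bytesWritten >= maxBytes;
        boolean tooOld = bytesWritten > 0
                && nowMillis - openedAtMillis >= maxAgeMillis;
        return tooBig || tooOld;
    }

    // Call after the current file was closed and a new one was opened.
    void rolled(long nowMillis) {
        openedAtMillis = nowMillis;
        bytesWritten = 0;
    }
}
```

Passing the clock value in as a parameter, instead of calling `System.currentTimeMillis()` inside the class, keeps the policy easy to test.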


Why not compact the historical data? For example, merge each day's files into a single file once the data is more than 14 days old.


A compaction job that runs daily and compacts the data older than two weeks would do this.


This way you can make sure you are not impacting data freshness.
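The selection step of such a job could look roughly like this. It is only a sketch under assumptions: `CompactionPlanner` is a made-up name, and the file-to-date mapping is assumed to be derived elsewhere (for example from the directory layout or file names). It only picks which files to merge per day; the actual merge of the Avro files is left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical planner: groups files older than the cutoff by day.
final class CompactionPlanner {
    private CompactionPlanner() {}

    static SortedMap<LocalDate, List<String>> plan(
            Map<String, LocalDate> fileDates, LocalDate today, int keepDays) {
        // Files dated on or after the cutoff are left alone, so recent data
        // stays available exactly as it was written.
        LocalDate cutoff = today.minusDays(keepDays);
        SortedMap<LocalDate, List<String>> byDay = new TreeMap<>();
        for (Map.Entry<String, LocalDate> e : fileDates.entrySet()) {
            if (e.getValue().isBefore(cutoff)) {
                byDay.computeIfAbsent(e.getValue(), d -> new ArrayList<>())
                     .add(e.getKey());
            }
        }
        // A day with a single file is already compact; skip it.
        byDay.values().removeIf(files -> files.size() < 2);
        for (List<String> files : byDay.values()) {
            Collections.sort(files); // deterministic merge order
        }
        return byDay;
    }
}
```

With `keepDays` set to 14 and the job scheduled daily, each run would pick up the day that has just crossed the two-week mark.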