We have a use case for ingesting binary files from a mainframe into HDFS in Avro format. These binary files contain different record types of variable length; the first 4 bytes of each record denote its length. I have written a standalone Java program that ingests the data into HDFS using the Avro DataFileWriter. The files from the mainframe are small (under one HDFS block), so each ingest creates a small file in HDFS.
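For readers following along, the record framing described above can be sketched roughly as below. This is only an illustration, not the actual ingest program: the class name LengthPrefixedReader is made up, and it assumes the 4-byte length is a big-endian integer that counts only the payload bytes after the prefix. If the length on your files includes the prefix itself, or uses another encoding, the arithmetic has to change.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Avro or Hadoop.
final class LengthPrefixedReader {
    private LengthPrefixedReader() {}

    /**
     * Reads every record from the stream.
     * Assumption: each record is a 4-byte big-endian length followed by
     * exactly that many payload bytes (the length excludes the prefix).
     */
    static List<byte[]> readAll(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<byte[]> records = new ArrayList<>();
        while (true) {
            int b1 = data.read();
            if (b1 < 0) {
                break; // clean end of file on a record boundary
            }
            int b2 = data.read(), b3 = data.read(), b4 = data.read();
            if ((b2 | b3 | b4) < 0) {
                throw new EOFException("truncated length prefix");
            }
            int length = (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
            if (length < 0) {
                throw new IOException("negative record length: " + length);
            }
            byte[] payload = new byte[length];
            data.readFully(payload); // throws EOFException if the payload is cut short
            records.add(payload);
        }
        return records;
    }
}
```

Each returned byte array would then be decoded according to its record type and appended to the Avro writer.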
Some of the options we came up with to avoid this are:
1. Convert the batch process into a long-running service that runs in the background, so the Avro DataFileWriter can stay open and flush the data at a certain interval (time or size). I do not see a default implementation for this right now.
2. Write the data to a temporary HDFS location, merge the files every hour or so, and move the merged files to the final HDFS destination. We can afford a latency of an hour before the data is made available to consumers.
3. Make use of Avro's append functionality.
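The time/size trigger in option 1 (and the "every hour or so" merge in option 2) comes down to a small decision rule, which might look something like the sketch below. RollPolicy and its parameters are hypothetical names, and the thresholds are placeholders to tune (for example, something close to the HDFS block size and the one-hour latency budget). Closing the current Avro file and opening the next one is not shown here.

```java
// Hypothetical policy object: decides when the currently open output file
// should be closed and a new one started.
final class RollPolicy {
    private final long maxBytes;
    private final long maxAgeMillis;

    RollPolicy(long maxBytes, long maxAgeMillis) {
        if (maxBytes <= 0 || maxAgeMillis <= 0) {
            throw new IllegalArgumentException("thresholds must be positive");
        }
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
    }

    /**
     * @param bytesWritten   bytes written to the current file so far
     * @param openedAtMillis time the current file was opened
     * @param nowMillis      current time, same clock as openedAtMillis
     * @return true once the file is big enough or old enough to publish
     */
    boolean shouldRoll(long bytesWritten, long openedAtMillis, long nowMillis) {
        if (bytesWritten == 0) {
            return false; // never publish an empty file
        }
        boolean bigEnough = bytesWritten >= maxBytes;
        boolean oldEnough = nowMillis - openedAtMillis >= maxAgeMillis;
        return bigEnough || oldEnough;
    }
}
```

A service loop would call shouldRoll after each incoming mainframe file is appended, and on a timer tick so that a quiet hour still publishes whatever has accumulated.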
Appreciate your help!
Why not compact the historical data? For example, compact the daily files into one file for everything older than now minus 14 days.
A compaction job can run daily and compact the data that is more than two weeks old.
This way you can be sure you are not impacting data freshness.
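As a rough sketch of how the daily job could pick what to compact: the snippet below assumes the data is laid out in one directory per day and that the job already has the list of day partitions. CompactionPlanner and daysToCompact are illustrative names, and merging the Avro files for each selected day is left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical planner: selects which daily partitions are old enough to compact.
final class CompactionPlanner {
    private CompactionPlanner() {}

    /**
     * Returns the days strictly older than (today - retentionDays), oldest first.
     * Days inside the retention window are left alone so fresh data is untouched.
     */
    static List<LocalDate> daysToCompact(List<LocalDate> partitionDays,
                                         LocalDate today,
                                         int retentionDays) {
        LocalDate cutoff = today.minusDays(retentionDays);
        List<LocalDate> result = new ArrayList<>();
        for (LocalDate day : partitionDays) {
            if (day.isBefore(cutoff)) {
                result.add(day);
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

With retentionDays set to 14, a job running on 2024-03-15 would pick up partitions dated before 2024-03-01 and skip the rest. The job would also need to remember which days it has already compacted so it does not redo them on every run.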