
Avro Small Files Problem



We have a use case for ingesting binary files from a mainframe into HDFS in Avro format. These binary files contain different record types that are variable in length; the first 4 bytes of each record denote its length. I have written a standalone Java program that ingests the data into HDFS using the Avro DataFileWriter. The files from the mainframe are small (under one HDFS block in size), and so far I have been creating one output Avro file per input file. What is the best option in this case from a performance and maintainability standpoint? Could we keep just one file per record type and append to it? That file would become very large over time. Another option would be to have one Avro file per day. Please let me know which approach would be best suited.
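
For context, here is a minimal sketch of the record framing described above, using only the Java standard library. It is not the actual ingest program: the class name LengthPrefixedReader is made up for illustration, and it assumes the 4-byte prefix is a big-endian integer that counts only the payload bytes following it. If the length field includes the prefix itself or uses a different layout, the arithmetic would need adjusting.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a stream into length-prefixed records.
public class LengthPrefixedReader {

    // Assumption: each record starts with a 4-byte big-endian length
    // that counts only the payload bytes that follow the prefix.
    public static List<byte[]> readRecords(InputStream in) throws IOException {
        List<byte[]> records = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        while (true) {
            int length;
            try {
                length = data.readInt(); // reads 4 bytes, big-endian
            } catch (EOFException e) {
                // End of input; a truncated prefix also ends the loop here.
                break;
            }
            if (length < 0) {
                throw new IOException("Negative record length: " + length);
            }
            byte[] payload = new byte[length];
            data.readFully(payload); // throws EOFException on a short record
            records.add(payload);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Two sample records: "hi" (2 bytes) and "abc" (3 bytes).
        byte[] sample = {0, 0, 0, 2, 'h', 'i', 0, 0, 0, 3, 'a', 'b', 'c'};
        List<byte[]> records = readRecords(new ByteArrayInputStream(sample));
        for (byte[] record : records) {
            System.out.println(record.length + " bytes: " + new String(record, "US-ASCII"));
        }
    }
}
```

Each byte[] returned here is what would then be mapped to an Avro record and passed to the DataFileWriter, whichever file layout (per input file, per record type, or per day) ends up being used.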



