Avro Small Files Problem

We have a use case for ingesting binary files from a mainframe into HDFS in Avro format. These binary files contain different record types that are variable in length. The first 4 bytes of each record denote the length of the record. I have written a standalone Java program that ingests the data into HDFS using the Avro DataFileWriter. The files from the mainframe are small (each is under one HDFS block in size), and so far I have been creating one output Avro file per input file. What is the best option in this case from a performance and maintainability standpoint? Could we have just one file per record type and keep appending to it? That file would become very large over time. Another option would be to have one Avro file per day. Please let me know which approach would be best suited.
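For reference, here is a minimal sketch of the record framing described above, using only the JDK. It assumes the 4-byte length is a big-endian integer that counts only the payload bytes, not the prefix itself; both points are assumptions and may not match the actual mainframe layout. The class and method names are hypothetical, and the Avro writing step is left out.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class LengthPrefixedReader {

    /**
     * Splits a stream into records, where each record is preceded by a
     * 4-byte length. Assumption: the length is big-endian and excludes
     * the 4 prefix bytes; adjust if the real layout differs.
     */
    public static List<byte[]> readRecords(InputStream in) throws IOException {
        List<byte[]> records = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        while (true) {
            int first = data.read();
            if (first < 0) {
                break; // clean end of input at a record boundary
            }
            // Assemble the remaining three bytes of the length prefix.
            // readUnsignedByte() throws EOFException if the prefix is truncated.
            int length = (first << 24)
                    | (data.readUnsignedByte() << 16)
                    | (data.readUnsignedByte() << 8)
                    | data.readUnsignedByte();
            if (length < 0) {
                throw new IOException("Negative record length: " + length);
            }
            byte[] payload = new byte[length];
            data.readFully(payload); // throws EOFException on a truncated record
            records.add(payload);
        }
        return records;
    }
}
```

Each returned byte array is one record, which could then be routed by record type to whichever output file the chosen layout calls for (one per record type, one per day, and so on).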