Champion Alumni
Posts: 161
Registered: 02-11-2014

Small Avro Files

Hello,

 

We have a use case for ingesting binary files from a mainframe to HDFS in Avro format. These binary files contain different record types of variable length; the first 4 bytes of each record denote its length. I have written a standalone Java program that ingests the data to HDFS using Avro's DataFileWriter. However, the files coming from the mainframe are much smaller than an HDFS block, so the process creates many small files.
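For reference, the framing you describe can be split with plain Java before each record is handed to the DataFileWriter. A minimal sketch, assuming the 4-byte prefix is big-endian (typical for mainframe data, but worth verifying) and counts only the payload bytes, not the prefix itself (the class and method names are mine, not from your program):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class RecordSplitter {
    // Splits a length-prefixed byte stream into individual records.
    // Assumption: 4-byte big-endian length prefix covering the payload only.
    public static List<byte[]> split(byte[] data) throws IOException {
        List<byte[]> records = new ArrayList<>();
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        while (true) {
            int length;
            try {
                length = in.readInt();   // reads the 4-byte big-endian prefix
            } catch (EOFException eof) {
                break;                   // clean end of stream
            }
            byte[] record = new byte[length];
            in.readFully(record);        // throws EOFException if truncated
            records.add(record);
        }
        return records;
    }
}
```

Each returned byte[] would then be deserialized into your record type and appended via DataFileWriter.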

Some of the options we came up with to avoid this are:

 

1. Convert the batch process into more of a service that runs in the background, so the Avro DataFileWriter can keep running and flush the data at a certain interval (time- or size-based). I do not see a default implementation for this right now.

2. Write the data to a temporary HDFS location, merge the files every hour or so, and move them to the final HDFS destination. We can afford a latency of an hour before the data is made available to consumers.

 

3. Make use of Avro's append functionality.
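On option 3: the Avro Java API does let you reopen an existing container file and append new blocks via DataFileWriter.appendTo(). A rough sketch, assuming a local File (the class and method below are hypothetical; note that appending to a file on HDFS instead requires the appendTo(SeekableInput, OutputStream) overload and a cluster with HDFS append enabled):

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroAppendExample {
    // Reopens an existing Avro container file and appends new records.
    // The schema must match the schema stored in the file's header.
    public static void appendRecords(File existing, Schema schema,
                                     Iterable<GenericRecord> records) throws Exception {
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
        writer.appendTo(existing);       // positions the writer after existing blocks
        for (GenericRecord r : records) {
            writer.append(r);
        }
        writer.close();
    }
}
```

This keeps one growing file per stream instead of one file per batch, at the cost of having to serialize appends to the same file.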

 

Appreciate  your help!

Expert Contributor
Posts: 318
Registered: 01-25-2017

Re: Small Avro Files

[ Edited ]

Why not compact the historical data? For example, compact each day's files into a single file once the data is 14 days old.

 

A compaction job could run daily and compact the data that is more than two weeks old.

 

This way you can make sure you are not impacting data freshness.
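The compaction step suggested above could look roughly like this in Java, assuming all the small files share one schema (the AvroCompactor class and its method are made up for illustration; a real job would also handle the rename/swap into the final location):

```java
import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroCompactor {
    // Merges many small Avro files into a single larger file.
    // Assumption: every input file was written with the same schema.
    public static void compact(File[] smallFiles, File merged) throws Exception {
        // Take the schema from the first input file's header.
        DataFileReader<GenericRecord> first =
                new DataFileReader<>(smallFiles[0], new GenericDatumReader<GenericRecord>());
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(first.getSchema()));
        writer.create(first.getSchema(), merged);
        first.close();

        for (File f : smallFiles) {
            DataFileReader<GenericRecord> reader =
                    new DataFileReader<>(f, new GenericDatumReader<GenericRecord>());
            for (GenericRecord record : reader) {
                writer.append(record);   // re-serializes each record into the merged file
            }
            reader.close();
        }
        writer.close();
    }
}
```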
