Small Avro Files
Labels: HDFS
Created on 11-20-2017 01:48 PM - edited 09-16-2022 05:32 AM
Hello,
We have a use case for ingesting binary files from a mainframe into HDFS in Avro format. These binary files contain different record types of variable length; the first 4 bytes of each record denote its length. I have written a standalone Java program that ingests the data into HDFS using Avro's DataFileWriter. The files coming from the mainframe are small (under one HDFS block), so the ingest creates many small files.
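For readers following along, here is a minimal sketch of how such length-prefixed records might be split before they are handed to the Avro writer. It is not the original program. The class and method names are hypothetical, and it assumes the 4-byte prefix is a big-endian signed integer that counts only the payload bytes (not the prefix itself). Adjust both assumptions if the mainframe format differs.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a stream into variable-length records.
final class LengthPrefixedRecords {

    // Assumption: each record is [4-byte big-endian length][payload of that length].
    static List<byte[]> readAll(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<byte[]> records = new ArrayList<>();
        byte[] prefix = new byte[4];
        while (true) {
            // Read one byte first so a clean end of stream can be told apart
            // from a truncated length prefix.
            int n = data.read(prefix, 0, 1);
            if (n < 0) {
                return records; // clean end of input
            }
            data.readFully(prefix, 1, 3); // throws EOFException if truncated
            int length = ByteBuffer.wrap(prefix).getInt(); // big-endian by default
            if (length < 0) {
                throw new IOException("Negative record length: " + length);
            }
            byte[] payload = new byte[length];
            data.readFully(payload); // throws EOFException if the record is cut short
            records.add(payload);
        }
    }
}
```

Each returned byte array would then be decoded by record type and appended to the Avro file.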
Some of the options we came up with to avoid this are:
1. Convert the batch process into a long-running service, so the Avro DataFileWriter stays open and flushes data at a certain interval (time or size). I do not see a default implementation for this right now.
2. Write the data to a temporary HDFS location, merge the files every hour or so, and move the merged files to the final HDFS destination. We can afford a latency of an hour before the data is made available to consumers.
3. Make use of Avro's append functionality.
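For option 1, the time/size trigger could be kept separate from the writer itself. The sketch below is only one possible shape for that decision logic. The class name, thresholds, and method names are all hypothetical, and it deliberately does not call any Avro or HDFS API: the surrounding service would close the current file and open a new one whenever `shouldRoll` returns true.

```java
// Hypothetical roll policy: decides when the current output file should be
// closed and a new one started, based on bytes written or file age.
final class RollPolicy {
    private final long maxBytes;      // roll once this many bytes have been written
    private final long maxAgeMillis;  // or once the file has been open this long
    private long openedAtMillis;
    private long bytesWritten;

    RollPolicy(long maxBytes, long maxAgeMillis, long nowMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
        this.openedAtMillis = nowMillis;
    }

    // Call after each record is appended to the current file.
    void recordWrite(long bytes) {
        bytesWritten += bytes;
    }

    // True when the size limit is reached, or when the age limit is reached
    // and at least one record was written (empty files are never rolled).
    boolean shouldRoll(long nowMillis) {
        boolean tooBig = bytesWritten >= maxBytes;
        boolean tooOld = bytesWritten > 0 && nowMillis - openedAtMillis >= maxAgeMillis;
        return tooBig || tooOld;
    }

    // Call after a new file has been opened.
    void reset(long nowMillis) {
        openedAtMillis = nowMillis;
        bytesWritten = 0;
    }
}
```

Passing the clock value in as a parameter, rather than reading it inside the class, keeps the policy easy to test. With a size limit near the HDFS block size and an age limit of one hour, this would match the latency budget mentioned in option 2.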
Appreciate your help!
Created on 12-29-2017 07:35 PM - edited 12-29-2017 07:36 PM
Why not compact the historical data instead? For example, compact the daily files into one file once the data is older than 14 days (now - 14 days).
A compaction job that runs daily would compact only the data older than two weeks.
That way you can be sure you are not impacting data freshness.
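As a rough illustration of how such a job might pick its inputs, the sketch below groups files by day and keeps only the days older than the cutoff. Everything here is hypothetical: the class name, the idea that each file's date is already known (for example from its directory or file name), and the choice of one output file per day. The actual merge of the selected Avro files is left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical planner: selects files that are old enough to compact and
// groups them by day, so each group can be merged into a single file.
final class CompactionPlanner {

    // fileDates maps a file name to the day its data belongs to.
    // Files dated on or after (today - keepRecentDays) are left untouched.
    static Map<LocalDate, List<String>> plan(
            Map<String, LocalDate> fileDates, LocalDate today, int keepRecentDays) {
        LocalDate cutoff = today.minusDays(keepRecentDays);
        Map<LocalDate, List<String>> groups = new TreeMap<>(); // sorted by day
        for (Map.Entry<String, LocalDate> entry : fileDates.entrySet()) {
            if (entry.getValue().isBefore(cutoff)) {
                groups.computeIfAbsent(entry.getValue(), day -> new ArrayList<>())
                      .add(entry.getKey());
            }
        }
        return groups;
    }
}
```

Calling `plan(files, LocalDate.now(), 14)` once a day would return only the days that have aged past two weeks, so recent data stays in its original small files and remains immediately available.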
