Support Questions

Nishan · ‎11-20-2017

Hello,

We have a use case for ingesting binary files from a mainframe into HDFS in Avro format. These binary files contain different record types that are variable in length; the first 4 bytes of each record denote its length. I have written a standalone Java program that ingests the data into HDFS using the Avro DataFileWriter. However, the files from the mainframe are small (under one HDFS block), so the ingest creates many small files.
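For readers unfamiliar with this kind of framing, here is a minimal sketch of splitting such a stream into records. It is not the program described above: the class name `LengthPrefixedReader` is hypothetical, and it assumes the 4-byte length is a big-endian integer that counts only the payload, not the prefix itself. Real mainframe layouts may differ on both points.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; the name and the framing rules are assumptions.
class LengthPrefixedReader {

    // Reads records framed as [4-byte big-endian length][payload].
    // Assumes the length counts only the payload bytes.
    static List<byte[]> readAll(InputStream in) throws IOException {
        List<byte[]> records = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        while (true) {
            int length;
            try {
                length = data.readInt(); // DataInput reads ints big-endian
            } catch (EOFException endOfInput) {
                // End of input; a partial prefix is also treated as the end here.
                break;
            }
            if (length < 0) {
                throw new IOException("Negative record length: " + length);
            }
            byte[] payload = new byte[length];
            data.readFully(payload); // throws EOFException if the record is truncated
            records.add(payload);
        }
        return records;
    }
}
```

Each returned payload would then be decoded according to its record type and appended to the Avro writer.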

Some of the options we came up with to avoid this are:

1. Convert the batch process into a long-running service that runs in the background, so the Avro DataFileWriter can stay open and flush the data at a certain interval (time or size). I do not see a default implementation for this right now.

2. Write the data to a temporary HDFS location, merge the files every hour or so, and move the merged files to the final HDFS destination. We can afford a latency of an hour before the data is made available to consumers.

3. Make use of Avro's append functionality.
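To make option 1 more concrete, the time/size decision can be isolated from the writer itself. The sketch below is only an illustration: `RollPolicy` and its thresholds are hypothetical names, not part of Avro or Hadoop. A service would call `recordWrite` after each append and, when `shouldRoll` returns true, close the current writer, open a new file, and call `reset`.

```java
// Hypothetical policy class; names and thresholds are illustrative assumptions.
class RollPolicy {
    private final long maxBytes;      // roll once this many bytes were written
    private final long maxAgeMillis;  // roll once the file has been open this long
    private long bytesWritten;
    private long openedAtMillis;

    RollPolicy(long maxBytes, long maxAgeMillis, long nowMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
        this.openedAtMillis = nowMillis;
    }

    // Call after each record is appended to the current file.
    void recordWrite(long bytes) {
        bytesWritten += bytes;
    }

    // True when the current file should be closed and a new one started.
    boolean shouldRoll(long nowMillis) {
        if (bytesWritten == 0) {
            return false; // nothing written yet, so there is nothing worth closing
        }
        return bytesWritten >= maxBytes
                || nowMillis - openedAtMillis >= maxAgeMillis;
    }

    // Call after a new file has been opened.
    void reset(long nowMillis) {
        bytesWritten = 0;
        openedAtMillis = nowMillis;
    }
}
```

Passing the clock in as a parameter keeps the policy easy to test; in a real service the caller would supply `System.currentTimeMillis()`.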

Appreciate your help!

Fawze · ‎12-29-2017

Why not compact the historical data? For example, compact each day's files into one file for data older than 14 days (now-14days).

That is, a compaction job that runs daily and compacts the data that is more than two weeks old.

This way you can make sure you are not impacting data freshness.
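As a rough sketch of how such a job might choose its inputs, the snippet below groups files by day and keeps only days older than a cutoff. The class `CompactionPlanner` and the idea of passing in a file-to-date map are assumptions made for illustration; listing HDFS and merging each group into one Avro file are left out.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical planner; the name and the input shape are assumptions.
class CompactionPlanner {

    // Groups file names by date, keeping only days strictly older than
    // (today - minAgeDays). Each group is a candidate for one merged file.
    static Map<LocalDate, List<String>> plan(Map<String, LocalDate> fileDates,
                                             LocalDate today,
                                             int minAgeDays) {
        LocalDate cutoff = today.minusDays(minAgeDays);
        Map<LocalDate, List<String>> byDay = new TreeMap<>();
        for (Map.Entry<String, LocalDate> entry : fileDates.entrySet()) {
            LocalDate day = entry.getValue();
            if (day.isBefore(cutoff)) {
                byDay.computeIfAbsent(day, d -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return byDay;
    }
}
```

With `minAgeDays` set to 14, files from the last two weeks stay untouched, so recent data remains available exactly as it was ingested.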
