
Merging Small Avro Files in HDFS



I have a Java batch process that reads binary files from a legacy platform, converts them to Avro, and writes them to HDFS. The source files contain variable-length records and I do not want to land them directly on HDFS, so for now they land on an edge node first. The batch process runs every hour and creates files that are 10 to 15 MB in size. Since many small files like this are not ideal for the NameNode, we run a merge process every 24 hours to combine them.

This Hadoop cluster is not used for analytics, and all the canned reports (Hive queries) access only data from the last 24 hours, which is why we are able to run the merge job after 24 hours.

Since we now plan to bring in more of this data from legacy systems, we need to run the merge process more often (every couple of hours or so instead of every 24 hours). Other processes, such as Hive queries, could be running at that time and reading the files that are candidates for the merge. In that case the merge process should wait for those processes to complete, since the merge involves a mv. Is there a technical solution for how this could be accomplished?
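
To make the current rule concrete, here is a minimal sketch of the age-based selection the 24-hour merge relies on: only files older than the window the reports read are treated as merge candidates. This is not our actual job. The class name `MergeCandidates`, the method `select`, and the file names in the usage note are hypothetical, and it works on an in-memory map of last-modified times rather than on HDFS. It also does nothing about the concurrency problem described above; it only shows why a 24-hour cutoff is safe today and why a 2-hour cutoff would not be.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, for illustration only.
final class MergeCandidates {

    /**
     * Returns the names of files whose last-modified time is at least
     * minAge before now, sorted by name.
     */
    static List<String> select(Map<String, Instant> lastModified,
                               Instant now,
                               Duration minAge) {
        Instant cutoff = now.minus(minAge);
        List<String> picked = new ArrayList<>();
        // TreeMap gives a stable, name-sorted iteration order.
        for (Map.Entry<String, Instant> e : new TreeMap<>(lastModified).entrySet()) {
            // Keep the file only if it was last modified at or before the cutoff.
            if (!e.getValue().isAfter(cutoff)) {
                picked.add(e.getKey());
            }
        }
        return picked;
    }
}
```

For example, at 2024-01-02T12:00:00Z with a 24-hour minimum age, a file last modified at 2024-01-01T09:00:00Z is selected and one modified at 2024-01-02T10:00:00Z is not. Dropping the minimum age to 2 hours selects both, which is exactly the case where a running Hive query could still be reading the second file.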