09-12-2017 10:33 AM
I have a Java batch process that reads binary files from a legacy platform, converts them to Avro, and writes them to HDFS. The files contain variable-length records, and I do not want to land them directly on HDFS, so for now they land on an edge node. The batch process runs every hour and creates files that are 10 to 15 MB in size. Since files this small are not ideal for the NameNode, we run a merge process every 24 hours to merge them.

This Hadoop cluster is not used for analytics, and all the canned reports (Hive queries) access only data from the last 24 hours, which is why we are able to run the merge job after 24 hours. Since we now plan to bring in more of this data from legacy systems, we need to run the merge process more often: every couple of hours or so instead of every 24 hours.

Other processes, such as Hive queries, could be running on the system at that time and accessing files that are candidates for the merge. In that case the merge process should wait for those processes to complete, since the merge involves a mv. Is there a technical solution for accomplishing this?
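For reference, the age-based rule behind the current merge schedule can be sketched roughly as below. This is only an illustration under assumptions: the class name `MergeCandidates`, the `select` method, and the sample paths are hypothetical, and the file listing is an in-memory map rather than a real HDFS listing. It does not address the open question of how to wait for running readers.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: picks files old enough to be merged.
class MergeCandidates {

    // Returns the paths whose last-modified time is at least minAge before now,
    // in the iteration order of the input map.
    public static List<String> select(Map<String, Instant> modifiedByPath,
                                      Instant now,
                                      Duration minAge) {
        Instant cutoff = now.minus(minAge);
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Instant> entry : modifiedByPath.entrySet()) {
            // A file qualifies when it was modified at or before the cutoff.
            if (!entry.getValue().isAfter(cutoff)) {
                candidates.add(entry.getKey());
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2017-09-12T10:00:00Z");

        // Sample paths and timestamps are made up for illustration.
        Map<String, Instant> files = new LinkedHashMap<>();
        files.put("/data/in/part-0001.avro", Instant.parse("2017-09-11T09:00:00Z"));
        files.put("/data/in/part-0002.avro", Instant.parse("2017-09-12T09:00:00Z"));

        // Only files at least 24 hours old are selected for the merge.
        System.out.println(select(files, now, Duration.ofHours(24)));
    }
}
```

Shortening `minAge` to a couple of hours is exactly where this rule stops being safe on its own, because files inside the 24-hour reporting window would then become merge candidates while queries may still be reading them.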