I have a file ingestion scenario in which I receive many XML files at the same time and have to ingest them into Hive tables.
At 11 AM, I receive several zip files, each containing 7K+ XML files.
Each zip file is around 50 MB, and 500+ MB when unzipped.
Around 130 zip files arrive at 11 AM, so one batch is roughly 900K+ XML files and 65+ GB unzipped.
I have to unzip each zip file, remove the linefeed characters from the XML files, and merge all the XML files from all the zip files into one file.
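To make the per-zip step concrete, here is a minimal Python sketch of it. It is an illustration under assumptions, not my production code: the function name `flatten_zip` and the one-document-per-line output layout are hypothetical, and it replaces each line break with a space instead of deleting it outright, so that attributes split across lines do not run together. It reads members straight from the archive, so nothing is unzipped to disk first.

```python
import zipfile
from pathlib import Path


def flatten_zip(zip_path: Path, out_path: Path) -> int:
    """Write every XML member of zip_path to out_path, one document per line.

    Returns the number of XML documents written.
    (Hypothetical helper; name and output layout are assumptions.)
    """
    count = 0
    with zipfile.ZipFile(zip_path) as archive, open(out_path, "wb") as out:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip directories and non-XML members
            data = archive.read(name)
            # Join the document's lines with a single space so the whole
            # document sits on one line (handles \n, \r and \r\n).
            one_line = b" ".join(data.splitlines())
            out.write(one_line + b"\n")
            count += 1
    return count
```

Each zip file then yields one output file with 7K+ lines, and merging becomes a plain concatenation of those outputs.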
The same number of files arrives again at 9 PM, so I have a window of 10 hours to process the morning batch.
This one big file then has to be ingested. That is my current approach.
If someone can suggest a better way, it would be a great help in building this file ingestion framework. My main concern is the number of files arriving at once and the short window for processing the huge file. How can this work be done in a parallel or distributed manner?
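One direction I have been considering, since the zip files are independent of each other, is to fan them out over a pool of workers on a single machine. The sketch below is only a rough illustration of that idea using Python's standard `concurrent.futures` module. The names `run_batch` and `count_xml_members` are hypothetical, and the worker is a stand-in that merely counts XML members; a real worker would do the unzip and linefeed cleanup for one zip file.

```python
import zipfile
from concurrent.futures import (
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    as_completed,
)


def count_xml_members(zip_path):
    """Stand-in worker (hypothetical): count the XML members of one zip."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(1 for n in archive.namelist() if n.lower().endswith(".xml"))


def run_batch(zip_paths, worker, max_workers=8, use_processes=True):
    """Apply worker to every zip path in parallel.

    Returns (results, failures): two dicts keyed by zip path, so one bad
    archive does not abort the rest of the batch.
    """
    # Processes sidestep the GIL for CPU-heavy work; threads are lighter
    # and handy for quick local experiments.
    pool_cls = ProcessPoolExecutor if use_processes else ThreadPoolExecutor
    results, failures = {}, {}
    with pool_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in zip_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                failures[path] = exc
    return results, failures
```

When using processes, the call to `run_batch` should sit under an `if __name__ == "__main__":` guard, and the worker must be a top-level function so it can be pickled. If each worker wrote its own output file, I am not sure the final merge into one big file would even be needed, since a Hive external table can read all files in a directory. I would still like to know whether a distributed approach would suit this volume better.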