
XML File Ingestion framework


Hi All,

I have a file ingestion scenario where I'll be receiving multiple XML files at the same time and have to ingest them into Hive tables.

  • At 11AM, I will be receiving several zip files. Each zip file contains 7K+ XML files.
  • Each zip file is around 50MB in size and expands to 500+MB when unzipped.
  • The total number of zip files arriving at 11AM will be around 130.
  • I have to unzip each zip file, remove the linefeed characters from the XMLs, and merge all the XMLs from all the zip files.
  • The same number of files will arrive again at 9PM, so I get a window of 10 hours to process the morning batch of files.
  • Finally, this one big merged file has to be ingested. That is my current approach.

If someone can suggest a better way, it would be a great help in creating this file ingestion framework. My main concerns are the number of files arriving in one go and the short window for processing the huge file. How can this work be done in a parallel or distributed manner?
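Since each zip file is independent, the unzip-and-clean step described above can be parallelized with one worker per archive even before looking at a distributed tool. A minimal sketch, assuming each zip member is a well-formed XML document; the paths, output naming, and pool size are illustrative only:

```python
# Minimal sketch: clean each zip archive in its own worker process.
# Assumptions (not from the thread): zips sit in one input directory,
# every member ending in .xml is one complete XML document, and one
# cleaned intermediate file per zip is an acceptable merge unit.
import glob
import os
import zipfile
from multiprocessing import Pool


def clean_zip(zip_path: str) -> str:
    """Unzip one archive, strip linefeed characters from each XML,
    and write a single intermediate file per zip (one document per
    line). Returns the intermediate file's path."""
    out_path = zip_path + ".merged"
    with zipfile.ZipFile(zip_path) as zf, open(out_path, "wb") as out:
        for name in zf.namelist():
            if not name.endswith(".xml"):
                continue
            data = zf.read(name).replace(b"\r", b"").replace(b"\n", b"")
            out.write(data)
            out.write(b"\n")  # newline now only separates documents
    return out_path


def process_batch(in_dir: str, workers: int = 8) -> list:
    """Clean all zips in the batch directory in parallel."""
    zips = sorted(glob.glob(os.path.join(in_dir, "*.zip")))
    with Pool(workers) as pool:
        return pool.map(clean_zip, zips)
```

With ~130 zips per batch, a pool of 8 to 16 workers on one node may already fit the 10-hour window; the per-zip intermediate files can then be concatenated or, better, ingested as-is so downstream tools can parallelize over them.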




@Arun Yadav

Apache NiFi (part of Hortonworks DataFlow, HDF) allows you to build a pipeline capable of performing all the actions you need.

You would build a flow that, for the most part, will look like this:



With the latest versions, you can also benefit from the use of Record-based processors.


If this was helpful, please vote, and accept the best answer.

I am not clear on the benefit of merging all these XML files into a single huge XML file, which seems to be on the order of tens of GB. NiFi has a default limit of 1 GB per flowfile; that can be changed, but tens of GB is still a huge single file. What happens with this file eventually? What efficient method would ingest such a file better than multiple files? Every tool I know ingests properly sized multiple files better, so that parallelism can actually be achieved, and XML is not the most optimal format for a large-file ingest. I'd love to hear the reasoning behind this one big file, and why it has to remain XML rather than a more efficient format. NiFi could have converted the XML to something else.

An alternative to NiFi for this task would be to use Spark with an XML processing framework.
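With Spark, the merge step can be skipped entirely: the spark-xml data source parses the unzipped files in parallel and writes straight into Hive. A hedged sketch, assuming the spark-xml package (com.databricks:spark-xml) is on the classpath; the row tag `record`, the `/data/xml` base directory, and the `staging.xml_records` table are illustrative assumptions, not details from the thread:

```python
# Sketch of the Spark alternative. Requires pyspark plus the spark-xml
# package; all path, tag, and table names below are assumptions.

def xml_input_glob(base_dir: str, batch_date: str) -> str:
    """Build the input glob for one batch of unzipped XML files."""
    return f"{base_dir}/{batch_date}/*.xml"


def ingest(batch_date: str) -> None:
    # Imported lazily so the helpers above stay usable without pyspark.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("xml-ingest")
             .enableHiveSupport()
             .getOrCreate())

    df = (spark.read
          .format("com.databricks.spark.xml")  # spark-xml data source
          .option("rowTag", "record")          # assumed row element name
          .load(xml_input_glob("/data/xml", batch_date)))

    # Each file (and each record) is parsed in parallel across the
    # cluster, and no giant merged XML file is ever materialized.
    df.write.mode("append").saveAsTable("staging.xml_records")

# Usage: ingest("2018-05-14")
```

Embedded linefeeds inside records are handled by the XML parser itself, so the dedicated linefeed-stripping pass also goes away, and the Hive table can be stored in a columnar format such as ORC or Parquet instead of XML.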