Support Questions
Find answers, ask questions, and share your expertise

XML File Ingestion framework

New Contributor

Hi All,

I have a file ingestion scenario where I'll be getting multiple XML files at the same time and have to ingest them into Hive tables.

  • At 11 AM, I will receive several zip files, each containing 7K+ XML files.
  • Each zip file is around 50 MB and unzips to 500+ MB.
  • The total number of zip files arriving at 11 AM is around 130.
  • I have to unzip each zip file, remove the linefeed characters from the XMLs, and merge all those XMLs from all the zip files.
  • The same number of files arrives again at 9 PM, so I get a window of 10 hours to process the morning batch.
  • This one big merged file then has to be ingested. That is my current approach (a rough sketch follows below).
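For illustration, here is a minimal single-machine sketch of that approach in Python (the directory and file paths are hypothetical placeholders):

    import glob
    import zipfile

    LANDING_DIR = "/data/landing/xml_zips"    # hypothetical landing directory
    MERGED_FILE = "/data/staging/merged.xml"  # hypothetical merged output

    with open(MERGED_FILE, "wb") as merged:
        for zip_path in sorted(glob.glob(LANDING_DIR + "/*.zip")):  # ~130 zips
            with zipfile.ZipFile(zip_path) as archive:
                for name in archive.namelist():                     # ~7K XMLs per zip
                    if not name.endswith(".xml"):
                        continue
                    data = archive.read(name)
                    # Remove linefeed characters so each XML becomes one line.
                    merged.write(data.replace(b"\r", b"").replace(b"\n", b""))
                    merged.write(b"\n")

Done sequentially like this, each zip is processed one at a time, which is why I'm looking for a parallel or distributed approach.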

If someone can suggest a better way, it would be a great help in creating this file ingestion framework. My main concern is the number of files arriving in one go and the short window for processing the huge file. How can this work be done in a parallel or distributed manner?

Regards,

Arun

2 REPLIES

Re: XML File Ingestion framework

@Arun Yadav

Apache NiFi (part of Hortonworks DataFlow, HDF) allows you to build a pipeline capable of performing all the actions you need.

You would build a flow that, for the most part, will look like this:

[Screenshot: example NiFi flow]
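One possible processor chain for this use case (a sketch only; the exact processors and properties depend on your NiFi version and target system):

    ListFile -> FetchFile    # pick up the ~130 zip files from the landing directory
    -> UnpackContent         # unzip; emits one flowfile per XML inside the archive
    -> ReplaceText           # regex-replace the linefeed characters
    -> MergeContent          # bundle the cleaned XMLs into larger files
    -> PutHDFS               # land the merged output where Hive can read it

Because UnpackContent emits each XML as its own flowfile, the linefeed-stripping and merging steps are parallelized by NiFi across files and, in a cluster, across nodes.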

Here are some good references:

https://pierrevillard.com/2017/09/07/xml-data-processing-with-apache-nifi/

https://community.hortonworks.com/articles/65400/xml-processing-encoding-validation-parsing-splitti....

With the latest versions, you can also benefit from the use of Record-based processors.
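For example (assuming a NiFi release that includes the XMLReader controller service), a ConvertRecord processor can turn the XMLs into a format Hive handles more efficiently:

    ConvertRecord
        Record Reader: XMLReader            # parses each XML against a schema
        Record Writer: AvroRecordSetWriter  # emits Avro, which Hive reads natively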

+++

If this was helpful, please vote. Also, accept the best answer.

Re: XML File Ingestion framework

I am not clear on the benefit of merging all these XML files into a single huge XML file, which seems to be on the order of tens of GB. NiFi has a default limit of 1 GB per flowfile, and that can be changed; still, tens of GB is a huge single file. What happens with this file eventually? What method ingests such a file more efficiently than multiple files would? Every tool I know of ingests properly sized multiple files better, because parallelism can be properly achieved. XML is also not the most optimal format for a large ingest. I'd love to hear more about the reasoning behind this one big file, and why it has to remain XML rather than a more efficient format. NiFi could have converted the XML to something else.

An alternative to NiFi for this task would be to use Spark with an XML processing framework.
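As a rough sketch of that alternative (assuming the spark-xml package, e.g. com.databricks:spark-xml, and hypothetical paths, row tag, and table name; it also assumes the zips have already been unpacked):

    from pyspark.sql import SparkSession

    # Launch with e.g.:
    #   spark-submit --packages com.databricks:spark-xml_2.12:0.14.0 ingest.py
    spark = (SparkSession.builder
             .appName("xml-ingest")
             .enableHiveSupport()
             .getOrCreate())

    # spark-xml reads a whole directory of XML files in parallel;
    # "record" is a hypothetical row tag, adjust to the real XML structure.
    df = (spark.read.format("xml")
          .option("rowTag", "record")
          .load("/data/staging/xml/*.xml"))

    # Write straight into a Hive table in a columnar format,
    # avoiding the single merged XML file entirely.
    df.write.mode("append").format("parquet").saveAsTable("ingest_db.xml_records")

This parses all the XMLs in parallel across the cluster and writes Parquet, sidestepping both the merge step and the large-XML problem.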