Support Questions
Find answers, ask questions, and share your expertise

XML File Ingestion framework


Hi All,

I have a file ingestion scenario where I'll be receiving multiple XML files at the same time and have to ingest them into Hive tables.

  • At 11 AM, I will receive several zip files, each containing 7K+ XML files.
  • Each zip file is around 50 MB and expands to 500+ MB when unzipped.
  • Around 130 zip files arrive in the 11 AM batch.
  • I have to unzip each zip file, remove the linefeed characters from the XMLs, and merge the XMLs from all the zip files into one big file.
  • The same number of files arrives again at 9 PM, so I get a window of 10 hours to process the morning batch.
  • This one big merged file then has to be ingested. That is my current approach (sketched below).
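
For reference, here is a minimal sketch of that current approach, assuming illustrative paths and a plain single-threaded loop (the real locations and details differ):

```python
import glob
import zipfile

# Illustrative paths; the real locations differ.
ZIP_GLOB = "/data/incoming/11am/*.zip"
MERGED_FILE = "/data/staging/merged.xml"

with open(MERGED_FILE, "wb") as out:
    for zip_path in sorted(glob.glob(ZIP_GLOB)):
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                if not name.endswith(".xml"):
                    continue
                # Strip linefeed characters so each XML document
                # occupies a single line in the merged file.
                data = zf.read(name).replace(b"\r", b"").replace(b"\n", b"")
                out.write(data + b"\n")
```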

If someone can suggest a better way, it would be a great help in building this file ingestion framework. My main concerns are the number of files arriving in one go and the short window for processing the huge file. How can this work be done in a parallel or distributed manner?

Regards,

Arun


Re: XML File Ingestion framework

@Arun Yadav

Apache NiFi (part of Hortonworks DataFlow, HDF) allows you to build a pipeline capable of performing all the actions you need.

You would build a flow that, for the most part, will look like this:

[Screenshot of an example NiFi flow: screen-shot-2018-03-30-at-22942-pm.png]
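
Roughly, the flow chains stock NiFi processors along these lines (the processor names are standard, but the exact chain and configuration are only a sketch and depend on your environment):

ListFile -> FetchFile -> UnpackContent (zip) -> ReplaceText (strip linefeeds) -> MergeContent -> PutHDFS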

Here are some good references:

https://pierrevillard.com/2017/09/07/xml-data-processing-with-apache-nifi/

https://community.hortonworks.com/articles/65400/xml-processing-encoding-validation-parsing-splitti....

With the latest versions, you can also benefit from the use of Record-based processors.

+++

If this was helpful, please vote. Also, accept the best answer.

Re: XML File Ingestion framework

I am not clear on the benefit of merging all these XML files into a single huge XML file, which appears to be on the order of tens of GB. NiFi has a default limit of 1 GB per flowfile; that can be changed, but tens of GB is still an enormous single file. What happens with this file eventually? What method could ingest such a file more efficiently than multiple files? Every tool I know ingests properly sized multiple files better, because parallelism can then be achieved. XML is also not the most optimal format for a large file ingest. I'd love to hear more about the reasoning behind this one big file, and why it still has to be XML rather than a more efficient format; NiFi could have converted the XML to something else.

An alternative to NiFi for this task would be Spark with an XML processing framework such as spark-xml; a rough sketch follows.
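
As a minimal sketch of that route (the spark-xml package version, the rowTag value, the paths, and the table name are all assumptions to adapt to your schema):

```python
from pyspark.sql import SparkSession

# Submit with the spark-xml package on the classpath, e.g.:
#   spark-submit --packages com.databricks:spark-xml_2.12:0.17.0 ingest.py
spark = (SparkSession.builder
         .appName("xml-batch-ingest")
         .enableHiveSupport()
         .getOrCreate())

# Read all (already unzipped) XML files in parallel across the cluster.
# "record" is a placeholder for the element that delimits one row.
df = (spark.read
      .format("xml")
      .option("rowTag", "record")
      .load("/data/staging/11am/*.xml"))

# Land the data in a Hive table as Parquet; there is no need to merge
# the files into one big XML first, and Parquet reads far faster.
(df.write
   .mode("append")
   .format("parquet")
   .saveAsTable("ingest_db.xml_records"))
```

Spark distributes the read across the ~130 x 7K files on its own, which also addresses the 10-hour window concern.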
