I have a file ingestion scenario where I'll be receiving multiple XML files at the same time and have to ingest them into Hive tables.
If someone can suggest a better way to build this file ingestion framework, that would be a great help. My main concerns are the number of files arriving in one go and having only a short window to process the huge file. How can this work be done in a parallel or distributed manner?
Apache NiFi (part of HDF DataFlow) allows you to build a pipeline capable of performing all the actions you need.
You would build a flow that, for the most part, will look like this:
Here are some good references:
With the latest versions, you can also benefit from the use of Record-based processors.
I am not clear on the benefit of merging all these XML files into a single huge XML file, which seems to be on the order of tens of GB. NiFi has a default limit of 1 GB per flowfile; that can be changed, but tens of GB is a huge single file. What happens with this file eventually? What efficient method would be used to ingest such a file instead of multiple files? Every tool I know ingests multiple properly sized files better, since parallelism can then be properly achieved. XML is also not the most optimal format for a large-file ingest. I'd love to hear more about the reasoning behind this one big file, and why it has to remain XML rather than a more efficient format. NiFi could have converted the XML to something else.
An alternative to NiFi for this task would be to use Spark with an XML processing framework.