
I plan to write a three-part introductory series on how to handle XML files in NiFi. The subjects will be:

  • Basic XML and Feature Extraction via Text Management, Splitting and XPath
  • Interactive Text Handling with XQuery and Regex in Relation to XML
  • XML schema validation and transformations

XML data is read into the flowfile content when the file lands in NiFi. As long as it is valid XML, the five dedicated XML processors can be applied to it for management and feature extraction. Commonly, a user will want to get this XML data into a database, which requires extracting features and converting them to a new format such as JSON or Avro.

The simplest of the XML processors is “SplitXml”. It takes the incoming XML and breaks the child elements off into their own flowfiles. The depth of the split relative to the root is configurable, as shown below. This is helpful when, for example, you have a list of events, each of which should be treated separately.

[Screenshot: SplitXml processor configuration (3171-xmlimg1.jpg)]
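To make the idea of a split depth concrete, here is a minimal sketch in plain Python, outside NiFi. It only illustrates the concept and is not how SplitXml is implemented; the function name `split_children` and the sample `<events>` document are made up for this example.

```python
import xml.etree.ElementTree as ET


def split_children(xml_text, depth=1):
    """Return every element found `depth` levels below the root as its own XML string.

    Hypothetical helper for illustration only; depth=1 means the root's children.
    """
    level = [ET.fromstring(xml_text)]
    for _ in range(depth):
        # Step one level down: collect the children of every node at this level.
        level = [child for node in level for child in node]
    return [ET.tostring(node, encoding="unicode").strip() for node in level]


# Made-up sample: a list of events that should each be handled separately.
sample = (
    '<events>'
    '<event id="1"><type>login</type></event>'
    '<event id="2"><type>logout</type></event>'
    '</events>'
)

pieces = split_children(sample, depth=1)
for piece in pieces:
    print(piece)
```

With a depth of 1 each `<event>` element ends up as its own fragment, which is roughly what you want when every event should travel through the flow as a separate flowfile.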

XPath is a query syntax for extracting information from an XML document. It allows you to search for nodes based on hierarchy, name, or even attribute. It has limited regex integration and a framework for moderately complex queries. More complete documentation is available from W3Schools (http://www.w3schools.com/xsl/xpath_syntax.asp). The screenshot below shows the “EvaluateXPath” processor using XPath expressions to extract node data and an attribute. XPath should not be confused with XQuery, which I will cover in my next article.

[Screenshot: EvaluateXPath processor configuration (3172-xmlimg2.png)]
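If you want to try out the three kinds of lookup mentioned above (by hierarchy, by attribute, and by position) before wiring them into a processor, a small Python sketch can help. It assumes a made-up `<events>` document, and it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, so expressions that work in EvaluateXPath may not all work here.

```python
import xml.etree.ElementTree as ET

# Made-up sample document; element and attribute names are assumptions.
doc = ET.fromstring(
    '<events>'
    '<event id="1" severity="low"><type>login</type><user>alice</user></event>'
    '<event id="2" severity="high"><type>logout</type><user>bob</user></event>'
    '</events>'
)

extracted = {
    # By hierarchy: text of the first <type> under an <event> child of the root.
    "first_type": doc.findtext("./event/type"),
    # By attribute: the <user> of the event whose severity attribute is "high".
    "high_user": doc.findtext("./event[@severity='high']/user"),
    # By position, then read an attribute value from the matched node.
    "second_id": doc.find("./event[2]").get("id"),
}

print(extracted)
```

Each key in `extracted` plays the role of a property name you would add to the processor, and each value is the string that the expression pulled out of the document.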

Executing the EvaluateXPath processor does something very important: the values extracted from the XML become NiFi flowfile attributes. This allows us to apply the routing and other intelligence that is NiFi's signature. One transformation I have previously worked on is getting XML data into Avro format for easy ingestion. At this time, all of the Avro processors in NiFi play nicely with JSON, so the “AttributesToJSON” processor can be used as an out-of-the-box intermediary to get the format you need. Note that I have set the destination of the processor to “flowfile-content”, which overwrites the existing XML content with the JSON.

[Screenshot: AttributesToJSON processor configuration (3173-xmlimg3.png)]
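Conceptually, the attributes-to-JSON step takes a flat map of attribute names and values and serializes it as the new content. The sketch below shows that idea in plain Python; it is not the processor's code, and the function `attributes_to_json` and the attribute names (`event.id`, `event.type`) are hypothetical.

```python
import json


def attributes_to_json(attributes, include=None):
    """Serialize a flat attribute map to a JSON string.

    Hypothetical helper for illustration. `include` optionally restricts
    the output to the named attributes; missing names are skipped.
    """
    if include is None:
        selected = dict(attributes)
    else:
        selected = {name: attributes[name] for name in include if name in attributes}
    return json.dumps(selected, sort_keys=True)


# Made-up attributes, as they might look after the XPath extraction step.
attrs = {"event.id": "1", "event.type": "login", "filename": "events.xml"}

# The JSON string stands in for the new flowfile content replacing the XML.
content = attributes_to_json(attrs, include=["event.id", "event.type"])
print(content)
```

Restricting the output to a chosen list keeps housekeeping attributes such as `filename` out of the record that is later converted to Avro.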

With JSON content plus attributes, this is a very easy flowfile to work with; it can be merged into existing workflows or written out to a file for the Hive SerDe.

Comments
New Contributor

Hi Chris, I have a text file that contains multiple messages, and the XPath gives an error. When I have only one XML message in the file, it works fine. Could you suggest what I could do to read a file with multiple messages?

New Contributor

Hi

An XML tree is complex because it is hierarchical, and you most likely want a flat structure for easier access to the data.

I just wrapped up the second article on this yesterday, and the code is available at the GitHub link included in the article.

http://max.bback.se/index.php/2018/06/30/xml-to-tables-csv-with-nifi-and-groovy-part-2-of-2/

The article describes the problem and provides an implementation for converting XML to CSV by flattening the XML files. My example XML is flattened into 4 tables; it all depends on how many one-to-many branches you have.

/Max

Last update: 08-17-2019 12:59 PM (revision 2 of 2)