I'm going to cover a simple OSS solution to Shredding XML in NiFi, and demonstrate how you can chain simple steps together to achieve common data shredding tasks.
Feel free to get in touch if you need to achieve something more complex than these basic steps will allow.
We will be covering:
Procedurally converting Xml to Json using a fast XSLT 2.0 template
Constructing Jolt transforms to extract nested subsections of JSON documents
Constructing JsonPath expressions to split multi-record JSON documents
Procedurally flattening complex nested JSON for easy querying
This process is shown on NiFi-1.2.0, and tested on a variety of XML documents, but most notably a broad collection of GuideWire sample XMLs as part of a Client PoC. The XML examples below have retained the nested structure but anonymised the fields.
XML to JSON
Here we combine the NiFI TransformXML processor with the excellent BSD-licensed xsltjson procedural converter found at https://github.com/bramstein/xsltjson. Simply check out the repo and set the XSLT filename in the processor to xsltjson/conf/xml-to-json.xsl.
There are several conversion options present, I suggest the Badgerfish notation if you want an easier time of validating your conversion accuracy, but the default conversion is suitably compact from uncomplicated XMLs.
Coming to both XSLT and Jolt as a new user, I found Jolt far easier to learn and use - Relying on the every popular StackExchange, Jolt answers tended to teach you to fish, whereas XSLT answers were usually selling you a fish.
Handily, NiFi has a built in editor if you use the Advanced button on the JoltTransformJSON processor, this mimics the behaviour on the popular http://jolt-demo.appspot.com/ site for building your transforms.
A key thing to note is setting the Jolt DSL to 'Chain' in the NiFi processor, and then using your various 'spec' settings within the Transforms specified. This will align the NiFi processor behaviour with the Jolt-demo.
Building a Jolt spec is about defining steps from the root of the document, and there are excellent guides elsewhere on the internet, but here is a simple but useful example.
Given the previous example of Xml converted to Json, this Jolt transform would check each quote subsection of the BrokerResponse, and if it contains an instalments section, return it in an array called quoteOffers, and drop any quotes that don't contain an Instalments section, such as the declined offers:
This is a very simple process, again with good examples available out on the wider internet.
Using the above example again, if we received multiple quoteResponses in a single document we'd then have multiple instalment responses, and we might want to split them out into one quote per document, this would be as simple as using the following:
This specifies the root of the document using $, the instalments array, and then emitting each child item as a separate Flowfile.
Something else you might want to do is Flatten your complex nested structures into simple iterables without having to specify a schema. This can be really useful if you just want to load the shredded XML for further analysis in Python without having traverse the structure to get at the bits you're interested in.
So there you have it, with only 3 lines of code we've converted arbitrary nested XML into JSON, filtered out bits of the document we don't want (declined quotes), extracted the section of the quotes we want to process (quoteOffers), split each quote into a single document (Instalments), and then flattened the rest of the quoteResponse into a flat JSON document for further analysis.
Feel free to contact me if you have a shredding challenge we might be able to help you with.