Community Articles
Find and share helpful community-sourced technical articles
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here. Want to know more about what has changed? Check out the Community News blog.
Labels (1)

I'm going to cover a simple OSS solution to Shredding XML in NiFi, and demonstrate how you can chain simple steps together to achieve common data shredding tasks.

Feel free to get in touch if you need to achieve something more complex than these basic steps will allow.

We will be covering:

  • Procedurally converting Xml to Json using a fast XSLT 2.0 template
  • Constructing Jolt transforms to extract nested subsections of JSON documents
  • Constructing JsonPath expressions to split multi-record JSON documents
  • Procedurally flattening complex nested JSON for easy querying

This process is shown on NiFi-1.2.0, and tested on a variety of XML documents, but most notably a broad collection of GuideWire sample XMLs as part of a Client PoC. The XML examples below have retained the nested structure but anonymised the fields.

XML to JSON

Here we combine the NiFI TransformXML processor with the excellent BSD-licensed xsltjson procedural converter found at https://github.com/bramstein/xsltjson. Simply check out the repo and set the XSLT filename in the processor to xsltjson/conf/xml-to-json.xsl. There are several conversion options present, I suggest the Badgerfish notation if you want an easier time of validating your conversion accuracy, but the default conversion is suitably compact from uncomplicated XMLs.

So your Input XML might look something like this:

<BrokerResponse>
  <aggsId>3423897f9w8v89yb99873r</aggsId>
  <quote>
    <brandID>AB</brandID>
    <brandDescription>Corp</brandDescription>
    <quoteReference>0023400010050105</quoteReference>
    <annualPremium>271.45</annualPremium>
    <totalPremiumPNCD>304.56</totalPremiumPNCD>
    <pncdIndicator>true</pncdIndicator>
    <productType>Private Car</productType>
    <insurerName>SomeRandom Insurance Company Limited</insurerName>
    <coverType>comprehensive</coverType>
    <instalments>
      <instalmentScheme>12 at 13.9% (qr:27)</instalmentScheme>
      <instalmentType>Monthly</instalmentType>
      <downPayment>29.18</downPayment>
      <downPaymentPercentage>8.3385725</downPaymentPercentage>
      <totalInstalmentPremium>349.94</totalInstalmentPremium>
      <paymentAmount>29.16</paymentAmount>
      <noOfPayments>11</noOfPayments>
      <interestAmount>45.38</interestAmount>
      <apr>42.8</apr>
    </instalments>
    <vehicle>
      <excess>
        <name>PCAccidentalDamageCov_Ext</name>
        <amount>95.0</amount>
      </excess>
... etc. 

And your output would look something like this (these strings aren't identical due to my data anonymization):

{
  "BrokerResponse" : {
    "aggsId" : "4598e79g8798f298f",
    "quote" : [ {
      "brandID" : "AB",
      "brandDescription" : "Corp",
      "quoteReference" : "0000120404010",
      "annualPremium" : 271.45,
      "totalPremiumPNCD" : 304.56,
      "pncdIndicator" : true,
      "productType" : "Private Car",
      "insurerName" : "SomeRandom Insurance Company Limited",
      "coverType" : "comprehensive",
      "instalments" : {
        "instalmentScheme" : "12 at 12.3% (qr:33)",
        "instalmentType" : "Monthly",
        "downPayment" : 29.18,
        "downPaymentPercentage" : 8.3385725,
        "totalInstalmentPremium" : 349.94,
        "paymentAmount" : 29.16,
        "noOfPayments" : 11,
        "interestAmount" : 45.38,
        "apr" : 29.9
      }, {
      "brandID" : "BC",
      "brandDescription" : "Acme Essential",
      "quoteReference" : "NA",
      "isDeclined" : true,
      "quoteErrors" : {
        "errorCode" : "QUOTE_DECLINED",
        "errorDescription" : "Quote Declined"
      }
    }
    }
   ]
  }
} 

Using Jolt to extract sections

Coming to both XSLT and Jolt as a new user, I found Jolt far easier to learn and use - Relying on the every popular StackExchange, Jolt answers tended to teach you to fish, whereas XSLT answers were usually selling you a fish.

Handily, NiFi has a built in editor if you use the Advanced button on the JoltTransformJSON processor, this mimics the behaviour on the popular http://jolt-demo.appspot.com/ site for building your transforms. A key thing to note is setting the Jolt DSL to 'Chain' in the NiFi processor, and then using your various 'spec' settings within the Transforms specified. This will align the NiFi processor behaviour with the Jolt-demo.

Building a Jolt spec is about defining steps from the root of the document, and there are excellent guides elsewhere on the internet, but here is a simple but useful example. Given the previous example of Xml converted to Json, this Jolt transform would check each quote subsection of the BrokerResponse, and if it contains an instalments section, return it in an array called quoteOffers, and drop any quotes that don't contain an Instalments section, such as the declined offers:

[
  {
    "operation": "shift",
    "spec": {
      "BrokerResponse": {
        "quote": {
          "*": {
            "instalments": {
              "@1": "quoteOffers[]"
            }
          }
        }
      }
    }
  }
] 

This next Jolt transform would select just the Instalments section from the previous output of quoteOffers, and drop the rest of the details:

[
  {
    "operation": "shift",
    "spec": {
      "quoteOffers": {
          "*": {
            "instalments": {
              "@0": "instalments[]"
            }
        }
      }
    }
  }
] 

Much simpler than XSLT!

Using JsonPath to split documents

This is a very simple process, again with good examples available out on the wider internet. Using the above example again, if we received multiple quoteResponses in a single document we'd then have multiple instalment responses, and we might want to split them out into one quote per document, this would be as simple as using the following:

$.instalments.* 

This specifies the root of the document using $, the instalments array, and then emitting each child item as a separate Flowfile.

Flattening Json

Something else you might want to do is Flatten your complex nested structures into simple iterables without having to specify a schema. This can be really useful if you just want to load the shredded XML for further analysis in Python without having traverse the structure to get at the bits you're interested in.

I came an the excellent Apache licensed java lib at https://github.com/wnameless/json-flattener, which I have wrapped into a NiFi-1.2-0 compatible processor at https://github.com/Chaffelson/nifi-flatjson-bundle. There are many more options within the lib that I have not taken the time to expose yet, including making it reversible!

Again using our example XML document from above, the flattened output might look a bit like this:

{
  "quoteOffers[0].brandID" : "AB",
  "quoteOffers[0].brandDescription" : "Corp",
  "quoteOffers[0].quoteReference" : "004050025001001",
  "quoteOffers[0].annualPremium" : 271.45,
  "quoteOffers[0].totalPremiumPNCD" : 304.56,
  "quoteOffers[0].pncdIndicator" : true,
  "quoteOffers[0].productType" : "Private Car",
  "quoteOffers[0].insurerName" : "SomeRandom Insurance Company Limited",
  "quoteOffers[0].coverType" : "comprehensive",
  "quoteOffers[0].instalments.instalmentScheme" : "12 at 13.9% (qr:2)2",
  "quoteOffers[0].instalments.instalmentType" : "Monthly",
  "quoteOffers[0].instalments.downPayment" : 29.18,
  "quoteOffers[0].instalments.downPaymentPercentage" : 8.3385725,
  "quoteOffers[0].instalments.totalInstalmentPremium" : 349.94,
  "quoteOffers[0].instalments.paymentAmount" : 29.16,
  "quoteOffers[0].instalments.noOfPayments" : 11,
  "quoteOffers[0].instalments.interestAmount" : 45.38,
  "quoteOffers[0].instalments.apr" : 23.9,
  "quoteOffers[0].vehicle.excess[0].name" : "PCAccidentalDamageCov_Ext",
  "quoteOffers[0].vehicle.excess[0].amount" : 95.0,
  "quoteOffers[0].vehicle.excess[1].name" : "PCLossFireTheftCov_Ext",
  "quoteOffers[0].vehicle.excess[1].amount" : 95.0,
  "quoteOffers[0].vehicle.excess[2].name" : "PCTheftKeysTransmitterCov_Ext",
  "quoteOffers[0].vehicle.excess[2].amount" : 95.0,
  "quoteOffers[0].vehicle.excess[3].name" : "PCGlassDmgWrepairdmgCT_Ext",
  "quoteOffers[0].vehicle.excess[3].amount" : 25.0,
  "quoteOffers[0].vehicle.excess[4].name" : "PCGlassDmgWreplacementdmgCT_Ext",
  "quoteOffers[0].vehicle.excess[4].amount" : 85.0,
  "quoteOffers[0].vehicle.excess[5].name" : "Voluntary Excess",
  "quoteOffers[0].vehicle.excess[5].amount" : 100.0,
... etc. 

Conclusion

So there you have it, with only 3 lines of code we've converted arbitrary nested XML into JSON, filtered out bits of the document we don't want (declined quotes), extracted the section of the quotes we want to process (quoteOffers), split each quote into a single document (Instalments), and then flattened the rest of the quoteResponse into a flat JSON document for further analysis.

Feel free to contact me if you have a shredding challenge we might be able to help you with.

5,236 Views
Comments
New Contributor

Hi Dan,

I am still having trouble creating a HIve table using nested XML file. Coul you please help me. I did include XLST in the XLST tranformation but still unable to load the data.I am having issues with shredding data too

Thanks,

Harish

Not applicable

Hi Dan,

Thanks for such a great tutorial. I would like to know where are you passing the Badgerfish notation while mentioning the xml-to-json.xsl in TransformXML Processor in NiFi?

Regards

Not applicable

Thanks for the tutorial.

Not applicable

You can fully automate the process with third party tools, e.g. have a look at this blog post for converting XML Guidewire to a relational database, Hive, ORC or Parquet
https://sonra.io/2019/08/08/masking-and-converting-guidewire-xml-to-oracle/

Don't have an account?
Coming from Hortonworks? Activate your account here
Version history
Revision #:
1 of 1
Last update:
‎05-31-2017 09:52 AM
Updated by:
 
Contributors
Top Kudoed Authors