Member since: 02-08-2016
Posts: 80
Kudos Received: 88
Solutions: 13
My Accepted Solutions
Views | Posted
---|---
3902 | 12-15-2018 08:40 PM
3271 | 03-29-2018 10:15 AM
1332 | 02-01-2018 08:08 AM
2091 | 01-24-2018 09:23 PM
1241 | 11-05-2017 03:24 PM
06-19-2017
07:42 AM
@yjiang Glad to hear it's working - there should be a couple of bash scripts in /usr/hdf/current/streamline/bootstrap that are better to use than the SQL directly, as they also create the UDFs, roles, and so on.
I'd suggest you run ./bootstrap-storage.sh drop-create and then ./bootstrap.sh to ensure you have a clean install.
06-16-2017
04:50 PM
5 Kudos
@yjiang @Andres Koitmäe @Jerry Johnson This message usually appears because you are using an unsupported database version, most commonly because you installed SAM against the Ambari database; unfortunately, at this time SAM and Ambari do not support a shared database version.
For your reference, the support matrices are here: https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.0.0/bk_support-matrices/content/ch_matrices-hdf.html
05-31-2017
10:20 AM
@James Dong @Hans Pointner
This stylesheet is incomplete for arbitrary XML conversion; I suggest you try https://github.com/bramstein/xsltjson
I've documented use of it with the TransformXML processor in a new article here: https://community.hortonworks.com/content/kbentry/105547/nifi-xml-to-json-shredding-a-generalised-solution-3.html
05-31-2017
09:52 AM
9 Kudos
I'm going to cover a simple OSS solution to shredding XML in NiFi, and demonstrate how you can chain simple steps together to achieve common data-shredding tasks. Feel free to get in touch if you need to achieve something more complex than these basic steps will allow. We will be covering:
- Procedurally converting XML to JSON using a fast XSLT 2.0 template
- Constructing Jolt transforms to extract nested subsections of JSON documents
- Constructing JsonPath expressions to split multi-record JSON documents
- Procedurally flattening complex nested JSON for easy querying

This process is shown on NiFi 1.2.0, and was tested on a variety of XML documents, most notably a broad collection of GuideWire sample XMLs as part of a client PoC. The XML examples below have retained the nested structure but anonymised the fields.

XML to JSON

Here we combine the NiFi TransformXML processor with the excellent BSD-licensed xsltjson procedural converter found at https://github.com/bramstein/xsltjson. Simply check out the repo and set the XSLT filename in the processor to xsltjson/conf/xml-to-json.xsl.
There are several conversion options present; I suggest the Badgerfish notation if you want an easier time validating your conversion accuracy, but the default conversion is suitably compact for uncomplicated XMLs.
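To get a feel for what such a procedural conversion does, here is a rough Python stand-in for a default-style XML-to-JSON conversion. This is not xsltjson itself, just a minimal sketch: it collapses text-only leaves to plain string values and ignores namespaces and numeric type coercion, all of which the real stylesheet handles properly.

```python
import json
import xml.etree.ElementTree as ET

def xml_to_dict(elem):
    """Very rough XML-to-JSON conversion: attributes become '@'-prefixed
    keys, text becomes '$', children nest by tag name, repeated tags
    collect into lists. A sketch only, not xsltjson."""
    out = {}
    for key, val in elem.attrib.items():
        out["@" + key] = val
    text = (elem.text or "").strip()
    if text and len(elem) == 0 and not elem.attrib:
        return text  # text-only leaf collapses to a plain value
    if text:
        out["$"] = text
    for child in elem:
        converted = xml_to_dict(child)
        if child.tag in out:  # repeated tag -> promote to a list
            existing = out[child.tag]
            if not isinstance(existing, list):
                out[child.tag] = [existing]
            out[child.tag].append(converted)
        else:
            out[child.tag] = converted
    return out

xml = "<quote><brandID>AB</brandID><annualPremium>271.45</annualPremium></quote>"
root = ET.fromstring(xml)
result = {root.tag: xml_to_dict(root)}
print(json.dumps(result))
```

Note that this sketch leaves numbers as strings; one of the reasons to use xsltjson in the real flow is that it produces properly typed, compact JSON.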
So your input XML might look something like this:

<BrokerResponse>
<aggsId>3423897f9w8v89yb99873r</aggsId>
<quote>
<brandID>AB</brandID>
<brandDescription>Corp</brandDescription>
<quoteReference>0023400010050105</quoteReference>
<annualPremium>271.45</annualPremium>
<totalPremiumPNCD>304.56</totalPremiumPNCD>
<pncdIndicator>true</pncdIndicator>
<productType>Private Car</productType>
<insurerName>SomeRandom Insurance Company Limited</insurerName>
<coverType>comprehensive</coverType>
<instalments>
<instalmentScheme>12 at 13.9% (qr:27)</instalmentScheme>
<instalmentType>Monthly</instalmentType>
<downPayment>29.18</downPayment>
<downPaymentPercentage>8.3385725</downPaymentPercentage>
<totalInstalmentPremium>349.94</totalInstalmentPremium>
<paymentAmount>29.16</paymentAmount>
<noOfPayments>11</noOfPayments>
<interestAmount>45.38</interestAmount>
<apr>42.8</apr>
</instalments>
<vehicle>
<excess>
<name>PCAccidentalDamageCov_Ext</name>
<amount>95.0</amount>
</excess>
... etc.

And your output would look something like this (these strings aren't identical due to my data anonymisation):

{
"BrokerResponse" : {
"aggsId" : "4598e79g8798f298f",
"quote" : [ {
"brandID" : "AB",
"brandDescription" : "Corp",
"quoteReference" : "0000120404010",
"annualPremium" : 271.45,
"totalPremiumPNCD" : 304.56,
"pncdIndicator" : true,
"productType" : "Private Car",
"insurerName" : "SomeRandom Insurance Company Limited",
"coverType" : "comprehensive",
"instalments" : {
"instalmentScheme" : "12 at 12.3% (qr:33)",
"instalmentType" : "Monthly",
"downPayment" : 29.18,
"downPaymentPercentage" : 8.3385725,
"totalInstalmentPremium" : 349.94,
"paymentAmount" : 29.16,
"noOfPayments" : 11,
"interestAmount" : 45.38,
"apr" : 29.9
}
}, {
"brandID" : "BC",
"brandDescription" : "Acme Essential",
"quoteReference" : "NA",
"isDeclined" : true,
"quoteErrors" : {
"errorCode" : "QUOTE_DECLINED",
"errorDescription" : "Quote Declined"
}
}
]
}
}

Using Jolt to extract sections

Coming to both XSLT and Jolt as a new user, I found Jolt far easier to learn and use. Relying on the ever-popular StackExchange, Jolt answers tended to teach you to fish, whereas XSLT answers were usually selling you a fish. Handily, NiFi has a built-in editor if you use the Advanced button on the JoltTransformJSON processor; this mimics the behaviour of the popular http://jolt-demo.appspot.com/ site for building your transforms.
A key thing to note is setting the Jolt DSL to 'Chain' in the NiFi processor, and then using your various 'spec' settings within the Transforms specified. This will align the NiFi processor behaviour with the Jolt-demo. Building a Jolt spec is about defining steps from the root of the document, and there are excellent guides elsewhere on the internet, but here is a simple but useful example.
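To make the shift's effect concrete before looking at the spec, here is the same selection written as plain Python. This is a hypothetical stand-in for the Jolt engine, using a cut-down version of the quote document with made-up values:

```python
# A cut-down version of the converted quote document (hypothetical values).
doc = {
    "BrokerResponse": {
        "quote": [
            {"brandID": "AB", "instalments": {"apr": 29.9}},
            {"brandID": "BC", "isDeclined": True},
        ]
    }
}

# Keep only quotes that contain an instalments section, collecting them
# into a quoteOffers array - the same effect as the Jolt shift below.
result = {
    "quoteOffers": [
        quote
        for quote in doc["BrokerResponse"]["quote"]
        if "instalments" in quote
    ]
}
print(result)
```

The declined BC quote has no instalments section, so it is dropped, exactly as the shift spec does.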
Given the previous example of XML converted to JSON, this Jolt transform checks each quote subsection of the BrokerResponse and, if it contains an instalments section, returns it in an array called quoteOffers, dropping any quotes that don't contain an instalments section, such as the declined offers:

[
{
"operation": "shift",
"spec": {
"BrokerResponse": {
"quote": {
"*": {
"instalments": {
"@1": "quoteOffers[]"
}
}
}
}
}
}
]

This next Jolt transform would select just the instalments section from the previous output of quoteOffers, and drop the rest of the details:

[
{
"operation": "shift",
"spec": {
"quoteOffers": {
"*": {
"instalments": {
"@0": "instalments[]"
}
}
}
}
}
]

Much simpler than XSLT!

Using JsonPath to split documents

This is a very simple process, again with good examples available out on the wider internet.
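As a plain-Python sketch of this kind of split (not NiFi itself; the document and values here are hypothetical), each match of the path becomes its own output document:

```python
import json

# A document holding several instalment records (hypothetical values).
doc = {"instalments": [{"apr": 29.9}, {"apr": 23.9}]}

# A path like $.instalments.* selects each child of the instalments
# array; in NiFi each match would become a separate FlowFile. Here we
# just emit one JSON string per record.
flowfiles = [json.dumps(record) for record in doc["instalments"]]
print(flowfiles)
```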
Using the above example again, if we received multiple quoteResponses in a single document we'd then have multiple instalment responses, and we might want to split them out into one quote per document. This is as simple as using the following JsonPath expression:

$.instalments.*

This specifies the root of the document using $, then the instalments array, and then emits each child item as a separate FlowFile.

Flattening JSON

Something else you might want to do is flatten your complex nested structures into simple iterables without having to specify a schema. This can be really useful if you just want to load the shredded XML for further analysis in Python without having to traverse the structure to get at the bits you're interested in. I came across the excellent Apache-licensed Java lib at https://github.com/wnameless/json-flattener, which I have wrapped into a NiFi-1.2.0-compatible processor at https://github.com/Chaffelson/nifi-flatjson-bundle. There are many more options within the lib that I have not taken the time to expose yet, including making the flattening reversible!

Again using our example XML document from above, the flattened output might look a bit like this:

{
"quoteOffers[0].brandID" : "AB",
"quoteOffers[0].brandDescription" : "Corp",
"quoteOffers[0].quoteReference" : "004050025001001",
"quoteOffers[0].annualPremium" : 271.45,
"quoteOffers[0].totalPremiumPNCD" : 304.56,
"quoteOffers[0].pncdIndicator" : true,
"quoteOffers[0].productType" : "Private Car",
"quoteOffers[0].insurerName" : "SomeRandom Insurance Company Limited",
"quoteOffers[0].coverType" : "comprehensive",
"quoteOffers[0].instalments.instalmentScheme" : "12 at 13.9% (qr:2)2",
"quoteOffers[0].instalments.instalmentType" : "Monthly",
"quoteOffers[0].instalments.downPayment" : 29.18,
"quoteOffers[0].instalments.downPaymentPercentage" : 8.3385725,
"quoteOffers[0].instalments.totalInstalmentPremium" : 349.94,
"quoteOffers[0].instalments.paymentAmount" : 29.16,
"quoteOffers[0].instalments.noOfPayments" : 11,
"quoteOffers[0].instalments.interestAmount" : 45.38,
"quoteOffers[0].instalments.apr" : 23.9,
"quoteOffers[0].vehicle.excess[0].name" : "PCAccidentalDamageCov_Ext",
"quoteOffers[0].vehicle.excess[0].amount" : 95.0,
"quoteOffers[0].vehicle.excess[1].name" : "PCLossFireTheftCov_Ext",
"quoteOffers[0].vehicle.excess[1].amount" : 95.0,
"quoteOffers[0].vehicle.excess[2].name" : "PCTheftKeysTransmitterCov_Ext",
"quoteOffers[0].vehicle.excess[2].amount" : 95.0,
"quoteOffers[0].vehicle.excess[3].name" : "PCGlassDmgWrepairdmgCT_Ext",
"quoteOffers[0].vehicle.excess[3].amount" : 25.0,
"quoteOffers[0].vehicle.excess[4].name" : "PCGlassDmgWreplacementdmgCT_Ext",
"quoteOffers[0].vehicle.excess[4].amount" : 85.0,
"quoteOffers[0].vehicle.excess[5].name" : "Voluntary Excess",
"quoteOffers[0].vehicle.excess[5].amount" : 100.0,
... etc.

Conclusion

So there you have it: with only three small transforms (two Jolt specs and a JsonPath expression) we've converted arbitrary nested XML into JSON, filtered out the bits of the document we don't want (declined quotes), extracted the section of the quotes we want to process (quoteOffers), split each quote into a single document (instalments), and flattened the rest of the quoteResponse into a flat JSON document for further analysis. Feel free to contact me if you have a shredding challenge we might be able to help you with.
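Returning to the flattening step above: the dotted-path notation that json-flattener produces can be sketched in a few lines of Python, assuming only dicts and lists (the real library handles many more cases, including reversing the flattening):

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into dotted-path keys, in the style of
    json-flattener's output. A rough sketch, not the library itself."""
    out = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            out.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            out.update(flatten(child, f"{prefix}[{index}]"))
    else:
        out[prefix] = value  # scalar leaf: record its full path
    return out

nested = {"quoteOffers": [{"brandID": "AB", "instalments": {"apr": 29.9}}]}
flat = flatten(nested)
print(flat)
```

The output keys follow the same quoteOffers[0].instalments.apr pattern shown in the flattened example above.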
05-18-2017
04:29 PM
1 Kudo
Here is a gross simplification that might be helpful:
Exactly-once delivery usually requires that the source and destination systems can somehow agree on a method combining at-least-once delivery with data deduplication. NiFi can be the transport layer providing at-least-once delivery between the systems, but Kafka to NiFi alone, without those semantics or some additional approach, will not satisfy exactly-once requirements.
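To make the simplification concrete, here is a sketch of the deduplication half: a destination that tolerates at-least-once redelivery by ignoring message IDs it has already seen. The IDs and the in-memory store are hypothetical; a real system needs the seen-set to be durable and shared.

```python
# Sketch: at-least-once delivery plus idempotent receipt gives
# effectively-once processing. 'seen' stands in for a durable store.
seen = set()
processed = []

def receive(message_id, payload):
    """Process a message unless its ID was already handled (dedup)."""
    if message_id in seen:
        return False  # duplicate redelivery, safely ignored
    seen.add(message_id)
    processed.append(payload)
    return True

# An at-least-once transport may redeliver message 1:
receive(1, "a")
receive(2, "b")
receive(1, "a")  # duplicate, dropped by the dedup check
```

The point is that the agreement on message IDs between source and destination is what upgrades at-least-once transport to exactly-once processing.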
05-15-2017
09:39 AM
Hi @Raghav Ramakrishann, sorry I only just saw this comment as I've been away on paternity leave.
Can you share the version of CDH you're connecting to, and your service parameters? I might be able to troubleshoot a bit.
04-12-2017
01:24 AM
6 Kudos
I recently did a PoC with a customer to integrate NiFi with CDH; part of this was creating external tables in Hive on the newly loaded data. In this article I will share the approaches and useful workarounds, show how to customise your own NiFi build for backwards compatibility, and provide a pre-built CDH-compatible Hive bundle for you to download and try.

So first, why is this necessary? The short answer is that NiFi 1.x's minimum supported version of Hive is 1.2.x, but CDH uses a fork of Hive 1.1.x, which introduces two common backwards-compatibility challenges. The first is that it uses an older version of Thrift, so we need to configure NiFi to use this same version if we want to talk directly. The second is that new features introduced after version 1.1.0 aren't available in the CDH release, so we have to stop NiFi from looking for them. The obvious other option is to work with CDH Hive indirectly, and thus we come to the workarounds.

Workarounds

It is very common in PoCs to not have all the software and configuration parameters exactly as you would like them, and to have no time to wait for change control to allow installs and firewall modifications. One of the great things about NiFi is the flexibility to quickly work around roadblocks, so here's the list of workarounds investigated:
- The WebHCat service provides a REST API to run Hive queries, which we could have accessed using the NiFi HTTP processors; unfortunately the port was blocked at the firewall.
- The Beeline client could have been run via the NiFi Execute processors; however, the NiFi server was outside the test CDH cluster and there was no available license for installing another gateway, nor time for the change control.
- Stream the Hive queries through a bash runner via an SSH tunnel into an existing edge node on the test CDH cluster using NiFi ExecuteStream processors; this works, but breaks various rules.
- Modify the NiFi Hive processors to be Cloudera-compatible, if not officially supported...

A pre-built NiFi Hive bundle for CDH 5.10.0

Note that I have only tested the Hive bundle functionality against CDH 5.10.0, not any of the other processors such as HDFS or Kafka, nor other versions. Neither I nor Hortonworks offer guarantees that this or other services will work against CDH, and you should thoroughly test things before trusting them with important data. Here is a Hive bundle I've built for CDH 5.10.0; just copy it into your nifi/lib directory and restart the service, and you should be able to connect the PutHiveQL and SelectHiveQL processors to your Hive2 service. (dropbox link to file)

How to create your own Cloudera-compatible NiFi Hive Bundle

The following instructions were tested on a CentOS 7 VM.
ssh <build server FQDN>
sudo su -
yum update -y
yum install -y wget
# Install Maven, Java1.8, Git, to meet minimum NiFi build requirements.
wget http://repos.fedorapeople.org/repos/dchen/apache-maven/epel-apache-maven.repo -O /etc/yum.repos.d/epel-apache-maven.repo
yum install -y git java-1.8.0-openjdk apache-maven
logout
git clone https://github.com/Chaffelson/nifi.git
cd nifi
git checkout nifi-1.1.x-cdhHiveBundle
mvn -T C2.0 clean install -Pcloudera -Dhive.version=1.1.0-cdh5.10.0 -Dhive.hadoop.version=2.6.0-cdh5.10.0 -Dhadoop.version=2.6.0-cdh5.10.0 -DskipTests
nifi-assembly/target/nifi-1.1.1-SNAPSHOT-bin/nifi-1.1.1-SNAPSHOT/bin/nifi.sh start
# browse to http://<build server FQDN>:8080/nifi to test your new hive bundle

I have created a branch of NiFi 1.1.x and modified it so the Hive bundle is backwards compatible with CDH, and rolled in an updated fix or two for your convenience; here's a link to the diff. You may need to change the listed CDH versions to match your environment; I suggest you use the CDH Maven Repository documentation pages.
03-30-2017
10:42 AM
@Praveen Singh These processors just replicate whatever you could do with piping together bash commands, more or less - so if you want to use an unsecured cleartext password in the processor you could try sshpass or other shell plugins to enable it.
Basically if you can solve the problem by piping together bash commands, you should be able to do it in the processor. But I wouldn't recommend using cleartext passwords, keys are far more secure.
03-29-2017
07:36 AM
1 Kudo
Are you sure this environment variable is set for the NiFi user, and not just for the user you are ssh'd in as?
A test for this would be to invoke a common system variable like USER and see what you get.