Member since: 11-16-2015
Posts: 905
Kudos Received: 665
Solutions: 249
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 425 | 09-30-2025 05:23 AM |
| | 756 | 06-26-2025 01:21 PM |
| | 649 | 06-19-2025 02:48 PM |
| | 844 | 05-30-2025 01:53 PM |
| | 11359 | 02-22-2024 12:38 PM |
10-03-2016
06:23 PM
I've left a possible solution as a separate answer. Doing all the processing with a Python script is not ideal, as you'd need your own Hadoop/Hive client libraries, and all you'd use NiFi for is executing the external Python script. However, if you just need some custom processing during the flow, you can use ExecuteScript (link in my other answer) with Jython; I have some examples on my blog.
10-03-2016
06:21 PM
Wherever the error happens in the flow (sounds like PutHDFS in your example), there is likely a "failure" relationship (or something of the kind) for that processor. You can route failed flow files to a separate branch, where you can perform your error handling. For your example, you can have PutHDFS route "failure" to an UpdateAttribute that sets some attribute like "status" to "error", and PutHDFS could route "success" to an UpdateAttribute that sets "status" to "success".

Assuming your Hive table is created atop CSV files, at this point you could route both branches to a ReplaceText that creates a comma-separated line with the values, using Expression Language to get the date, filename, and the value of the status attribute, so something like: ${now()},${filename},${status}

You should avoid having small files in HDFS, so you wouldn't want to write each individual line as its own file. Instead, consider the MergeContent processor to concatenate many rows together, then use PutHDFS to stage the larger file in Hadoop for use by Hive. If MergeContent et al. don't give you the file(s) you need, you can always use an ExecuteScript processor for any custom processing. If your Hive table expects Avro or ORC format for the files, there are processors for those conversions as well (although you may have to convert to an intermediate format such as JSON first; see the documentation for details).
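As a rough illustration of what the ReplaceText step produces, here is a Python sketch of the same line-building logic. In NiFi itself this is done with Expression Language, not Python, and the example filename and status values below are assumptions:

```python
from datetime import datetime

def status_line(filename, status):
    # Mirrors an Expression Language template of the shape
    # ${now()},<filename attribute>,${status}: a timestamp, the flow file's
    # filename, and the success/error status set earlier in the flow.
    return "{},{},{}".format(datetime.now().isoformat(), filename, status)

# One CSV row per flow file; MergeContent would later concatenate many of these
# into a single larger file before PutHDFS writes it out.
line = status_line("input_001.csv", "success")
```

Each flow file contributes one such row, which is why MergeContent is the right next step: Hive is much happier querying a few large files than thousands of one-line ones.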
10-03-2016
06:05 PM
It appears your URL has been scrubbed (which is fine); can you find the character at (0-based) index 73? The URL above looks OK (underscores, semicolons, the at symbol, etc. should all be recognized). Also, if you are using the default database, try explicitly putting 'default' in the URL, so jdbc:hive2://host.name.net:10000/default;principal=hive/_HOST@EXAMPLE.COM. You might also try adding "auth=KERBEROS" to the URL parameters, although I don't think that's required (setting the principal should be all that's needed).
09-29-2016
01:51 PM
1 Kudo
session.create() will create a new flow file; it won't use an incoming one. For that you will want session.get(), which returns a flow file (or None). If you require an input flow file, be sure to check for None and only continue processing if the flow file is not None. There is an example of this on my blog (same site as above, but a different post).
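The pattern looks like the sketch below. Inside NiFi's ExecuteScript (with Jython), `session` and `REL_SUCCESS` are provided by the processor; the small stand-in class here exists only so the control flow can run outside NiFi and is not part of any real API:

```python
class StandInSession:
    """Minimal stand-in for the NiFi ProcessSession seen by ExecuteScript."""

    def __init__(self, queue):
        self.queue = list(queue)
        self.transferred = []

    def get(self):
        # Like NiFi's session.get(): returns the next flow file from the
        # incoming queue, or None if the queue is empty.
        return self.queue.pop(0) if self.queue else None

    def transfer(self, flow_file, relationship):
        self.transferred.append((flow_file, relationship))

REL_SUCCESS = "success"
session = StandInSession(["flowfile-1"])

# The actual ExecuteScript body pattern:
flow_file = session.get()
if flow_file is not None:   # only proceed when an input flow file exists
    session.transfer(flow_file, REL_SUCCESS)
```

Skipping the `None` check is the most common cause of NullPointerException-style errors in ExecuteScript, since the processor can be triggered with an empty incoming queue.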
09-26-2016
06:19 PM
1 Kudo
You could use ListFile in a separate flow; it keeps track of the files it has listed so far, so if your ExecuteStreamCommand generates more files in the specified location(s), only the new files will be listed the next time ListFile runs. ListFile can then be routed to FetchFile to get the contents of the new files, etc.
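Conceptually, ListFile's state tracking behaves like this sketch (NiFi stores the state internally; the `seen` set here is just a stand-in for that state):

```python
import os

def list_new_files(directory, seen):
    # Return only files not listed on a previous run, then remember them,
    # mimicking how ListFile only emits files it hasn't seen before.
    new_files = sorted(f for f in os.listdir(directory) if f not in seen)
    seen.update(new_files)
    return new_files
```

On the first run everything is "new"; on later runs only files created since the last run are returned, which is what lets FetchFile downstream pick up just the new content.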
09-22-2016
08:31 PM
3 Kudos
The Hive JDBC driver is included with the Hive processors. It appears your connection URL has "!connect" at the front when it should instead start with the "jdbc:hive2" prefix; removing that should fix the issue.
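In other words, "!connect" is a Beeline shell command, not part of the JDBC URL itself. A sketch of the fix, using a hypothetical pasted value:

```python
# Hypothetical value pasted from a Beeline session, not a real URL.
url = "!connect jdbc:hive2://host.name.net:10000/default"

# Strip the Beeline command prefix so only the JDBC URL remains.
if url.startswith("!connect"):
    url = url[len("!connect"):].lstrip()
```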
09-22-2016
07:25 PM
1 Kudo
I'm working on a processor to do this kind of thing: https://issues.apache.org/jira/browse/NIFI-2735
09-22-2016
05:29 PM
2 Kudos
The RouteOnAttribute processor is what you're looking for: you can match on an attribute value (for example) and route matches to a "matched" relationship; auto-terminating the "unmatched" relationship would cause those FlowFiles to be discarded/ignored. Alternatively, if you want to do any error handling, you could route unmatched flow files to another processor to log or otherwise handle them.
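The routing behavior amounts to something like the sketch below. The attribute name and values are hypothetical; in NiFi the match is an Expression Language condition configured on RouteOnAttribute:

```python
def route_on_attribute(flow_files, attribute, expected):
    # Split flow files into "matched"/"unmatched" by an attribute's value,
    # like RouteOnAttribute's relationships.
    matched, unmatched = [], []
    for ff in flow_files:
        (matched if ff.get(attribute) == expected else unmatched).append(ff)
    return matched, unmatched

# Flow files represented as attribute maps (hypothetical attribute "type").
flows = [{"type": "csv"}, {"type": "json"}, {"type": "csv"}]
matched, unmatched = route_on_attribute(flows, "type", "csv")
```

Auto-terminating "unmatched" corresponds to simply dropping the second list; routing it onward corresponds to the error-handling branch.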
09-22-2016
02:55 PM
3 Kudos
There are a couple of options:

1) If you want one SQL query per parameter, you can use ListFile/FetchFile (or GetFile if you want to repeatedly fetch the config file) to retrieve the configuration file, then SplitText to split each line (so one parameter per flow file), then ExtractText to get the name and value of the parameter, then ReplaceText to build a SQL query using Expression Language and the name of the parameter (which will fill in the value), such as the example statement you have above.

2) If you want to build a single statement with possibly multiple parameters, you could use ExecuteScript (if you are comfortable writing code in Groovy, Jython, JRuby, JavaScript, or Lua) to read in the configuration file, split the lines to build a map of parameter names to values, then write out a SQL statement with the names and/or values as you have done above.
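Option 2 might look like this sketch: parse `name=value` lines from the configuration file into a map, then format one SQL statement from it. The table name, parameter names, and config format are all assumptions:

```python
def build_query(config_text, table="my_table"):
    # Build a map of parameter names to values from "name=value" lines.
    params = dict(
        line.split("=", 1) for line in config_text.splitlines() if "=" in line
    )
    # Write out a single SQL statement using all of the parameters.
    where = " AND ".join(
        "{} = '{}'".format(name, value) for name, value in sorted(params.items())
    )
    return "SELECT * FROM {} WHERE {}".format(table, where)

query = build_query("region=EMEA\nyear=2016\n")
```

String concatenation is fine for a sketch, but for a real flow you'd want parameterized statements (e.g. PutSQL with `sql.args.N.value` attributes) rather than splicing values into the SQL text.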
09-19-2016
09:55 PM
1 Kudo
This is a known issue with the version of Hive (1.2.1) currently packaged with NiFi: https://issues.apache.org/jira/browse/NIFI-2575, caused by https://issues.apache.org/jira/browse/HIVE-11581. The workaround is to not use ZooKeeper for service discovery.