Member since: 11-16-2015
Posts: 905
Kudos Received: 665
Solutions: 249
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 425 | 09-30-2025 05:23 AM |
| | 756 | 06-26-2025 01:21 PM |
| | 649 | 06-19-2025 02:48 PM |
| | 844 | 05-30-2025 01:53 PM |
| | 11359 | 02-22-2024 12:38 PM |
10-03-2016
06:23 PM
I've left a possible solution as a separate answer. Doing all the processing with a Python script is not ideal, as you'd need your own Hadoop/Hive client libraries, and all you'd use NiFi for is executing the external Python script. However, if you just need some custom processing during the flow, you can use ExecuteScript (link in my other answer) with Jython; I have some examples on my blog.
10-03-2016
06:21 PM
Wherever the error happens in the flow (sounds like PutHDFS in your example), there is likely a "failure" relationship (or something of the kind) for that processor. You can route failed flow files to a separate branch, where you can perform your error handling. For your example, you can have PutHDFS route "failure" to an UpdateAttribute that sets some attribute like "status" to "error", and PutHDFS could route "success" to an UpdateAttribute that sets "status" to "success".

Assuming your Hive table is created atop CSV files, at this point you could route both branches to a ReplaceText that creates a comma-separated line with the values, using Expression Language to get the date, filename, and the value of the status attribute, so something like: ${now()},${filename},${status}

You should avoid having small files in HDFS, so you wouldn't want to write each individual line as its own file. Instead, consider the MergeContent processor to concatenate many rows together, then use PutHDFS to stage the larger file in Hadoop for use by Hive. If MergeContent et al. don't give you the file(s) you need, you can always use an ExecuteScript processor for any custom processing. If your Hive table expects Avro or ORC format for the files, there are processors for those conversions as well (although you may have to convert to an intermediate format such as JSON first; see the documentation for details).
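As a rough illustration of what the ReplaceText step produces, here is a Python sketch of the same line-building logic. In NiFi itself this is done with Expression Language, not Python, and the example filename and status values below are assumptions:

```python
from datetime import datetime

def status_line(filename, status):
    # Mirrors an Expression Language template of the shape
    # ${now()},<filename attribute>,${status}: a timestamp, the flow file's
    # filename, and the success/error status set earlier in the flow.
    return "{},{},{}".format(datetime.now().isoformat(), filename, status)

# One CSV row per flow file; MergeContent would later concatenate many of these
# into a single larger file before PutHDFS writes it out.
line = status_line("input_001.csv", "success")
```

Each flow file contributes one such row, which is why MergeContent is the right next step: Hive is much happier querying a few large files than thousands of one-line ones.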
10-03-2016
06:05 PM
It appears your URL has been scrubbed (which is fine); can you find the character at (0-based) index 73? The URL above looks OK (underscores, semicolons, the at symbol, etc. should all be recognized). Also, if you are using the default database, try explicitly putting 'default' in the URL, so jdbc:hive2://host.name.net:10000/default;principal=hive/_HOST@EXAMPLE.COM. You might also try adding "auth=KERBEROS" to the URL parameters, although I don't think that's required (setting the principal should be all that's needed).
09-29-2016
01:51 PM
1 Kudo
session.create() will create a new flow file; it won't use an incoming one. For that you will want session.get(), which returns a flow file (or None). If you require an input flow file, be sure to check for None and only continue processing if the flow file is not None. There is an example of this on my blog (same site as above, but a different post).
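The pattern looks like the sketch below. Inside NiFi's ExecuteScript (with Jython), `session` and `REL_SUCCESS` are provided by the processor; the small stand-in class here exists only so the control flow can run outside NiFi and is not part of any real API:

```python
class StandInSession:
    """Minimal stand-in for the NiFi ProcessSession seen by ExecuteScript."""

    def __init__(self, queue):
        self.queue = list(queue)
        self.transferred = []

    def get(self):
        # Like NiFi's session.get(): returns the next flow file from the
        # incoming queue, or None if the queue is empty.
        return self.queue.pop(0) if self.queue else None

    def transfer(self, flow_file, relationship):
        self.transferred.append((flow_file, relationship))

REL_SUCCESS = "success"
session = StandInSession(["flowfile-1"])

# The actual ExecuteScript body pattern:
flow_file = session.get()
if flow_file is not None:   # only proceed when an input flow file exists
    session.transfer(flow_file, REL_SUCCESS)
```

Skipping the `None` check is the most common cause of NullPointerException-style errors in ExecuteScript, since the processor can be triggered with an empty incoming queue.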
09-26-2016
06:19 PM
1 Kudo
You could use ListFile in a separate flow; it keeps track of the files it has listed so far, so if your ExecuteStreamCommand generates more files in the specified location(s), only the new files will be listed the next time ListFile runs. ListFile can then be routed to FetchFile to get the contents of the new files, etc.
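Conceptually, ListFile's state tracking behaves like this sketch (NiFi stores the state internally; the `seen` set here is just a stand-in for that state):

```python
import os

def list_new_files(directory, seen):
    # Return only files not listed on a previous run, then remember them,
    # mimicking how ListFile only emits files it hasn't seen before.
    new_files = sorted(f for f in os.listdir(directory) if f not in seen)
    seen.update(new_files)
    return new_files
```

On the first run everything is "new"; on later runs only files created since the last run are returned, which is what lets FetchFile downstream pick up just the new content.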
09-22-2016
08:31 PM
3 Kudos
The Hive JDBC driver is included with the Hive processors. It appears your connection URL has "!connect" at the front when it should instead start with the "jdbc:hive2" prefix; removing that should fix the issue.
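In other words, "!connect" is a Beeline shell command, not part of the JDBC URL itself. A sketch of the fix, using a hypothetical pasted value:

```python
# Hypothetical value pasted from a Beeline session, not a real URL.
url = "!connect jdbc:hive2://host.name.net:10000/default"

# Strip the Beeline command prefix so only the JDBC URL remains.
if url.startswith("!connect"):
    url = url[len("!connect"):].lstrip()
```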
09-22-2016
07:25 PM
1 Kudo
I'm working on a processor to do this kind of thing: https://issues.apache.org/jira/browse/NIFI-2735
09-22-2016
05:29 PM
2 Kudos
The RouteOnAttribute processor is what you're looking for: you can match on an attribute value (for example) and route matches to a "matched" relationship; auto-terminating the "unmatched" relationship would cause those FlowFiles to be discarded/ignored. Alternatively, if you want to do any error handling, you could route unmatched flow files to another processor to log or otherwise handle them.
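The routing behavior amounts to something like the sketch below. The attribute name and values are hypothetical; in NiFi the match is an Expression Language condition configured on RouteOnAttribute:

```python
def route_on_attribute(flow_files, attribute, expected):
    # Split flow files into "matched"/"unmatched" by an attribute's value,
    # like RouteOnAttribute's relationships.
    matched, unmatched = [], []
    for ff in flow_files:
        (matched if ff.get(attribute) == expected else unmatched).append(ff)
    return matched, unmatched

# Flow files represented as attribute maps (hypothetical attribute "type").
flows = [{"type": "csv"}, {"type": "json"}, {"type": "csv"}]
matched, unmatched = route_on_attribute(flows, "type", "csv")
```

Auto-terminating "unmatched" corresponds to simply dropping the second list; routing it onward corresponds to the error-handling branch.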
09-22-2016
02:55 PM
3 Kudos
There are a couple of options:

1) If you want one SQL query per parameter, you can use ListFile/FetchFile (or GetFile if you want to repeatedly fetch the config file) to retrieve the configuration file, then SplitText to split each line (so one parameter per flow file), then ExtractText to get the name and value of the parameter, then ReplaceText to build a SQL query using Expression Language and the name of the parameter (which will fill in the value), such as the example statement you have above.

2) If you want to build a single statement with possibly multiple parameters, you could use ExecuteScript (if you are comfortable writing code in Groovy, Jython, JRuby, JavaScript, or Lua) to read in the configuration file, split the lines to build a map of parameter names to values, then write out a SQL statement with the names and/or values as you have done above.
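Option 2 might look like this sketch: parse `name=value` lines from the configuration file into a map, then format one SQL statement from it. The table name, parameter names, and config format are all assumptions:

```python
def build_query(config_text, table="my_table"):
    # Build a map of parameter names to values from "name=value" lines.
    params = dict(
        line.split("=", 1) for line in config_text.splitlines() if "=" in line
    )
    # Write out a single SQL statement using all of the parameters.
    where = " AND ".join(
        "{} = '{}'".format(name, value) for name, value in sorted(params.items())
    )
    return "SELECT * FROM {} WHERE {}".format(table, where)

query = build_query("region=EMEA\nyear=2016\n")
```

String concatenation is fine for a sketch, but for a real flow you'd want parameterized statements (e.g. PutSQL with `sql.args.N.value` attributes) rather than splicing values into the SQL text.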
09-19-2016
09:55 PM
1 Kudo
This is a known issue with the version of Hive (1.2.1) currently packaged with NiFi: https://issues.apache.org/jira/browse/NIFI-2575, caused by https://issues.apache.org/jira/browse/HIVE-11581. The workaround is to not use ZooKeeper for service discovery.