Member since: 11-16-2015
Posts: 905
Kudos Received: 665
Solutions: 249
My Accepted Solutions (titles not captured in this export)
| Title | Views | Posted |
|---|---|---|
| (title not captured) | 418 | 09-30-2025 05:23 AM |
| (title not captured) | 737 | 06-26-2025 01:21 PM |
| (title not captured) | 631 | 06-19-2025 02:48 PM |
| (title not captured) | 838 | 05-30-2025 01:53 PM |
| (title not captured) | 11336 | 02-22-2024 12:38 PM |
11-14-2017
11:36 PM
You don't need your own sys.path.append calls; you can put the directories in a comma-separated list in the Module Directory property of ExecuteScript, and it will call sys.path.append for you. However, because it is Jython, if any of the imports (or any of their dependencies) are native CPython modules, you won't be able to use them in ExecuteScript: all scripts, modules, and dependencies must be pure Python. For your exact error, I'd have to see the script (where is "module" defined?), but I suspect that one of these libraries is not pure Python.
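To illustrate what the Module Directory property effectively does under the hood, here is a minimal standalone sketch (not NiFi code; the directory and module name are made up for the example):

```python
import os
import sys
import tempfile
import textwrap

# Create a throwaway directory containing a pure-Python module,
# standing in for one entry of ExecuteScript's Module Directory property.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "mymodule.py"), "w") as f:
    f.write(textwrap.dedent("""
        def greet(name):
            return "Hello, " + name
    """))

# ExecuteScript effectively does this for each comma-separated entry:
sys.path.append(module_dir)

import mymodule  # works because mymodule is pure Python
print(mymodule.greet("NiFi"))
```

A module with native CPython extensions (e.g. anything built on C libraries) would fail at the import step under Jython even with the path appended, which is the failure mode described above.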
11-14-2017
11:32 PM
Also, depending on what your stored procedure looks like, you may be able to use ExecuteSQL or PutSQL. However, they do not support setting output parameters, and I'm not sure whether they support input parameters. But if your procedure call is hard-coded: if it returns a ResultSet, ExecuteSQL should work; if it doesn't, PutSQL should. Otherwise, the above answer is the best bet.
11-14-2017
02:06 PM
1 Kudo
I'm not familiar with the innards of either Groovy or Jython, but I am guessing that Jython is slower for the following reasons:
1) Groovy was built "for the JVM" and leverages/integrates with Java more cleanly.
2) Jython is an implementation of Python for the JVM. Looking briefly at the code, it appears to go back and forth between the Java and Python idioms, so it is more "emulated" than Groovy.
3) Apache Groovy has a large, very active community that consistently works to improve the performance of the code, both compiled and interpreted.
In my own experience, Groovy and JavaScript (Nashorn) perform much better in the scripted processors than Jython or JRuby. If you choose Jython, there are still a couple of things you can do to improve performance:
- Use InvokeScriptedProcessor (ISP) instead of ExecuteScript. ISP is faster because it loads the script only once and then invokes methods on it, whereas ExecuteScript re-evaluates the script on each trigger. I have an ISP template in Jython which should make porting your ExecuteScript code easier.
- Use ExecuteStreamCommand with command-line Python instead. You won't have the flexibility of accessing attributes, processor state, etc., but if you're just transforming content, you should find ExecuteStreamCommand with Python faster.
- No matter which language you choose, you can often improve performance by using session.get(int) instead of session.get(). If there are a lot of flow files in the queue, you could call session.get(1000) and process up to 1000 flow files per execution. If your script has a lot of per-execution overhead, handling multiple flow files per execution can significantly improve performance.
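A rough standalone analogy for the ISP-vs-ExecuteScript difference (this is not NiFi code; the script body and iteration count are made up): re-parsing a script on every trigger pays compilation cost each time, while compiling once and re-invoking does not.

```python
import timeit

# A stand-in for a processor script body.
script_source = "result = sum(range(100))"

# ExecuteScript-style: the script source is re-compiled and
# re-evaluated on every trigger.
def evaluate_each_time():
    exec(compile(script_source, "<script>", "exec"), {})

# InvokeScriptedProcessor-style: compile once up front, then just
# invoke the compiled object repeatedly.
compiled = compile(script_source, "<script>", "exec")
def invoke_precompiled():
    exec(compiled, {})

per_eval = timeit.timeit(evaluate_each_time, number=10_000)
per_invoke = timeit.timeit(invoke_precompiled, number=10_000)
print(f"re-evaluate: {per_eval:.3f}s, precompiled: {per_invoke:.3f}s")
```

The real ISP gains are larger than this toy suggests, since a full NiFi script engine initialization is far more expensive than a one-line compile.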
11-13-2017
03:23 PM
What do you mean by "add to this JSON a file that I get from FetchFTP"? Is the file you're fetching a JSON file, and you want to add fields to it? Are you Base64 encoding just the JSON from the attributes, or the entire file after adding to it?

If the incoming file (from FTP) is JSON and you can get your attributes added to that flow file, then (as of NiFi 1.2.0 / HDF 3.0) you can use JoltTransformJSON to inject your individual attributes as fields into your JSON document (instead of AttributesToJSON).

If you have too many attributes for that, your options are a bit more limited. In NiFi 1.3.0, you can use UpdateRecord to add the JSON from an attribute into a field in the other JSON document. You can also do this manually with ReplaceText. However, one of the two JSON objects must be in an attribute: whichever of the two (from AttributesToJSON or FetchFTP) is smaller, you can get that object first and use ExtractText to put the whole thing into an attribute. Note that attributes have limited size and increase memory usage, so beware of large JSON objects in attributes. If one of them does fit in an attribute, you can then use UpdateRecord or ReplaceText as described.

If you only need to encode one of the JSON objects: if it is in an attribute, you can use UpdateAttribute with the base64Encode Expression Language function; if it is in the content, you can use the Base64EncodeContent processor.
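A minimal standalone sketch of the merge-then-encode idea, outside NiFi (the document fields and attribute names are hypothetical, and the merge step stands in for what UpdateRecord or ReplaceText would be configured to do):

```python
import base64
import json

# JSON document as fetched (e.g., by FetchFTP); contents are made up.
document = {"id": 123, "name": "example"}

# JSON built from flow file attributes (e.g., by AttributesToJSON).
attribute_json = '{"source": "ftp", "ingest.time": "2017-11-13"}'

# Inject the attribute-derived fields into the document, as an
# UpdateRecord or ReplaceText configuration would do inside NiFi.
document["metadata"] = json.loads(attribute_json)

# Base64-encode the merged document, as Base64EncodeContent would.
encoded = base64.b64encode(json.dumps(document).encode("utf-8")).decode("ascii")
print(encoded)
```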
11-10-2017
03:23 PM
That's the thing, I'm not sure any NAR version will work. Are you using Apache Hive 2.3.0 or a vendor-specific version?
11-09-2017
03:06 PM
1 Kudo
The Hive processors in Apache NiFi 1.4.0 are built against Apache Hive 1.2.1, so they are not guaranteed to work with Apache Hive 2.3.0. If you are using the HDP platform, it has a Hive version based on 1.2.x but closer to 2.0. Apache NiFi's Hive processors are not compatible with HDP 2.5+, so you will likely want to use the NiFi-only Hortonworks Data Flow (HDF) package; that version of NiFi is built against the HDP Hive 1.2.x version. Having said that, HDF NiFi might work for your case, but it is also not guaranteed / supported to work against Hive 2.3.0 (whether Apache Hive or HDP Hive), as the baseline is still 1.2.x. The currently supported configuration is HDF NiFi against HDP Hive 1.2.x.
11-09-2017
02:18 PM
1 Kudo
If you can't set that property on the JDBC URL, then as of Apache NiFi 1.2.0 (HDF NiFi 3.0.x), due to NIFI-3426, you can add user-defined properties that will be passed to the connection. A list of these properties is available here, and includes the one you mention.
11-09-2017
01:59 PM
It looks like the Solr service wants an array of JSON objects, but you have a single JSON object; it also expects the new sharecount value to be nested in a "set" field under sharecount (Solr's atomic-update syntax). If you know the attributes you need and there aren't very many of them (for example, the two you mention), you can use ReplaceText instead of AttributesToJSON, using Expression Language to hand-craft the JSON array. The replacement text might look like this: [{"url": "${url}", "sharecount": {"set": ${sharecount}}}]
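As a quick sanity check that the replacement text above yields well-formed JSON in the shape Solr expects, here is a standalone sketch with the Expression Language references substituted by hand (the attribute values are made up):

```python
import json

# Hypothetical flow file attribute values that ${url} and
# ${sharecount} would resolve to in NiFi.
attributes = {"url": "http://example.com/page", "sharecount": "42"}

# The ReplaceText replacement string, with the EL references
# substituted manually for this illustration.
replacement = '[{"url": "%s", "sharecount": {"set": %s}}]' % (
    attributes["url"],
    attributes["sharecount"],
)

# Confirm it parses as an array of objects with a nested "set" field.
payload = json.loads(replacement)
print(payload)
```

Note that ${sharecount} is substituted unquoted, so it ends up as a JSON number rather than a string.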
11-07-2017
06:36 PM
1 Kudo
These comments are spot-on, thanks! I'd also mention that if you want to customize the fetch dynamically with incoming flow files, an alternative is to send your flow into GenerateTableFetch (on the primary node only, so your most upstream processor(s) will need to run on the primary node only).

GenerateTableFetch (GTF) is like QueryDatabaseTable (QDT), with the big differences being 1) GTF accepts incoming flow files, and 2) QDT executes the SQL it generates internally, whereas GTF sends the SQL out as flow files so some other processor (ExecuteSQL, e.g.) can execute it.

This can be used by sending the SQL output from GTF to a Remote Process Group (RPG) pointed at an Input Port on the same cluster. The RPG -> Input Port pattern distributes the flow files among the nodes in the cluster, rather than having every node work on the same data (which leads to data duplication, as @Abdelkrim Hadjidj mentions above). Downstream from the Input Port, all nodes process their subset of the flow files in parallel, so you can send Input Port -> ExecuteSQL. This flow is basically a parallel, distributed version of what QueryDatabaseTable does on one node.
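A toy standalone sketch of the distribution idea (not NiFi code; table name, page size, and node names are made up, and round-robin is a simplification of how site-to-site balances load): GTF-style paging splits one large fetch into page-sized SQL statements, and the RPG -> Input Port hop deals those statements out across the nodes so each executes only its subset.

```python
# Simulate GenerateTableFetch producing paged SQL statements, then
# distributing them across cluster nodes as the RPG -> Input Port
# pattern does. All names and sizes here are hypothetical.
row_count = 10_000
page_size = 2_000
nodes = ["node1", "node2", "node3"]

statements = [
    f"SELECT * FROM mytable ORDER BY id LIMIT {page_size} OFFSET {offset}"
    for offset in range(0, row_count, page_size)
]

# Round-robin assignment: each node runs ExecuteSQL on its own subset,
# in parallel with the others, instead of every node fetching everything.
assignment = {node: [] for node in nodes}
for i, sql in enumerate(statements):
    assignment[nodes[i % len(nodes)]].append(sql)

for node, stmts in assignment.items():
    print(node, len(stmts))
```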
11-07-2017
06:02 PM
1 Kudo
Although you may not see the same performance gains from ISP using Jython as you would by using Groovy (Jython is slower in general), this is still a good idea, so I revisited my blog post and created an ISP template in Jython. Please let me know if it works for you!