I also found this and other answers from Matt very useful. Thank you.

I have a legacy collection of Python scripts that processes binary files along a workflow, and I would like to speed it up considerably. I am curious about the possibility of doing this on a compute cluster with NiFi: I would like to use these scripts to extract data from the binary files and then ingest the data into a suitable store such as Spark SQL or Hive for machine learning. Right now the ExecuteScript processor looks like the most suitable place to run my Python scripts, but after some time working with it I am still facing various challenges in getting things to work.

Some particulars:

- My scripts use Python 2.
- I installed the Python libraries into the HWX HDP sandbox environment using Anaconda (/usr/local/anaconda2/).
- I normally run my code inside a virtualenv, so that I can maintain my project's Python dependencies without affecting those needed by the operating system.
- I need to import various Python modules for my scripts to work, and I am not sure that all of them are pure Python. Here is what I typically import: commands, database, datetime, java.io, java.nio.charset, math, multiprocessing, mysql.connector, MySQLdb, numpy, org.apache.commons.io, org.apache.nifi.processor.io, os, pstats, scipy, sys, time, and there may be more coming.

My questions:

1. Is there an additional overhead/speed penalty (and how much) when running Python scripts through the Jython engine in the ExecuteScript processor, as opposed to running them natively outside NiFi?
2. Are there any plans for a non-Jython engine to execute Python scripts, or any other workarounds for cases where non-pure-Python modules (such as numpy and scipy) need to be imported?
3. How can I make my Python scripts inside ExecuteScript use the correct version of Python?

Thank you!