04-29-2025
12:37 PM
3 Kudos
With the addition of Python to Apache NiFi 2.0 and Cloudera Flow Management 4, developers now have a new way to rapidly develop processors and can leverage Python libraries for FlowFile processing. But what about NiFi’s scripting capabilities using Jython? In this post I will explore the history of Jython and Python in the Apache NiFi project, their capabilities, and the current state of Python and Jython in both Apache NiFi 2 and Cloudera Flow Management 4.

NiFi’s Scripting Capabilities

Apache NiFi 0.5.0 introduced the ExecuteScript processor, allowing developers to interact with the NiFi SDK using several JVM-based scripting languages such as Groovy, JavaScript, Lua, Clojure, and Jython, a JVM-based implementation of Python. The Jython language is compatible with Python, so any script you write in pure Python should work in ExecuteScript using Jython.

“Pure” Python means the entire code is written in the Python language rather than in other languages that Python has bindings for, such as C. This distinction is important because many popular Python libraries such as pandas and scikit-learn include components written in C and are NOT pure Python. Even though you can write a Python program that imports the pandas library (for example), that program will not run in Jython. This limitation significantly narrows the scope of what you can accomplish with Jython in ExecuteScript.

A workaround is to use the ExecuteStreamCommand processor to run a Python script file using the actual Python interpreter. However, this method comes with several important constraints:

- The script can only receive FlowFile content via standard input (stdin) and must return FlowFile content using standard output (stdout).
- You have to bring your own Python interpreter.
- Access to the rest of the NiFi processor API, including FlowFile attributes, creating multiple FlowFiles, and multiple commits to the session, is not available via ExecuteStreamCommand, which can severely limit the operations you can perform on your FlowFiles.

Because of these limitations, ExecuteStreamCommand should be your last resort for executing Python programs, used only when there is no other way to achieve the goal with Jython or native Python processors in Cloudera Flow Management 4.

Jython support in scripting components was deprecated in Apache NiFi 1.x and fully removed in NiFi 2.0. This means that if you have Jython scripts in the scripting components in NiFi 1, you will not be able to migrate them as-is to Apache NiFi 2.x. Cloudera Flow Management 4, however, has retained Jython support, ensuring continued compatibility for existing flows.

Python in Apache NiFi 2.0

The generally available (GA) release of Apache NiFi 2.0 in November 2024 introduced a number of new features not present in the NiFi 1.x line. Perhaps the most notable is the Python Processor Software Development Kit (SDK). The Python SDK allows developers to write processors in Python and leverage CPython libraries such as pandas and scikit-learn. Given Python’s popularity for Artificial Intelligence (AI) applications, the Python SDK enables users to create processors that use AI libraries for data processing.

NiFi 2.0 comes with a handful of processors written in Python, including ones for data parsing and for vector database integrations with systems like Chroma, Pinecone, and Qdrant, which are commonly used for Retrieval Augmented Generation (RAG) use cases involving document processing, storage, and semantic search.
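The stdin/stdout contract that ExecuteStreamCommand imposes, described earlier in this post, can be sketched with a small standalone script. This is only an illustration under stated assumptions: the `transform` function and its line-cleaning behaviour are hypothetical, and the script deliberately touches nothing but the FlowFile content, since attributes and the session are out of reach with this approach.

```python
import sys


def transform(content: str) -> str:
    """Hypothetical transformation: trim trailing whitespace, drop blank lines."""
    lines = (line.rstrip() for line in content.splitlines())
    return "".join(line + "\n" for line in lines if line)


def main(stdin=None, stdout=None) -> None:
    # The FlowFile content arrives on stdin; whatever is written to stdout
    # is what the script hands back as the FlowFile content.
    stdin = stdin if stdin is not None else sys.stdin
    stdout = stdout if stdout is not None else sys.stdout
    stdout.write(transform(stdin.read()))


# Only run when input is actually piped in, so an interactive run does not
# sit waiting on the terminal.
if __name__ == "__main__" and sys.stdin is not None and not sys.stdin.isatty():
    main()
```

Outside NiFi the same contract can be exercised from a shell, for example `python3 clean_lines.py < input.txt`, where `clean_lines.py` is a hypothetical file name for the script above.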
Cloudera Flow Management 4 adds many Python processors that are not available in open-source Apache NiFi 2.0, including:

- Chunk/EmbedData: Allows the user to chunk unstructured data and then embed the data into vectors
- PromptBedrock: Provides integration with Amazon Bedrock
- LexicalQuery/InsertTo/VectorQueryMilvus: Provides integration to and from the Milvus vector database
- PromptChatGPT: Allows the user to send prompts to ChatGPT and process the results
- PartitionText/Csv/Docx/Pdf/Html: Offers a wider range of structured and unstructured document formats that can be partitioned for later chunking and embedding, increasing the quality and relevance of prompt responses

These processors rely on CPython libraries and cannot be implemented in Jython.

Jython and Python in Cloudera Flow Management 4

Because Jython is based on Python 2.x and is no longer actively maintained, the Apache NiFi community removed the Jython library from the NiFi ARchive (NAR) containing the scripting components. The upcoming GA release of Cloudera Flow Management 4 adds Jython back into the scripting NAR. This allows Cloudera customers with existing Jython scripts to continue running these scripts without modification.

This is just one example of how Cloudera supports backward compatibility for NiFi flows. Many components removed from NiFi 2.0 are retained in Cloudera Flow Management 4, such as Hive, HBase, Atlas, Ranger, Couchbase 2, Cassandra 3, Kafka 2.6, and Kudu. This ensures that Cloudera customers who rely on these features can upgrade without being forced to make unnecessary changes to their flows.

The NiFi Python SDK

Although the introduction of the Python SDK enables developers to implement processors in Python, the SDK was designed for simplicity and does not allow access to the full NiFi Processor APIs.
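To make the shape of an SDK processor concrete, here is a minimal sketch of a processor that takes one FlowFile in and returns one FlowFile out. It follows the layout I understand from the Apache NiFi Python developer guide, so verify it against your NiFi version. The processor name `UppercaseContent`, its description, and the `uppercased` attribute are hypothetical. Inside NiFi the `nifiapi` package is supplied by the framework; the fallback stubs are not part of the SDK and exist only so the sketch can run standalone.

```python
# Inside NiFi the `nifiapi` package is supplied by the framework. The fallback
# stubs below are NOT part of the SDK; they only let this sketch run standalone.
try:
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    class FlowFileTransform:  # stand-in base class
        pass

    class FlowFileTransformResult:  # stand-in result holder
        def __init__(self, relationship, attributes=None, contents=None):
            self.relationship = relationship
            self.attributes = attributes
            # Assumption: string content is stored as UTF-8 bytes.
            self.contents = contents.encode("utf-8") if isinstance(contents, str) else contents

        def getRelationship(self):
            return self.relationship

        def getAttributes(self):
            return self.attributes

        def getContents(self):
            return self.contents


class UppercaseContent(FlowFileTransform):  # hypothetical processor name
    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        description = "Upper-cases the text content of each incoming FlowFile."

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        # One FlowFile in, one FlowFile out: no ProcessSession is involved.
        text = flowfile.getContentsAsBytes().decode("utf-8")
        return FlowFileTransformResult(
            relationship="success",
            contents=text.upper(),
            attributes={"uppercased": "true"},
        )
```

The processor never sees a session object: it reads the incoming content, and the returned result carries the outgoing relationship, content, and attributes. That is the simplicity trade-off this section describes.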
There are limitations that must be considered when designing a processor:

- Python processors do not have access to the ProcessSession, so they cannot create multiple FlowFiles, commit or roll back multiple session transactions, merge FlowFiles, etc.
- Only three types of processors are currently supported in the Python SDK: FlowFileSource, FlowFileTransform, and RecordTransform.
  - A FlowFileSource processor does not allow incoming connections and expects a single FlowFile as output. It can be used as a source processor, for example to generate sample data.
  - A FlowFileTransform processor receives a single incoming FlowFile and expects a single FlowFile as output. It would be used when modifying the entire content of a FlowFile, such as converting binary image data from JPG to PNG.
  - A RecordTransform processor also receives a single incoming FlowFile and expects a single FlowFile as output, but it can process individual records in the incoming data via a configured RecordReader controller service. It would be used to transform each record individually, such as changing the value of a field in each record.

These limitations will affect users in several ways. First, only NiFi Processors can be scripted in Python, not Controller Services or Reporting Tasks. Next, a Python processor can work with at most one FlowFile per execution. This limits what a processor can do in terms of creating or cloning multiple FlowFiles, ingesting a batch of FlowFiles, and handling lifecycle events such as onPropertyModified().

When to use Jython or Python?

Choosing between Jython (via the scripted components such as ExecuteScript) and the Python SDK for processors in Cloudera Flow Management 4 depends on your use case and requirements.
Here is a quick guide to help with decision making:

Use Jython when | Use Python when |
---|---|
You already have existing Jython scripts in your flows | You want to import a CPython library such as pandas |
You want to implement scripted controller services such as ScriptedRecordReader | You are more familiar with Python than other supported scripting languages |
You need control over multiple session transactions (i.e. commits/rollbacks) and/or FlowFiles | You only need to work with one FlowFile at a time |

Jython and Python capabilities in Cloudera Flow Management offer several ways to use the Python programming language and its libraries to rapidly develop NiFi components. Leveraging Jython in scripted components offers a rich set of features for developers who are more comfortable with the Python language, and using Python libraries to develop NiFi processors enables more integrations with external systems such as GenAI applications. The combination of these in Cloudera Flow Management presents a powerful environment in which to enhance and enrich your flows.

Learn More

To explore the new capabilities of Cloudera Flow Management 4 and discover how it can transform your data pipelines with native Python support, watch our webinar, The Five Things You Need To Know about Unlocking The Power of NiFi 2.0. For a full list of connectors in Cloudera Flow Management, visit the Cloudera Connectors Library.
03-21-2025
09:57 AM
This may be related to https://issues.apache.org/jira/browse/NIFI-11783 which was fixed in CFM 2.1.6. Is upgrading your CFM an option?
09-12-2024
05:43 AM
If you are only building a custom NAR, you should be able to use one of the Gradle plugins from the list that was posted here. Building all of NiFi as a Gradle project may not be quite as feasible, but I believe a 'gradle init' can (almost?) convert the project into a Gradle build. You may also want to look at Gradle plugins similar to the Maven plugins we use for the project, such as ones for identifying vulnerabilities in dependencies, checkstyle (if you have such a requirement), etc.
08-13-2024
07:34 AM
Is the incoming FlowFile an Avro file? Or is it JSON or something else?
04-11-2024
10:31 AM
1 Kudo
How is your ListSFTP processor configured? Can you increase the Entity Tracking Window to include the difference in timezones?
03-19-2024
12:58 PM
The post you linked to has the command-line version of the Groovy script; it's based on my blog post at https://funnifi.blogspot.com/2016/04/inspecting-your-nifi.html. You can put that code in your ExecuteScript (or in the onTrigger() of an InvokeScriptedProcessor). It might need slight alterations but should be pretty close. If you want to write the results to a FlowFile, you'll have to add that to the script as well.
02-23-2024
11:51 AM
1 Kudo
https://issues.apache.org/jira/browse/NIFI-12839
02-22-2024
12:38 PM
I think this is a bug; I'm looking into it. It looks like it's using your 1.0-SNAPSHOT version for the NiFi dependencies when it should be using nifiVersion. To get things going, use 2.0.0-M2 as your version for now; then you can go back into the helloworld module and change the version back to what you want.
06-13-2023
02:59 PM
1 Kudo
Tracking here: https://issues.apache.org/jira/browse/NIFI-11682