About mburgess

mburgess · ‎10-03-2025

Welcome! In this series of articles I will address various topics about Cloudera Flow Management (powered by Apache NiFi) in terms of Best Practices: how, when, and why to use the powerful and flexible features of Cloudera Flow Management (CFM) / Apache NiFi. I will use the terms CFM and NiFi interchangeably but will specify when something is specific to Cloudera Flow Management and does not apply to Apache NiFi. Examples and screenshots will be taken from both CFM 2 (powered by Apache NiFi 1.x) and CFM 4 (powered by Apache NiFi 2.x) where appropriate, to illustrate the features and Best Practices that are common to both CFM 2 and 4 as well as those specific to CFM 4 / NiFi 2.

This Best Practices Cookbook will feature the following articles, with links for those that have been published. Please like/subscribe, as articles may be added and/or edited at various times. Also, if you have any ideas for Best Practices articles that may be helpful to the community, please leave them in the comments below. Happy Flowing!

Best Practices Cookbook

- Parameterize All The Things! (part 1 - Parameters and Parameter Contexts)
- Parameterize All The Things! (part 2 - Parameterizing Controller Service References)
- Parameterize All The Things! (part 3 - Parameter Providers)
- Flow Analysis Rules (enforcing Best Practices in Flow Design)
- Backpressure (Maximize throughput)
- Schema Drift (what to do when your source data structures change)
- Unlocking the power of Registry (Version Control, Flow Development Lifecycle, etc.)

mburgess · ‎10-03-2025

Do you need the script, or can you just configure a DBCPConnectionPool to point at the Impala JARs and the connection URL? Also, that version may have an ImpalaConnectionPool, which already includes the driver and makes it easier to configure a connection to Impala.

mburgess · ‎09-30-2025

You can use ExecuteGroovyScript with the following script:

    def ff = session.get()
    if (!ff) return

    def obj = new groovy.json.JsonSlurper().parse(ff.read())
    def outObj = []

    // Find updated records
    def old_ids = obj.old_orders.collect {it.order_id}
    def latest_ids = obj.latest_orders.collect {it.order_id}
    old_ids.intersect(latest_ids).each {order_id ->
        def update_order = obj.latest_orders.find {it.order_id == order_id}
        update_order.Action = 'UPDATE'
        outObj += update_order
    }

    // Find deleted records
    (old_ids - latest_ids).each {order_id ->
        def delete_order = obj.old_orders.find {it.order_id == order_id}
        delete_order.Action = 'DELETE'
        outObj += delete_order
    }

    // Find new records
    (latest_ids - old_ids).each {order_id ->
        def new_order = obj.latest_orders.find {it.order_id == order_id}
        new_order.Action = 'NEW'
        outObj += new_order
    }

    ff.write('UTF-8', groovy.json.JsonOutput.toJson(outObj))
    REL_SUCCESS << ff
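For anyone who wants to check the comparison logic outside NiFi, here is a rough Python sketch of the same three-way classification. The sample orders and the `qty` field are made up for illustration, and it assumes each `order_id` appears at most once per list. Note that, like the Groovy script, it marks every order present in both lists as UPDATE, whether or not its fields actually changed.

```python
import json


def classify_orders(doc):
    """Tag each order as UPDATE, DELETE, or NEW by comparing order_id values."""
    old_by_id = {o["order_id"]: o for o in doc["old_orders"]}
    latest_by_id = {o["order_id"]: o for o in doc["latest_orders"]}
    out = []

    # In both lists: emit the latest version, marked as an update
    for order_id in old_by_id:
        if order_id in latest_by_id:
            out.append({**latest_by_id[order_id], "Action": "UPDATE"})

    # Only in the old list: the order was deleted
    for order_id in old_by_id:
        if order_id not in latest_by_id:
            out.append({**old_by_id[order_id], "Action": "DELETE"})

    # Only in the latest list: the order is new
    for order_id in latest_by_id:
        if order_id not in old_by_id:
            out.append({**latest_by_id[order_id], "Action": "NEW"})

    return out


# Hypothetical sample input with the same top-level keys the script expects
sample = {
    "old_orders": [{"order_id": 1, "qty": 2}, {"order_id": 2, "qty": 1}],
    "latest_orders": [{"order_id": 2, "qty": 5}, {"order_id": 3, "qty": 7}],
}

result = classify_orders(sample)
print(json.dumps(result, indent=2))
```

With this sample, order 2 comes out as UPDATE, order 1 as DELETE, and order 3 as NEW.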

mburgess · ‎09-20-2025

If "Infer Schema" isn't working, this is likely a bug. Could you provide an example JSON and the error message that occurs during schema inference?

mburgess · ‎06-26-2025

If you know where the CSV files are on the filesystem and the condition is simple, you may be able to start with CSV file 1 and then use two LookupRecord processors in sequence with two CSVRecordLookupService controller services (pointing at CSV files 2 and 3 respectively). If that doesn't suit your needs, check out the ForkEnrichment and JoinEnrichment processors; they may be able to do what you need.
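To picture what the two lookups in sequence amount to, here is a small Python sketch of the idea. It is not how NiFi implements LookupRecord; the file contents, the `id` key, and the `city`/`team` columns are all hypothetical.

```python
import csv
import io

# Hypothetical contents of the three CSV files
FILE1 = "id,name\n1,alice\n2,bob\n"
FILE2 = "id,city\n1,Paris\n2,Oslo\n"
FILE3 = "id,team\n1,red\n"


def load_lookup(text, key, value):
    """Build a key -> value table from CSV text (stands in for a lookup service)."""
    return {row[key]: row[value] for row in csv.DictReader(io.StringIO(text))}


def enrich(records, lookup, key, new_field):
    """Add new_field to every record; None when the key has no match."""
    for rec in records:
        rec[new_field] = lookup.get(rec[key])
    return records


# Start with file 1, then apply one lookup per additional file
records = list(csv.DictReader(io.StringIO(FILE1)))
records = enrich(records, load_lookup(FILE2, "id", "city"), "id", "city")
records = enrich(records, load_lookup(FILE3, "id", "team"), "id", "team")

for rec in records:
    print(rec)
```

This behaves like a left join keyed on a single column, which is why the approach only fits when the join condition is simple.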

mburgess · ‎06-19-2025

Provenance/lineage is not currently visible from the Flow Designer. This is intentional, because the Flow Designer UI is for flow design regardless of whether there is a Test Session or deployment active. Provenance and lineage are associated with actual data running through a deployment, so to view them you'll need to navigate to the Cloudera Flow Management (NiFi) canvas from the deployment view once your flow has been deployed. From the canvas you can proceed as the video instructs, and hopefully it looks familiar at that point.

mburgess · ‎05-30-2025

The strategy "Matches Regular Expression" intends to match the entire line, but your regex only matches the first two characters. The regex "^(G;).*" will match the entire line.
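The difference between matching a prefix and matching the whole line can be seen in a quick Python check. This is only an analogy: `re.fullmatch` plays the role of a whole-line match here, and both the sample line and the shorter pattern `^G;` are assumptions, since the original regex is not quoted in this reply.

```python
import re

line = "G;12345;some value"  # hypothetical input line
prefix_only = r"^G;"         # assumed shape of the original regex
whole_line = r"^(G;).*"      # regex suggested above

# A whole-line match fails when the pattern stops after the first two characters
prefix_matches_line = re.fullmatch(prefix_only, line) is not None

# Adding .* lets the pattern consume the rest of the line
whole_matches_line = re.fullmatch(whole_line, line) is not None

print(prefix_matches_line, whole_matches_line)
```

The first check is False and the second is True, which mirrors why the route had no matches until the pattern covered the entire line.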

mburgess · ‎04-29-2025

With the addition of Python to Apache NiFi 2.0 and Cloudera Flow Management 4, developers now have a new way to rapidly develop processors and can leverage Python libraries for FlowFile processing, which is an exciting new feature. But what about NiFi’s scripting capabilities using Jython? In this post I will explore the history of Jython and Python in the Apache NiFi project, their capabilities, and the current state of Python and Jython in both Apache NiFi 2 and Cloudera Flow Management 4.

NiFi’s Scripting Capabilities:

Apache NiFi 0.5.0 introduced the ExecuteScript processor, allowing developers to interact with the NiFi SDK using several JVM-based scripting languages such as Groovy, JavaScript, Lua, Clojure, and Jython, a JVM-based implementation of Python. The Jython language is compatible with Python 2.x, so a script written in pure Python for that version should work in ExecuteScript using Jython.

“Pure” Python means the entire code is written in the Python language rather than in other languages that Python has bindings for, such as C. This distinction is important because many popular Python libraries such as pandas and scikit include components written in C and are NOT pure Python. This means that even though you can write a Python program that imports the pandas library (for example), that program/script will not run in Jython. This limitation significantly narrows the scope of what you can accomplish with Jython in ExecuteScript.

A workaround is to use the ExecuteStreamCommand processor to run a Python script file using the actual Python interpreter. However, this method comes with several important constraints:

- The script can only receive FlowFile content via standard input (stdin) and must return FlowFile content via standard output (stdout).
- You have to bring your own Python interpreter.
- Access to the rest of the NiFi processor API, including FlowFile attributes, creating multiple FlowFiles, and multiple commits to the session, is not available via ExecuteStreamCommand, which can severely limit the operations you can perform on your FlowFiles.

Because of these limitations, ExecuteStreamCommand should be your last resort for executing Python programs, used only when there is no other way to achieve the goal with Jython or native Python processors in Cloudera Flow Management 4.

Jython support in scripting components was deprecated in Apache NiFi 1.x and fully removed in NiFi 2.0, so flows that use Jython scripts in the NiFi 1 scripting components cannot be migrated as-is to Apache NiFi 2.x. Cloudera Flow Management 4, however, has restored Jython support, ensuring continued compatibility for existing flows.

Python in Apache NiFi 2.0:

The generally available (GA) release of Apache NiFi 2.0 in November 2024 introduced a number of new features not present in the NiFi 1.x line. Perhaps the most notable is the addition of the Python Processor Software Development Kit (SDK). The Python SDK in NiFi allows developers to write processors in Python and leverage CPython libraries such as pandas and scikit. Given Python’s popularity for Artificial Intelligence (AI) applications, the Python SDK in NiFi enables users to create processors that can use AI libraries to do data processing. NiFi 2.0 comes with a handful of processors written in Python, including ones for data parsing and VectorDB integrations with systems like Chroma, Pinecone, and Qdrant, commonly used for Retrieval Augmented Generation (RAG) use cases involving document processing, storing, and semantic search.
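Returning briefly to the ExecuteStreamCommand workaround from the scripting section: the stdin/stdout constraint means such a script is essentially a filter. The sketch below shows that shape under stated assumptions; the upper-casing `transform` function is a placeholder, and the in-memory streams are used only so the example can be run outside NiFi.

```python
import io
import sys


def transform(text):
    """Placeholder transformation: upper-case the FlowFile content."""
    return text.upper()


def run(stdin, stdout):
    """Read all content from stdin, transform it, and write it to stdout."""
    stdout.write(transform(stdin.read()))


# Local demonstration with in-memory streams. When invoked by
# ExecuteStreamCommand, the script would call run(sys.stdin, sys.stdout)
# instead, since stdin and stdout are its only channels for content.
out = io.StringIO()
run(io.StringIO("hello flowfile\n"), out)
print(out.getvalue(), end="")
```

Everything the script knows about the FlowFile arrives through that one stream, which is the limitation the article warns about.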
Cloudera Flow Management 4 adds many additional Python processors that are not available in open-source Apache NiFi 2.0, including:

- Chunk/EmbedData: Allows the user to chunk their unstructured data, then embed the data into vectors
- PromptBedrock: Provides integration with Amazon Bedrock
- LexicalQuery/InsertTo/VectorQueryMilvus: Provides integration to/from the Milvus vector database
- PromptChatGPT: Allows the user to send prompts to ChatGPT and process the results
- PartitionText/Csv/Docx/Pdf/Html: Offers a wider range of structured and unstructured document formats that can be partitioned for later chunking and embedding, increasing the quality and relevance of prompt responses

These processors rely on CPython libraries and cannot be implemented in Jython.

Jython and Python in Cloudera Flow Management 4:

Because Jython is based on Python 2.x and is no longer actively maintained, the Apache NiFi community removed the Jython library from the NiFi ARchive (NAR) containing the scripting components. The upcoming GA release of Cloudera Flow Management 4 adds Jython back into the scripting NAR. This allows Cloudera customers with existing Jython scripts to continue running these scripts without modification.

This is just one example of how Cloudera supports backward compatibility for NiFi flows. Many components removed from NiFi 2.0 are retained in Cloudera Flow Management 4, such as Hive, HBase, Atlas, Ranger, Couchbase 2, Cassandra 3, Kafka 2.6, and Kudu. This ensures that Cloudera customers who rely on these features can upgrade without being forced to make unnecessary changes to their flows.

The NiFi Python SDK:

Although the introduction of the Python SDK enables developers to implement processors in Python, the SDK was designed for simplicity and does not allow access to the full NiFi Processor APIs. There are limitations that must be considered when designing a processor:

- Python processors do not have access to the ProcessSession, so they cannot create multiple FlowFiles, perform multiple session commits or rollbacks, merge FlowFiles, etc.
- Only three types of processors are currently supported in the Python SDK: FlowFileSource, FlowFileTransform, and RecordTransform.
  - A FlowFileSource processor does not allow incoming connections and expects a single FlowFile as output. This can be used as a source processor to generate sample data, for example.
  - A FlowFileTransform processor receives a single incoming FlowFile and expects a single FlowFile as output. This would be used when modifying the entire content of a FlowFile, such as converting binary image data from JPG to PNG.
  - A RecordTransform processor also receives a single incoming FlowFile and expects a single FlowFile as output, but it can process individual records in the incoming data via a configured RecordReader controller service. This would be used to transform each record individually, such as changing the value of a field in each record.

These limitations will affect users in several ways. First, only NiFi Processors can be scripted in Python, not Controller Services or Reporting Tasks. Next, a Python processor can work with at most one FlowFile per execution. This limits what a processor can do in terms of creating/cloning multiple FlowFiles, ingesting a batch of FlowFiles, and handling lifecycle events such as onPropertyModified().

When to use Jython or Python?

Choosing between Jython (via the scripted components such as ExecuteScript) or the Python SDK for processors in Cloudera Flow Management 4 depends on your use case and requirements.
Here is a quick guide to help with decision making:

| Use Jython when | Use Python when |
| --- | --- |
| You already have existing Jython scripts in your flows | You want to import a CPython library such as pandas |
| You want to implement scripted controller services such as ScriptedRecordReader | You are more familiar with Python than other supported scripting languages |
| You need control over multiple session transactions (i.e. commits/rollbacks) and/or FlowFiles | You only need to work with one FlowFile at a time |

Jython and Python capabilities in Cloudera Flow Management offer several ways to use the Python programming language and its libraries to rapidly develop NiFi components. Leveraging Jython in scripted components offers a rich set of features for developers more comfortable with the Python language, and using Python libraries to develop NiFi processors enables more integrations with external systems such as GenAI applications. The combination of these in Cloudera Flow Management presents a powerful environment in which to enhance and enrich your flows to achieve a higher level of business success.

Learn More:

To explore the new capabilities of Cloudera Flow Management 4 and discover how it can transform your data pipelines with native Python support, watch our webinar, The Five Things You Need To Know about Unlocking The Power of NiFi 2.0. For a full list of connectors in Cloudera Flow Management, visit the Cloudera Connectors Library.
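As a rough illustration of the FlowFileTransform processor type described in the Python SDK section above, here is a minimal sketch of what such a processor can look like. The class name `UppercaseContent`, its behavior, and the `uppercased` attribute are hypothetical. The `except ImportError` branch defines stand-ins that are NOT the real API; they exist only so the sketch can be loaded outside a NiFi 2 / CFM 4 Python runtime.

```python
try:
    # Provided by the NiFi 2 / CFM 4 Python runtime
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Local stand-ins (assumption, not the real API) for running outside NiFi
    class FlowFileTransform:
        pass

    class FlowFileTransformResult:
        def __init__(self, relationship, contents=None, attributes=None):
            self.relationship = relationship
            self.contents = contents
            self.attributes = attributes


class UppercaseContent(FlowFileTransform):
    """Hypothetical processor: one FlowFile in, one FlowFile out."""

    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        description = "Upper-cases the content of the incoming FlowFile."

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        # There is no ProcessSession here: only the single incoming FlowFile
        text = flowfile.getContentsAsBytes().decode("utf-8")
        return FlowFileTransformResult(
            relationship="success",
            contents=text.upper(),
            attributes={"uppercased": "true"},
        )
```

The `transform` method returns exactly one result for the one FlowFile it was given, which is the single-FlowFile limitation the article describes.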

mburgess · ‎03-21-2025

This may be related to https://issues.apache.org/jira/browse/NIFI-11783 which was fixed in CFM 2.1.6. Is upgrading your CFM an option?

mburgess · ‎09-12-2024

If you are only building a custom NAR, you should be able to use one of the Gradle plugins from the list that was posted here. Building all of NiFi as a Gradle project may not be quite as feasible, but I believe a 'gradle init' can (almost?) convert the project into a Gradle build. You may also want to look at Gradle plugins that are similar to the Maven plugins we use for the project, such as those for identifying vulnerabilities in dependencies, checkstyle (if you have such a requirement), etc.

Last Visited	‎03-13-2026 10:31 AM

Member Since	‎11-16-2015 02:21 PM
Posts	911
Kudos received	664

Cloudera Community

Re: Compare data within the JSON using NIFI

Re: how to join three csv files like sql on condit...

Re: How to see the Data Provenance and Lineage in ...

Re: Apache NiFi - RouteText has no matches

Re: Nifi Building error when creating a brand new ...

Cloudera Flow Management / Apache NiFi Best Practi...

Re: Nifi Unable to use Impala connection in the ex...

Re: Trouble Indexing data to elasticsearch using N...

Python and Jython in Cloudera Flow Management

Re: The InvokeHttp Processor's 'Retry' option is c...

Re: Has anyone gotten Custom NiFi Processor Projec...