Member since: 11-16-2015 | Posts: 902 | Kudos Received: 664 | Solutions: 249
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 85 | 09-30-2025 05:23 AM |
 | 567 | 06-26-2025 01:21 PM |
 | 414 | 06-19-2025 02:48 PM |
 | 662 | 05-30-2025 01:53 PM |
 | 9607 | 02-22-2024 12:38 PM |
10-03-2025
04:00 PM
1 Kudo
This article is part of the Cloudera Flow Management / Apache NiFi Best Practices Cookbook series and discusses Best Practices for using Parameters and Parameter Contexts in your flows.

History

First, a little background about Parameters: in NiFi 1.x, the original solution for configuring flows for different environments (e.g., development vs. production) was Variables. This was a single, global key/value store of variable names and their values. Variable names could be used in Expression Language expressions as well as in certain processor properties that look specifically for a variable name. This single, global list was not flexible enough to solve many of our users' configuration challenges. To address this, Parameters were introduced. They are meant as a replacement for Variables and, via Parameter Contexts, are much more flexible and powerful due to inheritance (which I will discuss later).

Parameters

In general, a parameter is referred to by the following syntax in properties and expressions: #{My Parameter}, where the parameter is named "My Parameter". By contrast, a variable expression for "my.variable" looks like this: ${my.variable}. Note that the syntax for parameters uses the "#" symbol, while the syntax for variables shares the Expression Language syntax using the "$" symbol. Naming rules are also different; for example, parameter names can contain spaces where variable names cannot. You can use a parameter value alongside an Expression Language expression, such as #{My Parameter}.${literal('test')} or just #{My Parameter}.test. If the value of "My Parameter" is "this.is.a", then the output in both cases is "this.is.a.test".

The real power of parameters comes from grouping them and creating a hierarchy of those groups. Parameter groups are called Parameter Contexts, and a Parameter Context can be set on a Process Group (including the root process group). For a Process Group with a given Parameter Context, this means you can refer to its Parameters from any component anywhere in that Process Group.

To create a Parameter Context, go to the hamburger menu in the top-right corner of the UI and select Parameter Contexts. You can then select the + button to add a new one and give it a name. On the Parameters tab, select the + button to add a new parameter, giving it a name and a value and specifying whether that value is sensitive (such as a password) or not. Now that we have our Parameters in our Parameter Context, we can assign it to our example Process Group from its configuration menu on the canvas. Note the "Apply recursively" checkbox. If this is not checked, only the components inside the Process Group itself have access to the Parameters in the context. If you want all child Process Groups to have access to the Parameters "down the tree", check this box. Apply the changes and we're ready to use this Parameter Context!

Overriding Parameter Values

Now imagine that inside one of the child Process Groups we want different values for Username and Password. If we have applied the parent Process Group's Parameter Context, we will be using the parent's values for Username and Password. To override them, we need a new Parameter Context; let's call it "Override Dev Process Group for Child PG 1".
We then add Username and Password parameters with the different values (I chose "jsmith" for Username and "joe's password" for Password). Once this is applied, you select the new Parameter Context to be used by the aforementioned Child Process Group 1. Looking at Child Process Group 2, you can see it still has the parent's Parameter Context. This is how you override parameters inside one child Process Group while allowing other child Process Groups to use the parent's parameter values.

Swapping Parameter Contexts

This "overriding" concept can also be applied to the parent Process Group. For example, let's say we want to run the same flow in the Production environment. You could edit the parameter values directly, but then the flow would no longer run on the Development system. The correct way to do this is to create a new Parameter Context for the production environment, setting the parameters and their values (in this case "mburgess" remains the Username and "prod_p@ssword" is the Password). When we want the flow to run in the production environment, we only need to swap the Parameter Context from Dev to Prod on the parent Process Group. In this way you can run the same flow in multiple environments.

Summary

In this article I presented the history of Variables and Parameters, then discussed Parameter Contexts as groups of parameters. Using the inheritance and overriding features, I illustrated how parameter values can be overridden in specific child Process Groups and otherwise inherited from the parent Process Group. In the next article, I will extend this concept even further: not only can you use parameters in processor properties for configuration, you can also use parameters to reference Controller Services, so you can swap and/or override actual Controller Service implementations!
10-03-2025
04:00 PM
1 Kudo
Welcome! In this series of articles I will address various topics about Cloudera Flow Management (powered by Apache NiFi) in terms of Best Practices for how, when, and why to use the powerful and flexible features of Cloudera Flow Management (CFM) / Apache NiFi. I will use the terms CFM and NiFi interchangeably, but will call out anything that is specific to Cloudera Flow Management and does not apply to Apache NiFi. Examples and screenshots will be taken from both CFM 2 (powered by Apache NiFi 1.x) and CFM 4 (powered by Apache NiFi 2.x), both to illustrate features and Best Practices common to CFM 2 and 4 and to highlight anything specific to CFM 4 / NiFi 2. Links will be added as articles are published. This Best Practices Cookbook will feature the following articles, but please like/subscribe as articles may be added and/or edited over time. Also, if you have any ideas for Best Practices articles that may be helpful to the community, please leave them in the comments below. Happy Flowing!

Best Practices Cookbook

- Parameterize All The Things! (part 1 - Parameters and Parameter Contexts)
- Parameterize All The Things! (part 2 - Parameterizing Controller Service References)
- Parameterize All The Things! (part 3 - Parameter Providers)
- Backpressure (Maximize throughput)
- Schema Drift (what to do when your source data structures change)
- Flow Analysis Rules (enforcing Best Practices in Flow Design)
- Unlocking the power of Registry (Version Control, Flow Development Lifecycle, etc.)
10-03-2025
11:00 AM
Do you need the script, or can you just configure a DBCPConnectionPool to point at the Impala JARs and the connection URL? Also, that version may have an ImpalaConnectionPool, which already includes the driver and makes configuring a connection to Impala easier.
09-30-2025
05:23 AM
2 Kudos
You can use ExecuteGroovyScript with the following script:

```groovy
def ff = session.get()
if (!ff) return

def obj = new groovy.json.JsonSlurper().parse(ff.read())
def outObj = []

// Find updated records
def old_ids = obj.old_orders.collect { it.order_id }
def latest_ids = obj.latest_orders.collect { it.order_id }
old_ids.intersect(latest_ids).each { order_id ->
    def update_order = obj.latest_orders.find { it.order_id == order_id }
    update_order.Action = 'UPDATE'
    outObj += update_order
}

// Find deleted records
(old_ids - latest_ids).each { order_id ->
    def delete_order = obj.old_orders.find { it.order_id == order_id }
    delete_order.Action = 'DELETE'
    outObj += delete_order
}

// Find new records
(latest_ids - old_ids).each { order_id ->
    def new_order = obj.latest_orders.find { it.order_id == order_id }
    new_order.Action = 'NEW'
    outObj += new_order
}

ff.write('UTF-8', groovy.json.JsonOutput.toJson(outObj))
REL_SUCCESS << ff
```
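For illustration, here is a hypothetical input and the output the script above would produce. The order_id values and the "amount" field are made up; the script only assumes top-level old_orders and latest_orders arrays whose records have an order_id field.

```groovy
/* Hypothetical example (field names other than order_id are illustrative only):
   Input FlowFile content:
     { "old_orders":    [ {"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 50} ],
       "latest_orders": [ {"order_id": 1, "amount": 120}, {"order_id": 3, "amount": 75} ] }
   Output FlowFile content:
     [ {"order_id": 1, "amount": 120, "Action": "UPDATE"},
       {"order_id": 2, "amount": 50,  "Action": "DELETE"},
       {"order_id": 3, "amount": 75,  "Action": "NEW"} ]
*/
```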
09-20-2025
09:15 AM
If "Infer Schema" isn't working this is likely a bug. Could you provide an example JSON and the error message that happens during schema inference?
06-26-2025
01:21 PM
If you know where the CSV files are on the filesystem and the condition is simple, you may be able to start with CSV file 1 and then use two LookupRecord processors in sequence with two CSVRecordLookupService controller services (pointing at CSV files 2 and 3, respectively). If that doesn't suit your needs, check out the ForkEnrichment and JoinEnrichment processors; they may be able to do what you need.
06-19-2025
02:48 PM
1 Kudo
Provenance/lineage is not currently visible from the Flow Designer. This is intentional, because the Flow Designer UI is for flow design regardless of whether there is a Test Session or deployment active. Provenance and lineage are associated with actual data running through a deployment, so to view them you'll need to navigate to the Cloudera Flow Management (NiFi) canvas from the deployment view once your flow has been deployed. From the canvas you can proceed as the video instructs, and hopefully it looks familiar at that point.
05-30-2025
01:53 PM
1 Kudo
The strategy "Matches Regular Expression" intends to match the entire line, but your regex only matches the first two characters. The regex "^(G;).*" will match the entire line.
04-29-2025
12:37 PM
3 Kudos
With the addition of Python to Apache NiFi 2.0 and Cloudera Flow Management 4, developers now have a new way to rapidly develop processors and can leverage Python libraries for FlowFile processing, which is an exciting new feature. But what about NiFi's scripting capabilities using Jython? In this post I will explore the history of Jython and Python in the Apache NiFi project, their capabilities, and the current state of Python and Jython in both Apache NiFi 2 and Cloudera Flow Management 4.

NiFi's Scripting Capabilities:

Apache NiFi 0.5.0 introduced the ExecuteScript processor, allowing developers to interact with the NiFi API using several JVM-based scripting languages such as Groovy, Javascript, Lua, Clojure, and Jython, a JVM-based implementation of Python. The Jython language is compatible with Python, so any script you write in pure Python should work in ExecuteScript using Jython. "Pure" Python means the entire code is written in the Python language rather than in other languages that Python has bindings for, such as C. This distinction is important because many popular Python libraries such as pandas and scikit include components written in C and are NOT pure Python. This means that even though you can write a Python program importing the pandas library (for example), that program/script will not run in Jython. This limitation significantly narrows the scope of what you can accomplish with Jython in ExecuteScript.

A workaround is to use the ExecuteStreamCommand processor to run a Python script file using the actual Python interpreter. However, this method comes with several important constraints. The script can only receive FlowFile content via standard input (stdin) and must return FlowFile content using standard output (stdout). You have to bring your own Python interpreter. Access to the rest of the NiFi processor API, including FlowFile attributes, creating multiple FlowFiles, and multiple commits to the session, is not available via ExecuteStreamCommand, which can severely limit the operations you can perform on your FlowFiles. Because of these limitations, ExecuteStreamCommand should be your last resort for executing Python programs, used only when there is no other way to achieve the goal with Jython or native Python processors in Cloudera Flow Management 4.
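For illustration, here is a minimal sketch of such a script (the uppercasing transformation is made up for this example). Notice that it can only read the incoming FlowFile content from stdin and write the outgoing content to stdout:

```python
#!/usr/bin/env python3
# Hypothetical ExecuteStreamCommand script: reads the incoming FlowFile content
# from stdin, transforms it, and writes the new FlowFile content to stdout.
# No FlowFile attributes or ProcessSession access is available here.
import sys

content = sys.stdin.read()          # incoming FlowFile content
sys.stdout.write(content.upper())   # becomes the outgoing FlowFile content
```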
Jython support in the scripting components was deprecated in Apache NiFi 1.x and fully removed in NiFi 2.0, which means that if you have Jython scripts in the scripting components in NiFi 1, you will not be able to migrate them to Apache NiFi 2.x as-is. Cloudera Flow Management 4, however, has restored Jython support, ensuring continued compatibility for existing flows.

Python in Apache NiFi 2.0:

The generally available (GA) release of Apache NiFi 2.0 in November 2024 introduced a number of new features not present in the NiFi 1.x line. Perhaps the most notable is the addition of the Python Processor Software Development Kit (SDK). The Python SDK allows developers to write processors in Python and leverage CPython libraries such as pandas and scikit. Given Python's popularity for Artificial Intelligence (AI) applications, the Python SDK in NiFi enables users to create processors that use AI libraries for data processing.

NiFi 2.0 comes with a handful of processors written in Python, including ones for data parsing and VectorDB integrations with systems like Chroma, Pinecone, and Qdrant, commonly used for Retrieval Augmented Generation (RAG) use cases involving document processing, storage, and semantic search. Cloudera Flow Management 4 adds many additional Python processors that are not available in open-source Apache NiFi 2.0, including:

- Chunk/EmbedData: Allows the user to chunk their unstructured data and then embed the data into vectors
- PromptBedrock: Provides integration with Amazon Bedrock
- LexicalQuery/InsertTo/VectorQueryMilvus: Provides integration to/from the Milvus vector database
- PromptChatGPT: Allows the user to send prompts to ChatGPT and process the results
- PartitionText/Csv/Docx/Pdf/Html: Offers a wider range of structured and unstructured document formats that can be partitioned for later chunking and embedding, increasing the quality and relevance of prompt responses

These processors rely on CPython libraries and cannot be implemented in Jython.

Jython and Python in Cloudera Flow Management 4:

Because Jython is based on Python 2.x and is no longer actively maintained, the Apache NiFi community removed the Jython library from the NiFi ARchive (NAR) containing the scripting components. The upcoming GA release of Cloudera Flow Management 4 adds Jython back into the scripting NAR. This allows Cloudera customers with existing Jython scripts to continue running those scripts without modification. This is just one example of how Cloudera supports backward compatibility for NiFi flows. Many components removed from NiFi 2.0 are retained in Cloudera Flow Management 4, such as Hive, HBase, Atlas, Ranger, Couchbase 2, Cassandra 3, Kafka 2.6, and Kudu. This ensures that Cloudera customers who rely on these features can upgrade without making unnecessary changes to their flows.

The NiFi Python SDK:

Although the introduction of the Python SDK enables developers to implement processors in Python, the SDK was designed for simplicity and does not expose the full NiFi Processor API. There are limitations that must be considered when designing a processor:

- Python processors do not have access to the ProcessSession, so they cannot create multiple FlowFiles, commit/rollback multiple session transactions, merge FlowFiles, etc.
- Only three types of processors are currently supported in the Python SDK: FlowFileSource, FlowFileTransform, and RecordTransform.
- A FlowFileSource processor does not allow incoming connections and produces a single FlowFile as output. This can be used as a source processor, for example to generate sample data.
- A FlowFileTransform processor receives a single incoming FlowFile and produces a single FlowFile as output. This would be used when modifying the entire content of a FlowFile, such as converting binary image data from JPG to PNG.
- A RecordTransform processor also receives a single incoming FlowFile and produces a single FlowFile as output, but it can process individual records in the incoming data via a configured RecordReader controller service. This would be used to transform each record individually, such as changing the value of a field in each record.

These limitations will affect users in several ways. First, only NiFi Processors can be scripted in Python, not Controller Services or Reporting Tasks. Next, a Python processor can work with at most one FlowFile per execution.
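To make the FlowFileTransform type concrete, here is a minimal sketch of a Python SDK processor. The class name and the uppercase transformation are made up for this example; treat it as a sketch rather than a production processor:

```python
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult

class UppercaseContent(FlowFileTransform):
    """Hypothetical example: uppercases the content of each incoming FlowFile."""

    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '2.0.0'
        description = 'Uppercases the content of each incoming FlowFile (illustrative example).'

    def __init__(self, **kwargs):
        super().__init__()

    def transform(self, context, flowfile):
        # Read the single incoming FlowFile's content; no ProcessSession access here
        text = flowfile.getContentsAsBytes().decode('utf-8')

        # Return exactly one output FlowFile, routed to the 'success' relationship
        return FlowFileTransformResult(relationship='success',
                                       contents=text.upper(),
                                       attributes={'example.transformed': 'true'})
```

In a deployment, a class like this would live in a .py file placed in NiFi's configured Python extensions directory so the framework can discover and load it.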
These constraints limit what a processor can do in terms of creating/cloning multiple FlowFiles, ingesting a batch of FlowFiles, and handling lifecycle events such as onPropertyModified().

When to use Jython or Python?

Choosing between Jython (via the scripted components such as ExecuteScript) and the Python SDK for processors in Cloudera Flow Management 4 depends on your use case and requirements. Here is a quick guide to help with decision making:

Use Jython when | Use Python when |
---|---|
You already have existing Jython scripts in your flows | You want to import a CPython library such as pandas |
You want to implement scripted controller services such as ScriptedRecordReader | You are more familiar with Python than with the other supported scripting languages |
You need control over multiple session transactions (i.e. commits/rollbacks) and/or FlowFiles | You only need to work with one FlowFile at a time |

The Jython and Python capabilities in Cloudera Flow Management offer several ways to use the Python programming language and its libraries to rapidly develop NiFi components. Leveraging Jython in scripted components offers a rich set of features for developers who are more comfortable with the Python language, and using Python libraries to develop NiFi processors enables more integrations with external systems such as GenAI applications. The combination of these in Cloudera Flow Management presents a powerful environment in which to enhance and enrich your flows and achieve a higher level of business success.

Learn More:

To explore the new capabilities of Cloudera Flow Management 4 and discover how it can transform your data pipelines with native Python support, watch our The Five Things You Need To Know about Unlocking The Power of NiFi 2.0 webinar. For a full list of connectors in Cloudera Flow Management, visit the Cloudera Connectors Library.
03-21-2025
09:57 AM
This may be related to https://issues.apache.org/jira/browse/NIFI-11783, which was fixed in CFM 2.1.6. Is upgrading your CFM an option?