Member since: 11-16-2015 | Posts: 902 | Kudos Received: 664 | Solutions: 249
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 85 | 09-30-2025 05:23 AM |
 | 567 | 06-26-2025 01:21 PM |
 | 414 | 06-19-2025 02:48 PM |
 | 662 | 05-30-2025 01:53 PM |
 | 9607 | 02-22-2024 12:38 PM |
10-03-2025
04:00 PM
1 Kudo
This article is part of the Cloudera Flow Management / Apache NiFi Best Practices Cookbook series and discusses Best Practices for using Parameters and Parameter Contexts in your flows.

History

First, a little background about Parameters: in NiFi 1.x, the original solution for configuring flows for different environments (e.g., development vs. production) was Variables. This was a single, global key/value store of variable names and their values. Variable names could be used in Expression Language expressions as well as in certain processor properties that look specifically for a variable name. This single, global list was not flexible enough to solve many of our users' configuration challenges. To address this, Parameters were introduced. They are meant as a replacement for Variables and, via Parameter Contexts, are much more flexible and powerful due to inheritance (which I will discuss later).

Parameters

In general, a parameter is referred to by the following syntax in properties and expressions: #{My Parameter}, where the parameter is named "My Parameter". By contrast, a variable expression for "my.variable" looks like this: ${my.variable}. Note that the syntax for parameters uses the "#" symbol, while the syntax for variables shares the Expression Language syntax using the "$" symbol. Naming rules are also different; for example, parameter names can contain spaces where variable names cannot. You can use a parameter value alongside an Expression Language expression, such as #{My Parameter}.${literal('test')} or just #{My Parameter}.test. If the value of "My Parameter" is "this.is.a", then the output in both cases is "this.is.a.test".

The real power of parameters comes from grouping them and creating a hierarchy of those groups. Parameter groups are called Parameter Contexts, and a Parameter Context can be set on a Process Group (including the root process group). For a Process Group with a given Parameter Context, this means you can refer to its Parameters from any component anywhere in that Process Group.

To create a Parameter Context, go to the hamburger menu in the top-right corner of the UI and select Parameter Contexts. You can then select the + button to add a new one and give it a name. On the Parameters tab, select the + button to add a new parameter, giving it a name and a value and specifying whether that value is sensitive (such as a password) or not. Now that we have our Parameters in our Parameter Context, we can assign it to our example Process Group from its configuration menu on the canvas. Note the "Apply recursively" checkbox. If this is not checked, only the components inside the Process Group itself have access to the Parameters in the context. If you want all child Process Groups to have access to the Parameters "down the tree", check this box. Apply the changes and we're ready to use this Parameter Context!

Overriding Parameter Values

Now imagine that inside one of the child Process Groups we want different values for Username and Password. If we have applied the parent Process Group's Parameter Context, we will be using the parent's values for Username and Password. To override them, we need a new Parameter Context; let's call it "Override Dev Process Group for Child PG 1".
We then add Username and Password parameters with the different values (I chose "jsmith" for Username and "joe's password" for Password). Once this is applied, you select the new Parameter Context to be used by the aforementioned Child Process Group 1. Looking at Child Process Group 2, you can see it still has the parent's Parameter Context. This is how you override parameters inside one child Process Group while allowing other child Process Groups to use the parent's parameter values.

Swapping Parameter Contexts

This "overriding" concept can also be applied to the parent Process Group. For example, let's say we want to run the same flow in the Production environment. You could edit the parameter values directly, but then the flow would no longer run on the Development system. The correct way to do this is to create a new Parameter Context for the production environment, setting the parameters and their values (in this case "mburgess" remains the Username and "prod_p@ssword" is the Password). When we want the flow to run in the production environment, we only need to swap the Parameter Context from Dev to Prod on the parent Process Group. In this way you can run the same flow in multiple environments.

Summary

In this article I presented the history of Variables and Parameters, then discussed Parameter Contexts as groups of parameters. Using the inheritance and overriding features, I illustrated how parameter values can be overridden in specific child Process Groups and otherwise inherited from the parent Process Group. In the next article, I will extend this concept even further: not only can you use parameters in processor properties for configuration, you can also use parameters to reference Controller Services, so you can swap and/or override actual Controller Service implementations!
10-03-2025
04:00 PM
1 Kudo
Welcome! In this series of articles I will address various topics about Cloudera Flow Management (powered by Apache NiFi) in terms of Best Practices for how, when, and why to use the powerful and flexible features of Cloudera Flow Management (CFM) / Apache NiFi. I will use the terms CFM and NiFi interchangeably, but will call out anything that is specific to Cloudera Flow Management and does not apply to Apache NiFi. Examples and screenshots will be taken from both CFM 2 (powered by Apache NiFi 1.x) and CFM 4 (powered by Apache NiFi 2.x), both to illustrate features and Best Practices common to CFM 2 and 4 and to highlight anything specific to CFM 4 / NiFi 2. Links will be added as articles are published. This Best Practices Cookbook will feature the following articles, but please like/subscribe as articles may be added and/or edited over time. Also, if you have any ideas for Best Practices articles that may be helpful to the community, please leave them in the comments below. Happy Flowing!

Best Practices Cookbook

- Parameterize All The Things! (part 1 - Parameters and Parameter Contexts)
- Parameterize All The Things! (part 2 - Parameterizing Controller Service References)
- Parameterize All The Things! (part 3 - Parameter Providers)
- Backpressure (Maximize throughput)
- Schema Drift (what to do when your source data structures change)
- Flow Analysis Rules (enforcing Best Practices in Flow Design)
- Unlocking the power of Registry (Version Control, Flow Development Lifecycle, etc.)
10-03-2025
11:00 AM
Do you need the script, or can you just configure a DBCPConnectionPool to point at the Impala JARs and the connection URL? Also, that version may have an ImpalaConnectionPool, which already includes the driver and makes configuring a connection to Impala easier.
09-30-2025
05:23 AM
2 Kudos
You can use ExecuteGroovyScript with the following script:

```groovy
def ff = session.get()
if (!ff) return

def obj = new groovy.json.JsonSlurper().parse(ff.read())
def outObj = []

// Find updated records
def old_ids = obj.old_orders.collect { it.order_id }
def latest_ids = obj.latest_orders.collect { it.order_id }
old_ids.intersect(latest_ids).each { order_id ->
    def update_order = obj.latest_orders.find { it.order_id == order_id }
    update_order.Action = 'UPDATE'
    outObj += update_order
}

// Find deleted records
(old_ids - latest_ids).each { order_id ->
    def delete_order = obj.old_orders.find { it.order_id == order_id }
    delete_order.Action = 'DELETE'
    outObj += delete_order
}

// Find new records
(latest_ids - old_ids).each { order_id ->
    def new_order = obj.latest_orders.find { it.order_id == order_id }
    new_order.Action = 'NEW'
    outObj += new_order
}

ff.write('UTF-8', groovy.json.JsonOutput.toJson(outObj))
REL_SUCCESS << ff
```
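For illustration, here is a hypothetical input and the output the script above would produce. The order_id values and the "amount" field are made up; the script only assumes top-level old_orders and latest_orders arrays whose records have an order_id field.

```groovy
/* Hypothetical example (field names other than order_id are illustrative only):
   Input FlowFile content:
     { "old_orders":    [ {"order_id": 1, "amount": 100}, {"order_id": 2, "amount": 50} ],
       "latest_orders": [ {"order_id": 1, "amount": 120}, {"order_id": 3, "amount": 75} ] }
   Output FlowFile content:
     [ {"order_id": 1, "amount": 120, "Action": "UPDATE"},
       {"order_id": 2, "amount": 50,  "Action": "DELETE"},
       {"order_id": 3, "amount": 75,  "Action": "NEW"} ]
*/
```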
09-20-2025
09:15 AM
If "Infer Schema" isn't working this is likely a bug. Could you provide an example JSON and the error message that happens during schema inference?
06-26-2025
01:21 PM
If you know where the CSV files are on the filesystem and the condition is simple, you may be able to start with CSV file 1 and then use two LookupRecord processors in sequence with two CSVRecordLookupService controller services (pointing at CSV files 2 and 3, respectively). If that doesn't suit your needs, check out the ForkEnrichment and JoinEnrichment processors; they may be able to do what you need.
06-19-2025
02:48 PM
1 Kudo
Provenance/lineage is not currently visible from the Flow Designer. This is intentional, because the Flow Designer UI is for flow design regardless of whether there is a Test Session or deployment active. Provenance and lineage are associated with actual data running through a deployment, so to view them you'll need to navigate to the Cloudera Flow Management (NiFi) canvas from the deployment view once your flow has been deployed. From the canvas you can proceed as the video instructs, and hopefully it looks familiar at that point.
05-30-2025
01:53 PM
1 Kudo
The strategy "Matches Regular Expression" intends to match the entire line, but your regex only matches the first two characters. The regex "^(G;).*" will match the entire line.
04-29-2025
12:37 PM
3 Kudos
With the addition of Python to Apache NiFi 2.0 and Cloudera Flow Management 4, developers now have a new way to rapidly develop processors and can leverage Python libraries for FlowFile processing, which is an exciting new feature. But what about NiFi's scripting capabilities using Jython? In this post I will explore the history of Jython and Python in the Apache NiFi project, their capabilities, and the current state of Python and Jython in both Apache NiFi 2 and Cloudera Flow Management 4.

NiFi's Scripting Capabilities:

Apache NiFi 0.5.0 introduced the ExecuteScript processor, allowing developers to interact with the NiFi API using several JVM-based scripting languages such as Groovy, Javascript, Lua, Clojure, and Jython, a JVM-based implementation of Python. The Jython language is compatible with Python, so any script you write in pure Python should work in ExecuteScript using Jython. "Pure" Python means the entire code is written in the Python language rather than in other languages that Python has bindings for, such as C. This distinction is important because many popular Python libraries such as pandas and scikit include components written in C and are NOT pure Python. This means that even though you can write a Python program importing the pandas library (for example), that program/script will not run in Jython. This limitation significantly narrows the scope of what you can accomplish with Jython in ExecuteScript.

A workaround is to use the ExecuteStreamCommand processor to run a Python script file using the actual Python interpreter. However, this method comes with several important constraints. The script can only receive FlowFile content via standard input (stdin) and must return FlowFile content using standard output (stdout). You have to bring your own Python interpreter. Access to the rest of the NiFi processor API, including FlowFile attributes, creating multiple FlowFiles, and multiple commits to the session, is not available via ExecuteStreamCommand, which can severely limit the operations you can perform on your FlowFiles. Because of these limitations, ExecuteStreamCommand should be your last resort for executing Python programs, used only when there is no other way to achieve the goal with Jython or native Python processors in Cloudera Flow Management 4.
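For illustration, here is a minimal sketch of such a script (the uppercasing transformation is made up for this example). Notice that it can only read the incoming FlowFile content from stdin and write the outgoing content to stdout:

```python
#!/usr/bin/env python3
# Hypothetical ExecuteStreamCommand script: reads the incoming FlowFile content
# from stdin, transforms it, and writes the new FlowFile content to stdout.
# No FlowFile attributes or ProcessSession access is available here.
import sys

content = sys.stdin.read()          # incoming FlowFile content
sys.stdout.write(content.upper())   # becomes the outgoing FlowFile content
```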
Jython support in the scripting components was deprecated in Apache NiFi 1.x and fully removed in NiFi 2.0, which means that if you have Jython scripts in the scripting components in NiFi 1, you will not be able to migrate them to Apache NiFi 2.x as-is. Cloudera Flow Management 4, however, has restored Jython support, ensuring continued compatibility for existing flows.

Python in Apache NiFi 2.0:

The generally available (GA) release of Apache NiFi 2.0 in November 2024 introduced a number of new features not present in the NiFi 1.x line. Perhaps the most notable is the addition of the Python Processor Software Development Kit (SDK). The Python SDK allows developers to write processors in Python and leverage CPython libraries such as pandas and scikit. Given Python's popularity for Artificial Intelligence (AI) applications, the Python SDK in NiFi enables users to create processors that use AI libraries for data processing.

NiFi 2.0 comes with a handful of processors written in Python, including ones for data parsing and VectorDB integrations with systems like Chroma, Pinecone, and Qdrant, commonly used for Retrieval Augmented Generation (RAG) use cases involving document processing, storage, and semantic search. Cloudera Flow Management 4 adds many additional Python processors that are not available in open-source Apache NiFi 2.0, including:

- Chunk/EmbedData: Allows the user to chunk their unstructured data and then embed the data into vectors
- PromptBedrock: Provides integration with Amazon Bedrock
- LexicalQuery/InsertTo/VectorQueryMilvus: Provides integration to/from the Milvus vector database
- PromptChatGPT: Allows the user to send prompts to ChatGPT and process the results
- PartitionText/Csv/Docx/Pdf/Html: Offers a wider range of structured and unstructured document formats that can be partitioned for later chunking and embedding, increasing the quality and relevance of prompt responses

These processors rely on CPython libraries and cannot be implemented in Jython.

Jython and Python in Cloudera Flow Management 4:

Because Jython is based on Python 2.x and is no longer actively maintained, the Apache NiFi community removed the Jython library from the NiFi ARchive (NAR) containing the scripting components. The upcoming GA release of Cloudera Flow Management 4 adds Jython back into the scripting NAR. This allows Cloudera customers with existing Jython scripts to continue running those scripts without modification. This is just one example of how Cloudera supports backward compatibility for NiFi flows. Many components removed from NiFi 2.0 are retained in Cloudera Flow Management 4, such as Hive, HBase, Atlas, Ranger, Couchbase 2, Cassandra 3, Kafka 2.6, and Kudu. This ensures that Cloudera customers who rely on these features can upgrade without making unnecessary changes to their flows.

The NiFi Python SDK:

Although the introduction of the Python SDK enables developers to implement processors in Python, the SDK was designed for simplicity and does not expose the full NiFi Processor API. There are limitations that must be considered when designing a processor:

- Python processors do not have access to the ProcessSession, so they cannot create multiple FlowFiles, commit/rollback multiple session transactions, merge FlowFiles, etc.
- Only three types of processors are currently supported in the Python SDK: FlowFileSource, FlowFileTransform, and RecordTransform.
- A FlowFileSource processor does not allow incoming connections and produces a single FlowFile as output. This can be used as a source processor, for example to generate sample data.
- A FlowFileTransform processor receives a single incoming FlowFile and produces a single FlowFile as output. This would be used when modifying the entire content of a FlowFile, such as converting binary image data from JPG to PNG.
- A RecordTransform processor also receives a single incoming FlowFile and produces a single FlowFile as output, but it can process individual records in the incoming data via a configured RecordReader controller service. This would be used to transform each record individually, such as changing the value of a field in each record.

These limitations will affect users in several ways. First, only NiFi Processors can be scripted in Python, not Controller Services or Reporting Tasks. Next, a Python processor can work with at most one FlowFile per execution.
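To make the FlowFileTransform type concrete, here is a minimal sketch of a Python SDK processor. The class name and the uppercase transformation are made up for this example; treat it as a sketch rather than a production processor:

```python
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult

class UppercaseContent(FlowFileTransform):
    """Hypothetical example: uppercases the content of each incoming FlowFile."""

    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '2.0.0'
        description = 'Uppercases the content of each incoming FlowFile (illustrative example).'

    def __init__(self, **kwargs):
        super().__init__()

    def transform(self, context, flowfile):
        # Read the single incoming FlowFile's content; no ProcessSession access here
        text = flowfile.getContentsAsBytes().decode('utf-8')

        # Return exactly one output FlowFile, routed to the 'success' relationship
        return FlowFileTransformResult(relationship='success',
                                       contents=text.upper(),
                                       attributes={'example.transformed': 'true'})
```

In a deployment, a class like this would live in a .py file placed in NiFi's configured Python extensions directory so the framework can discover and load it.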
These constraints limit what a processor can do in terms of creating/cloning multiple FlowFiles, ingesting a batch of FlowFiles, and handling lifecycle events such as onPropertyModified().

When to use Jython or Python?

Choosing between Jython (via the scripted components such as ExecuteScript) and the Python SDK for processors in Cloudera Flow Management 4 depends on your use case and requirements. Here is a quick guide to help with decision making:

Use Jython when | Use Python when |
---|---|
You already have existing Jython scripts in your flows | You want to import a CPython library such as pandas |
You want to implement scripted controller services such as ScriptedRecordReader | You are more familiar with Python than with the other supported scripting languages |
You need control over multiple session transactions (i.e. commits/rollbacks) and/or FlowFiles | You only need to work with one FlowFile at a time |

The Jython and Python capabilities in Cloudera Flow Management offer several ways to use the Python programming language and its libraries to rapidly develop NiFi components. Leveraging Jython in scripted components offers a rich set of features for developers who are more comfortable with the Python language, and using Python libraries to develop NiFi processors enables more integrations with external systems such as GenAI applications. The combination of these in Cloudera Flow Management presents a powerful environment in which to enhance and enrich your flows and achieve a higher level of business success.

Learn More:

To explore the new capabilities of Cloudera Flow Management 4 and discover how it can transform your data pipelines with native Python support, watch our The Five Things You Need To Know about Unlocking The Power of NiFi 2.0 webinar. For a full list of connectors in Cloudera Flow Management, visit the Cloudera Connectors Library.
03-21-2025
09:57 AM
This may be related to https://issues.apache.org/jira/browse/NIFI-11783, which was fixed in CFM 2.1.6. Is upgrading your CFM an option?