Created on 05-18-2023 06:40 AM - edited on 05-18-2023 10:49 PM by VidyaSargur
Today I finally found the time and subject matter to start an article showcasing how I am working to operationalize NiFi flows with Cloudera's CDP Public Cloud DataFlow data service. I have been so busy working with my Cloudera accounts while training up on all things CDP that I just have not had a chance to write a community article. I am very excited to share my experience with this incredible modernization Cloudera has provided on top of NiFi.
Traditionally, in large-scale NiFi environments, a single NiFi canvas contains all of the process groups and data flows. As you can see here, this gets visually complicated:
Operating NiFi in this manner makes operations around the data flows on this canvas technically complicated. Even with user access and authorization via Ranger, the root-level canvas still shows the entire cluster's process groups. If you are the operations owner of a NiFi cluster with hundreds of flows you did not build, it is nearly impossible to find flow errors, understand what flows are actually doing, tune flows for performance, version flows, or identify "noisy neighbor" flows.
The answer to these problems is the new DataFlow experience, one of CDP Public Cloud's newest Kubernetes-driven Data Services.
Cloudera DataFlow for Public Cloud (CDF-PC) is a cloud-native service that enables self-serve deployments of Apache NiFi data flows from a central catalog. DataFlow Deployments provides a cloud-native runtime to run your Apache NiFi flows on auto-scaling Kubernetes clusters, with centralized monitoring and alerting capabilities for the deployments. DataFlow Functions provides a cloud-native runtime to run your Apache NiFi flows as functions on the serverless compute services of AWS Lambda, Azure Functions, and Google Cloud Functions, targeting use cases that do not require always-running NiFi flows.
What You Will Find In DataFlow
Think of the Data Catalog as your cloud version of NiFi Registry. Here you can create, version, and deploy your flows. Running in the CDP Public Cloud control plane, the Data Catalog allows you to deploy flows to multiple environments in any supported cloud (AWS, Azure, or GCP).
The ReadyFlow Gallery is a collection of pre-built, easily deployed data flows. These ReadyFlows solve common use cases and are a great starting point for new DataFlow users to get comfortable deploying individual flows. The flows are fully parameterized, so it is possible to deploy and operate them without touching the NiFi canvas. Some of the ReadyFlows I have used recently are:
The newest addition to the DataFlow family, the Flow Designer is a serverless NiFi flow design UI that allows you to create, test, and publish data flows to the Data Catalog. Very similar to NiFi, the Flow Designer provides the same capabilities in a streamlined UI with direct testing and integration with DataFlow.
This is one of the coolest new NiFi capabilities to come: DataFlow Functions. Using the Data Catalog, you can grab a data flow's CRN and run that flow as a cloud function with your cloud provider of choice. You can now deploy stateless NiFi functions that live in your cloud region(s) and execute on event triggers.
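To give a feel for the wiring, a function deployment essentially binds the flow's CRN from the Data Catalog into the cloud function's configuration. A minimal sketch, assuming a `FLOW_CRN`-style environment variable and the general `crn:cdp:df:...` CRN shape (both are assumptions to verify against the DataFlow Functions documentation for your provider):

```python
# Hypothetical helper: sanity-check the flow CRN copied from the Data Catalog
# before wiring it into a cloud function's environment configuration.
def is_flow_crn(crn: str) -> bool:
    # CDP CRNs begin with "crn:cdp:" followed by a service segment ("df" for
    # DataFlow); the exact segment layout varies, so only the prefix and a
    # minimum segment count are checked here.
    return crn.startswith("crn:cdp:df:") and len(crn.split(":")) >= 5

env = {
    # The variable name is an assumption -- check the DataFlow Functions docs
    # for the exact configuration your provider's runtime expects.
    "FLOW_CRN": "crn:cdp:df:us-west-1:tenant-id:flow:my-flow/v.1",
}
assert is_flow_crn(env["FLOW_CRN"])
```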
To move a NiFi data flow from legacy NiFi to DataFlow, you need a flow definition file for each data flow. This is the JSON version of your flow, and you can import it straight into the Data Catalog. If you are running a newer version of NiFi, you can simply right-click a process group and choose Download Flow Definition. If you are on an older version of NiFi, you must convert your template to a flow definition file.
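If you would rather script the export than use the right-click menu, newer NiFi releases expose the same download through the REST API. A minimal sketch, assuming the `/process-groups/{id}/download` endpoint and an unsecured NiFi (verify the path against your NiFi version, and add authentication for a secured cluster):

```python
import urllib.request

def flow_definition_url(base_url: str, process_group_id: str) -> str:
    """Build the NiFi REST endpoint that returns a process group's
    flow definition as JSON (available in newer NiFi releases)."""
    return f"{base_url.rstrip('/')}/nifi-api/process-groups/{process_group_id}/download"

def download_flow_definition(base_url: str, process_group_id: str, out_path: str) -> None:
    # NOTE: add authentication headers here if your NiFi cluster is secured.
    with urllib.request.urlopen(flow_definition_url(base_url, process_group_id)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())

# Usage (requires a running NiFi instance):
# download_flow_definition("http://localhost:8080", "<process-group-id>", "my-flow.json")
```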
Once you have imported your flow definition file, it is time to start doing some upgrades. Deploy the flow, then open the NiFi UI for the deployed flow, make your revisions, export a new flow definition, upload it to the Data Catalog, deploy again, remove the previous deployment, and repeat.
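The import and deploy steps can also be driven from the CDP CLI instead of the UI. A hedged sketch of wrapping that in a script; the `cdp df import-flow-definition` subcommand and its flags are assumptions based on the CLI's `df` module, so confirm them with `cdp df help` for your CLI version:

```python
import subprocess

def import_flow_cmd(name: str, definition_path: str) -> list:
    # Hypothetical CDP CLI invocation -- verify the exact subcommand and
    # flags with `cdp df help` before relying on this in automation.
    return ["cdp", "df", "import-flow-definition",
            "--name", name, "--file", definition_path]

def run(cmd: list) -> None:
    # Raises CalledProcessError if the CLI returns a non-zero exit code.
    subprocess.run(cmd, check=True)

# Usage (requires a configured CDP CLI and an entitled environment):
# run(import_flow_cmd("customer-ingest", "customer-ingest.json"))
```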
A few things we want to consider during flow upgrades:
Convert all sensitive properties to parameters. Always use parameters instead of variables; parameters can be supplied at deployment time.
Rename all processors with meaningful names. For example, "Get Customer ID" instead of "EvaluateJsonPath" for a processor that evaluates JSON for a customer ID.
Rename any queues you wish to track with KPIs so they have unique names. Any KPIs attached to the data flow must reference unique names within the flow. For example, "CustomSuccess" instead of the default "success".
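On the parameters point above: NiFi variable references use `${name}` while parameter references use `#{name}`. A minimal sketch of bulk-rewriting property values during an upgrade pass; this is an illustrative text transform, not a full migration tool:

```python
import re

def variables_to_parameters(value: str) -> str:
    """Rewrite NiFi variable references ${name} into parameter
    references #{name}.

    Caution: NiFi Expression Language also uses the ${...} syntax,
    so only apply this to properties that truly reference variables,
    and review each converted value by hand."""
    return re.sub(r"\$\{([^}]+)\}", r"#{\1}", value)

print(variables_to_parameters("${kafka.broker}/topics/${topic.name}"))
# -> #{kafka.broker}/topics/#{topic.name}
```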
To convert your template XML to flow definition file JSON, check out this lil project I created:
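That project handles the full conversion; purely to illustrate the template structure it has to work with, here is a sketch that lists the processors inside a template's `<snippet>`. The sample XML is a trimmed, hand-written stand-in for a real exported template, so element layout may differ slightly across NiFi versions:

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written stand-in for a real NiFi template export.
TEMPLATE_XML = """<template encoding-version="1.2">
  <name>demo</name>
  <snippet>
    <processors>
      <id>p1</id>
      <name>Get Customer ID</name>
      <type>org.apache.nifi.processors.standard.EvaluateJsonPath</type>
    </processors>
  </snippet>
</template>"""

def list_processors(template_xml: str) -> list:
    """Return (name, type) pairs for each processor in the template's
    snippet -- a starting point for mapping template XML into flow
    definition JSON."""
    root = ET.fromstring(template_xml)
    return [(proc.findtext("name"), proc.findtext("type"))
            for proc in root.findall("./snippet/processors")]

print(list_processors(TEMPLATE_XML))
# -> [('Get Customer ID', 'org.apache.nifi.processors.standard.EvaluateJsonPath')]
```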