Member since: 04-11-2016
Posts: 468
Kudos Received: 319
Solutions: 118
My Accepted Solutions
| Title | Views | Posted |
| --- | --- | --- |
|  | 1174 | 03-09-2018 05:31 PM |
|  | 1440 | 03-07-2018 09:45 AM |
|  | 1456 | 03-07-2018 09:31 AM |
|  | 2522 | 03-03-2018 01:37 PM |
|  | 1330 | 10-17-2017 02:15 PM |
01-12-2023
09:49 AM
Apache Flink Upgrade
Deployments can now upgrade from Flink 1.14 to 1.15.1. This update includes 40 bug fixes and a number of other enhancements. To learn more about what has been fixed, check out the release notes.

SQL Stream Builder UI
The Streaming SQL Console (UI) of SQL Stream Builder has been completely reworked with new design elements. The new design provides improved access to artifacts that are commonly used or already created as part of a project, simplifying navigation and saving the user time.

Software Development Lifecycle support with Projects
Projects for SQL Stream Builder improve upon the software development lifecycle needs of developers and analysts writing applications, allowing them to group related artifacts together and sync them to GitHub for versioning and CI/CD management. Before today, when users created SQL jobs, functions, and other artifacts, there was no effective way to migrate them to another environment (i.e., from dev to prod). Artifacts were typically migrated by copying and pasting code between environments; this lived outside of code repositories and the typical CI/CD process many companies use, took additional hands-on-keyboard time, and allowed errors to be introduced during copying and when updating environment-specific configurations. SQL Stream Builder Projects solve these issues. Users simply create a new project, give it a name, and link a GitHub repository to it as part of creation; from that point onward, any artifacts created in the project can be pushed to the GitHub repository with the click of a button. For environment-specific needs, a parameterized key-value configuration can be used to avoid editing configurations that change between deployments: jobs reference generic properties whose values are set differently in each environment (a hypothetical sketch follows at the end of this post).

Job Notifications
Job notifications help make sure you can detect failed jobs without checking the UI, which can save a lot of time. This is especially useful when you have numerous jobs running and keeping track of their state would be hard without notifications. Notifications can be sent to a single user or a group of users, either over email or by using a webhook.

Summary
In this post, we looked at some of the new features that came out in CDP Public Cloud 7.2.16. This includes Flink 1.15.1, which comes with many bug fixes, a brand new UI for SQL Stream Builder, the ability to monitor jobs for failures and send notifications, and new Software Development Lifecycle capabilities with Projects. For more details, read the latest release notes. Give Cloudera Streaming Analytics 7.2.16 for DataHub a try today and check out all the great new features!
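To make the parameterized configuration idea more concrete, here is a minimal, hypothetical sketch of a Flink SQL source table in an SSB project where the Kafka bootstrap servers come from a project-level property rather than being hard-coded, so only the property value changes between dev and prod. The table and topic names, the property key, and the ${...} placeholder syntax are illustrative assumptions, not necessarily the exact SSB syntax:

-- Hypothetical example: the placeholder below stands for a project configuration key
-- whose value differs between environments (e.g. dev vs prod Kafka brokers).
CREATE TABLE orders (
  order_id STRING,
  amount DOUBLE,
  order_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = '${kafka.bootstrap.servers}',
  'format' = 'json',
  'scan.startup.mode' = 'earliest-offset'
);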
12-14-2022
08:22 AM
1 Kudo
You can find the release notes and the download links in the documentation. Key features for this release:
- Rebase against NiFi 1.18, bringing the latest and greatest of Apache NiFi. It contains a ton of improvements and new features.
- Reset of the end-of-life policy: CFM 2.1.5 will be supported until August 2025 to match the CDP 7.1.7 LTS policy. This is particularly important as HDF and CFM 1.x are nearing end of life.
- Parameter Providers: we are introducing the concept of Parameter Providers, allowing users to fetch the values of parameters from external locations. In addition to a better separation of duties, this also makes CI/CD easier. With this release, we're supporting the following Parameter Providers: AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault, Database, Environment Variables, and External file.
- Registry Client to connect to a DataFlow Catalog. The registry endpoint is now an extension in NiFi, which means it is no longer limited to accessing a NiFi Registry instance. With this release we're adding an implementation allowing users to connect NiFi to their DataFlow Catalog and use it just like they would NiFi Registry. For hybrid customers, this means they can easily check out and version flow definitions in the same place for both on-prem and cloud usage. It also means that on-prem customers can access the ReadyFlows gallery, assuming they have a public cloud tenant.
- Iceberg processor (Tech Preview): we're making a PutIceberg processor available in Technical Preview, allowing users to push data into Iceberg using NiFi. This can be used in both batch and streaming (micro-batch) fashion.
- Snowflake ingest with Snowpipe (Tech Preview): until now, only JDBC could be used to push data into Snowflake with NiFi. We're now making available a set of processors leveraging Snowpipe to push data into Snowflake in a more efficient way.
- New components: we are adding a bunch of new components...
  - ConsumeTwitter
  - Processors to interact with Box, Dropbox, Google Drive, SMB
  - Processors to interact with HubSpot, Shopify, Zendesk, Workday, Airtable
  - PutBigQuery (leveraging the new API)
  - ListenBeats is now Cloudera supported
  - UpdateDatabaseTable to manage updates to a table's schema (add columns, for example)
  - AzureEventHubRecordSink & UDPEventRecordSink
  - CiscoEmblemSyslogMessageReader to make it easy to ingest logs from Cisco systems such as ASA VPNs
  - ConfluentSchemaRegistry is now Cloudera supported
  - Iceberg and Snowflake components as mentioned before
- Replay last event: with this release we add the possibility to replay the last event at the processor level (right-click on the processor, replay last event). This makes it super easy to replay the last flow file, instead of going to the provenance events, finding the last event, and clicking replay. This is something very useful when developing flows!
- And, as usual, bug fixes, security patches, performance improvements, etc.
09-16-2021
02:40 AM
1 Kudo
With the release of CDP 7.2.11, it is now super easy to deploy your custom components on your Flow Management DataHub clusters by dropping your components into a bucket of your cloud provider. Until now, when building custom components for NiFi, you had to SSH to all of your NiFi nodes to deploy your components and make them available for use in your flow definitions. This added operational overhead and also caused issues when scaling up clusters. From now on, it's easy to configure your NiFi clusters to automatically fetch custom components from an external location in the object store of the cloud provider where NiFi is running. All of your nodes will fetch the components after you drop them into the configured location. You can find more information in the documentation.
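As a rough, hypothetical illustration of what such a configuration involves, the NiFi NAR provider mechanism is driven by nifi.properties entries along these lines. The provider name, class, paths, and even the exact property keys below are assumptions from memory and may differ by CFM/NiFi version; on a DataHub cluster you would set this through Cloudera Manager as described in the documentation rather than editing nifi.properties by hand:

# Hypothetical sketch only -- check the official documentation for the exact
# property names and provider class supported by your CFM version.
nifi.nar.library.provider.cloud.implementation=org.apache.nifi.nar.hadoop.HDFSNarProvider
nifi.nar.library.provider.cloud.resources=/etc/nifi/conf/core-site.xml
nifi.nar.library.provider.cloud.storage.location=s3a://my-bucket
nifi.nar.library.provider.cloud.source.directory=/custom-nars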
09-16-2021
02:31 AM
1 Kudo
With the release of CDP 7.2.11, you now have the possibility to scale up and down both your light duty and heavy duty Flow Management clusters on all cloud providers. You can find more information in the documentation.
11-25-2018
11:17 AM
I'm not sure I understand the process you're following. Are you using the NiFi Registry? If so, when deploying a registered flow, it'll use the variables as they were defined when you committed the version to the Registry. After the import, in the target NiFi, you'd have to update the variables with the values for the target environment. Does that make sense?
11-23-2018
03:42 PM
Hey, are you sure you have deployed the correct version of your Groovy script on the NiFi node(s)? The error suggests that the Groovy script being executed is not correct, even though the one you provided looks good to me.
09-01-2018
04:46 PM
Hi @A C, Depending on the query you're executing, it might not result in an actual job running on YARN; that depends on your parameters around "fetch tasks" in Hive. Regarding supervisors, that's related to the Storm service, so it's safe to ignore. HTH
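For reference, a minimal illustration of the Hive settings I'm referring to, as they would appear in hive-site.xml (the values shown are examples; whether a given query is converted to a local fetch task instead of a YARN job depends on them):

<!-- Example values only: controls whether simple queries are served by a local
     fetch task instead of being submitted as a YARN job. -->
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
</property>
<property>
  <name>hive.fetch.task.conversion.threshold</name>
  <value>1073741824</value>
</property>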
05-24-2018
08:28 AM
1 Kudo
Hi @Jérôme ROUSSEAU, A new version of the connector (1.5.4) should be available for download on the website very soon. Pierre
04-18-2018
09:42 AM
Hi @Chad Woodhead, Thanks for reporting the issue, I reproduced it and created NIFI-5092 to track it: https://issues.apache.org/jira/browse/NIFI-5092 Current workaround: after a NiFi restart, stop the reporting task, clear the state of the reporting task and start the reporting task. A fix will be provided very quickly and if needed you would be able to build the NAR containing the reporting task and drop it in NiFi to use the fixed version. Hope this helps, Pierre
03-16-2018
03:35 PM
Hi @Amira khalifa, You might want to have a look at ValidateRecord processor or ValidateCSV processor. Hope this helps.
03-12-2018
08:23 PM
1 Kudo
Hi @Haitam Dadsi, That's something you could easily achieve using Apache NiFi. You could have a simple workflow looking like: GenerateTableFetch -> ExecuteSQL -> PublishKafkaRecord. You will find plenty of examples for such processors. Hope this helps.
03-11-2018
09:58 AM
What you could do is to play with the combination of backpressure parameters and scheduling frequency for the first processor after the reception of the data. Or just use the ControlRate processor. When backpressure is enabled between the RPG and ControlRate, the RPG will stop pulling the data from the remote system.
03-10-2018
10:06 AM
Hi @Gireesh Kumar Gopinathan, I think you might be interested by this article: https://community.hortonworks.com/articles/109629/how-to-achieve-better-load-balancing-using-nifis-s.html Hope this helps.
03-09-2018
05:31 PM
1 Kudo
Hi @Jessica David, I confirm this is a bug. I created a JIRA for that: https://issues.apache.org/jira/browse/NIFI-4955 I will submit a fix in a minute. Thanks for reporting the issue. I assume you're using the header as the schema access strategy in the CSV Reader. If you're able to use a different strategy (schema name, or schema text), it should solve the problem even though you need to explicitly define the schema.
03-08-2018
08:55 AM
Hi @Saeed Barghi, To enable authentication/authorization on NiFi, the first requirement is to enable SSL. You will need to generate certificates for your nodes, create the appropriate keystores/truststores, and configure NiFi accordingly. Once NiFi is secured, you have multiple options for authentication: individual SSL certificates, Kerberos, LDAP. In your case you would want to issue a certificate for each of your users, which they will use in their browser to authenticate against NiFi. Then you can define the appropriate authorizations to ensure multi-tenancy. This post might give you a better understanding of what needs to be done: https://pierrevillard.com/2016/11/29/apache-nifi-1-1-0-secured-cluster-setup/ If you're installing HDF with Ambari, then a lot of this can be done automatically for you using the internal CA provided with the NiFi toolkit. Hope this helps.
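For a rough idea of what "configure NiFi accordingly" means, the TLS-related part boils down to nifi.properties entries along these lines (hostnames, paths, passwords, and the port below are placeholders; when you let Ambari and the NiFi CA handle it, these are generated for you):

# Placeholder values for illustration only
nifi.web.https.host=nifi-node-1.example.com
nifi.web.https.port=9443
nifi.security.keystore=/etc/nifi/conf/keystore.jks
nifi.security.keystoreType=JKS
nifi.security.keystorePasswd=changeit
nifi.security.keyPasswd=changeit
nifi.security.truststore=/etc/nifi/conf/truststore.jks
nifi.security.truststoreType=JKS
nifi.security.truststorePasswd=changeit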
03-08-2018
08:41 AM
1 Kudo
Hi @Sami Ahmad, As stated in the processor description/documentation: "This processor uses Hive Streaming to send flow file data to an Apache Hive table. The incoming flow file is expected to be in Avro format and the table must exist in Hive. Please see the Hive documentation for requirements on the Hive table (format, partitions, etc.). The partition values are extracted from the Avro record based on the names of the partition columns as specified in the processor. NOTE: If multiple concurrent tasks are configured for this processor, only one table can be written to at any time by a single thread. Additional tasks intending to write to the same table will wait for the current task to finish writing to the table." You'll need to convert your data into Avro first. The best approach is to use the record processors for that. Hope this helps.
03-08-2018
08:38 AM
Could you share the full content of your nifi.properties file (removing any sensitive values)? Did you configure anything regarding authentication/authorization?
03-08-2018
08:25 AM
1 Kudo
That's completely valid. Let's say that HTTP has the advantage of not requiring any new configuration when installing NiFi. In some environments, adding the use of an additional port can create additional work with network/security teams. Another advantage of HTTP is the possibility to use a proxy if that's required to allow communication between multiple sites. If you expect to have high load with S2S and can manage the extra conf, RAW is certainly a better option. Hope this helps.
03-07-2018
09:51 AM
1 Kudo
Hi @User 805, You could use the InferAvroSchema processor to infer the schema from your JSON data and put that schema in an attribute of the flow files; you can then configure your JSON Reader controller service to use this attribute as the schema source. Note that not defining the schema of each of your sources in a schema registry can be a source of errors: if some of your JSONs are missing some fields, you will miss columns in the generated CSV. Also, inferring the schema for every single flow file can be costly in terms of performance. What I'd recommend is to use the InferAvroSchema processor to help you generate the schema for each one of your sources, then put the schemas in the schema registry of your choice and use that. It'll be more performant and you can leverage the features of the schema registry, such as schema versioning, backward/forward compatibility, and reuse of the schemas across multiple components. Hope this helps.
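For illustration, the kind of Avro schema you would end up registering could look like the following (a hypothetical schema; the record and field names are made up for the example):

{
  "type": "record",
  "name": "customer",
  "namespace": "com.example",
  "fields": [
    { "name": "id", "type": "long" },
    { "name": "email", "type": ["null", "string"], "default": null },
    { "name": "created_at", "type": ["null", "string"], "default": null }
  ]
}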
03-07-2018
09:45 AM
1 Kudo
Hi @Chad Woodhead, This property needs to be set if and only if you are doing S2S using the RAW protocol. Any reason not to use the HTTP protocol for S2S? That way, you don't have anything to set and it'll work OOTB with the default configuration. If, however, you want to use RAW and you are configuring NiFi using Ambari, you don't need to use config groups; you can set the property to: {{nifi_node_host}} It's a placeholder that contains the FQDN of each node. Note that this property already has a default value generated if you leave it blank (see the description of this property in Ambari). Unless you have a specific network configuration, the only property you would need to specify is nifi.remote.input.socket.port to define the port that will be used for RAW S2S. Hope this helps.
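In other words, for RAW S2S the relevant nifi.properties entries would end up looking something like this (the port number is just an example, and nifi.remote.input.secure should match whether your cluster is secured):

# {{nifi_node_host}} is resolved by Ambari to each node's FQDN
nifi.remote.input.host={{nifi_node_host}}
nifi.remote.input.socket.port=10443
nifi.remote.input.secure=true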
03-07-2018
09:37 AM
Hi @Amira khalifa, Why don't you want to define the schema of your target tables in the schema registry? You can use InferAvroSchema to help you initialize the schema, then adapt it and copy/paste it into the schema registry of your choice. Using InferAvroSchema for every single flow file you ingest can be costly in terms of performance, and it only infers the schema from a subset of the data: as you can see, the resulting schema might not be completely correct or take all the possibilities into account. What you could do, even though it's not what I'd recommend as a long-term solution, is to use expression language to update the Avro schema generated as an attribute of your flow files. Hope this helps.
03-07-2018
09:31 AM
Hi @Manikandan Jeyabal, This is not possible at the moment. The number of threads used at any point in time can be set at the NiFi level, but there is no way to fix the size of the thread pool per Process Group. It could be an interesting feature; feel free to raise a JIRA at https://issues.apache.org/jira/projects/NIFI Hope this helps.
03-03-2018
01:46 PM
OK, that's weird because the correct version is displayed as current on your hosts. You could try the following: ambari-server set-current --cluster=hdplid4 --version-display-name=HDP-2.6.4.0 If that does not help, I'd try looking at the 'host component state' table in Ambari database.
03-03-2018
01:37 PM
Hi, What are the RUNNING jobs that stop making progress? In particular, are these jobs oozie-launcher jobs? You can check that by looking at the names of the jobs in the Scheduler view of the RM UI. Oozie has a non-intuitive way of launching jobs due to legacy behaviors from Hadoop 1 (note that this will be fixed with Oozie 5.x). In short, an Oozie action launches an oozie-launcher job (1 AM container + 1 mapper) that is responsible for actually launching the job you defined in your Oozie action. In the end, your Oozie action will actually require two jobs from YARN's point of view. When running multiple workflows at the same time, you could end up with a lot of oozie-launcher jobs filling up the queue capacity and preventing the actual jobs from being launched (they will remain in the ACCEPTED state). A common practice is to have a dedicated queue for the oozie-launcher jobs created by Oozie workflows. This way you prevent this kind of deadlock situation. IIRC, you can set the queue for oozie-launcher jobs using oozie.launcher.mapred.job.queue.name. Hope this helps.
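For example, a workflow action could route its launcher to a dedicated queue with a configuration block like the following (the queue name "launchers" is just an example and must exist in your scheduler configuration):

<!-- Example only: send the oozie-launcher job of this action to a dedicated queue -->
<configuration>
  <property>
    <name>oozie.launcher.mapred.job.queue.name</name>
    <value>launchers</value>
  </property>
</configuration>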
03-03-2018
10:18 AM
Hi Carlos, It sounds like a display bug. Did you try a force refresh in your browser? To be honest, I believe that the correct version will be installed if you add Druid because it'll rely on the repository files deployed when you installed the new version. Pierre
03-03-2018
10:16 AM
1 Kudo
Hi Jay, You need to provide a file such as https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-standard-bundle/nifi-standard-processors/src/test/resources/TestExtractGrok/patterns You could also use the ConvertRecord processor with a GrokReader. In this case there is already a default pattern file pre-loaded with the reader. Hope this helps, Pierre
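For illustration, such a patterns file simply defines one named pattern per line, built from the standard Grok patterns; the custom pattern names below are made up for the example:

MYTIMESTAMP %{YEAR}-%{MONTHNUM}-%{MONTHDAY} %{HOUR}:%{MINUTE}:%{SECOND}
MYLOGLINE %{MYTIMESTAMP:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}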
12-22-2017
03:54 PM
Or... since you're not using an LDAP, you could directly use the File Authorizer instead of the managed one: <authorizer>
<identifier>file-provider</identifier>
<class>org.apache.nifi.authorization.FileAuthorizer</class>
<property name="Authorizations File">./conf/authorizations.xml</property>
<property name="Users File">./conf/users.xml</property>
<property name="Initial Admin Identity"></property>
<property name="Legacy Authorized Users File"></property>
<property name="Node Identity 1"></property>
</authorizer> And then just reference this identifier in the nifi.properties file: nifi.security.user.authorizer=file-provider
12-22-2017
03:46 PM
Hi @Zeeshan Cornelius, To have something configurable from NiFi UI (allowing you to manage users/groups from the Users view), I believe you'd need to go through the definition of a Composite Configurable User Group provider. Your authorizers.xml file should look like: <authorizers>
<userGroupProvider>
<identifier>file-user-group-provider</identifier>
<class>org.apache.nifi.authorization.FileUserGroupProvider</class>
<property name="Users File">./conf/users.xml</property>
<property name="Legacy Authorized Users File"></property>
<property name="Initial User Identity 1">admin</property>
</userGroupProvider>
<userGroupProvider>
<identifier>composite-configurable-user-group-provider</identifier>
<class>org.apache.nifi.authorization.CompositeConfigurableUserGroupProvider</class>
<property name="Configurable User Group Provider">file-user-group-provider</property>
</userGroupProvider>
<accessPolicyProvider>
<identifier>file-access-policy-provider</identifier>
<class>org.apache.nifi.authorization.FileAccessPolicyProvider</class>
<property name="User Group Provider">composite-configurable-user-group-provider</property>
<property name="Authorizations File">./conf/authorizations.xml</property>
<property name="Initial Admin Identity">admin</property>
<property name="Legacy Authorized Users File"></property>
</accessPolicyProvider>
<authorizer>
<identifier>managed-authorizer</identifier>
<class>org.apache.nifi.authorization.StandardManagedAuthorizer</class>
<property name="Access Policy Provider">file-access-policy-provider</property>
</authorizer>
</authorizers> Let me know if this helps, Pierre.
10-30-2017
11:01 AM
In case someone faces the same issue: in my case, I solved it by ensuring that the 'atlas' user is known in Ranger.
10-17-2017
02:15 PM
1 Kudo
Hi @msumbul, The ConvertJsonToSQL processor has been developed to work with the SQL processors. The Hive processor expects a different naming convention for the attributes: instead of sql.args.N.type/value, it expects hiveql.args.N.type/value. The SQL and HiveQL syntaxes are close enough that the generated query is also valid for Hive, but the attributes need to be renamed. What you can do is use an UpdateAttribute processor to produce the expected attributes before the PutHiveQL processor (a sketch follows below). Hope this helps.
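As a hypothetical illustration, for a statement with two parameters the UpdateAttribute processor would carry dynamic properties like these, one type/value pair per parameter (this only works when you know the number of parameters in advance):

hiveql.args.1.type = ${sql.args.1.type}
hiveql.args.1.value = ${sql.args.1.value}
hiveql.args.2.type = ${sql.args.2.type}
hiveql.args.2.value = ${sql.args.2.value}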