Created on 07-21-2020 12:07 PM - edited 05-11-2022 10:06 AM
This article contains Questions & Answers on Cloudera DataFlow (CDF).
What is the difference between Cloudera’s NiFi and Apache NiFi?
Cloudera Flow Management (CFM) is based on Apache NiFi but comes with all the additional platform integration that you’ve just seen in the demo. We make sure it works with CDP’s identity management and integrates with Apache Ranger and Apache Atlas. The original creators of Apache NiFi work for Cloudera.
Does NiFi come with CDP public cloud or is it an add on?
Any CDP Public Cloud customer can start using NiFi by creating Flow Management clusters in CDP Data Hub. You don’t need additional licenses, and you will be charged based on how many instance hours your clusters consume.
What types of read/write data does NiFi support?
NiFi ships with 400+ processors covering a wide range of sources and destinations.
Do you support cloud-native data sources and sinks?
Yes, we support a lot of cloud-native sources and sinks with dedicated processors for AWS, Azure, and GCP. They allow you to interact with the managed services and object storage solutions of the cloud providers (S3, GCS, ADLS, Blob Storage, Event Hubs, Kinesis, Pub/Sub, BigQuery, etc.).
When you load data into object storage, what details do you need to know?
Two options. One, putting data into an object store under CDP control is simple: you need to know where you want to write to, plus your CDP username and password. Two, if you use an object store outside of CDP control, you need to use the cloud connector and the specifics of the authentication method that you want to use.
How does using NiFi in the upcoming DataFlow service differ from using NiFi as a Flow Management cluster on CDP Data Hub?
CDP Data Hub makes it very easy to create a fully secure NiFi cluster using the preconfigured Flow Management cluster definitions. These clusters run on virtual machines and offer the traditional NiFi development and production experience. CDP DataFlow Service, on the other hand, focuses on deploying and monitoring NiFi data flows. It takes care of deploying the required NiFi infrastructure on Kubernetes, providing auto-scaling and better workload isolation. Running NiFi in CDP DataFlow Service will be ideal for NiFi flows where you expect bursty data. Since the service manages the underlying cluster lifecycle, you can focus on developing and monitoring your data flows.
How does DataFlow compare with the tools I can use from cloud vendors?
Flow Management is based on Apache NiFi, which is not available from any other cloud vendor. In addition to CDP being the only cloud service provider for Apache NiFi, our additional Streams Messaging and Streaming Analytics components are tightly integrated with each other, allowing centralized security policy management and data governance. This is powered by Cloudera SDX and helps you understand the end-to-end data flow across the entire Cloudera Flow portfolio plus other CDP components like Hive or Spark jobs.
How are policies set on S3 buckets in AWS by CDF?
IDBroker lets you map your CDP users to IAM roles. The mapping is from a specific user to a specific role that then allows them to access a specific S3 bucket.
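Conceptually, this is a lookup: a user resolves to at most one role, and access to a bucket is granted only through that role. The sketch below is a plain-Python illustration of that idea, not the IDBroker API; all user, role, and bucket names are made up.

```python
# Conceptual illustration (not the IDBroker API) of user -> role ->
# bucket access mapping. All names below are hypothetical.
MAPPINGS = {
    "etl_user": "arn:aws:iam::111122223333:role/etl-s3-access",
}

ROLE_BUCKETS = {
    "arn:aws:iam::111122223333:role/etl-s3-access": {"etl-landing-bucket"},
}

def role_for_user(user):
    """Return the IAM role a CDP user maps to, if any."""
    return MAPPINGS.get(user)

def can_access(user, bucket):
    """A user can access a bucket only via a mapped role."""
    role = role_for_user(user)
    return role is not None and bucket in ROLE_BUCKETS.get(role, set())
```

The point of the indirection is that credentials for the bucket are never handed to the user directly; only the mapped role carries the S3 permissions.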
Can you load data into Hive from NiFi?
Yes. Just use the PutHive3Streaming NiFi processor and set a few parameters.
Can I access my own internal API with NiFi?
Yes. You can write a custom NiFi processor in Java or use an HTTP processor.
Is MiNiFi support part of NiFi support?
Support for MiNiFi comes with Cloudera Edge Management (CEM). CEM includes not only MiNiFi but also Cloudera Edge Flow Manager, which allows you to design edge flows centrally and push them out to the thousands of MiNiFi agents you’re running. This is currently offered independently of CDP, and we’re working on bringing it into the CDP experience as well. See this link for more info.
What platforms can I run the MiNiFi agent on?
Virtually any hardware or device where you can run a small C++ or Java application.
How do I get MiNiFi on my devices?
You’ll need to install MiNiFi on the devices that you need to monitor. MiNiFi is part of Cloudera Edge Management and comes with the Edge Flow Manager tool allowing you to design flows in a central place and push them out to all your agents at the same time.
Is there a way of versioning the data flows?
Yes. NiFi comes with the NiFi Registry that lets you version flows. In Data Hub this is set up automatically for you.
Where is NiFi in-transit data stored?
NiFi stores the data flowing through it in so-called ‘repositories’ on local disk. In the example that was running on AWS, the NiFi instances have EBS volumes mounted where all that data is stored. NiFi also stores historic provenance data on disk so you can look up the details and lineage of data long after it has been processed in the flow.
If the data ingested has a record updated, does it come back ingested as a new entry, or with PK it gets updated?
Depends on your data ingest pipeline. NiFi is able to pick up updated records and move them through its data flow. If you are sending records to Kafka, Kafka doesn’t care whether a record is an update or not; the downstream application has to handle that. If you’re using Hive, you can use the PutHive3Streaming processor in NiFi, which is able to handle upserts. If Kudu is your target, upserts are also supported.
Does NiFi have a resource manager for different components of its pipeline?
By default, all NiFi nodes process data and NiFi is optimized to process data as quickly as possible. So it makes use of all resources that are given to it. It currently does not have an internal resource manager to assign resources to a specific flow. Going forward we’ll be running flows in their own clusters on Kubernetes to improve this experience.
Is it possible to use metadata from Atlas in NiFi?
Currently, Atlas is used to capture NiFi data provenance metadata and to keep it up to date.
When using Atlas is there a manual setup required to use NiFi and Kafka in CDF?
No. That’s the benefit of using CDF on top of the Cloudera Data Platform (CDP) public cloud.
How do I connect my own Kafka producers/consumers to a Streams Messaging cluster in CDP Public Cloud Data Hub?
You can connect Kafka clients to Streams Messaging clusters no matter where your clients are running. We have published detailed instructions here.
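As a rough sketch of what such a client configuration can look like, the snippet below builds a SASL_SSL property map using librdkafka-style property names (as used by confluent-kafka clients). The broker address, credentials, and certificate path are placeholders, and the exact mechanism depends on how your cluster is secured, so treat the published instructions as authoritative.

```python
# Sketch of a SASL_SSL client configuration for a secured Kafka
# cluster. All endpoint and credential values are placeholders.
def client_config(bootstrap, user, password, ca_cert_path):
    return {
        "bootstrap.servers": bootstrap,      # broker FQDN:port (placeholder)
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.username": user,               # CDP workload user (placeholder)
        "sasl.password": password,           # workload password (placeholder)
        "ssl.ca.location": ca_cert_path,     # cluster CA certificate
    }

cfg = client_config("broker-0.example.com:9093", "wuser", "secret", "/tmp/ca.pem")
```

You would pass such a dictionary to your Kafka client library of choice; the important part is that the client authenticates with workload credentials over TLS rather than connecting in plaintext.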
What's the best way to extend an existing Kafka deployment on-prem to the public cloud with CDP?
If you have an existing Kafka cluster on-premises, the best way to extend it is:
1. Create a Streams Messaging cluster in CDP Public Cloud
2. Use Streams Replication Manager to set up replication between the two environments
The replication can be from on-prem to cloud, vice versa or even bidirectional. Check out this Streams Replication Manager doc for more info.
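SRM declares replication flows with cluster aliases and arrow-style flow properties (it builds on the MirrorMaker 2 model). A minimal hypothetical sketch, assuming the aliases onprem and cloud and placeholder broker addresses:

```properties
# Hypothetical sketch: two cluster aliases and a one-way
# onprem -> cloud replication flow covering all topics.
clusters = onprem, cloud
onprem.bootstrap.servers = onprem-broker-1:9092
cloud.bootstrap.servers = cloud-broker-1:9093
onprem->cloud.enabled = true
onprem->cloud.topics = .*
```

A bidirectional setup would additionally enable the cloud->onprem flow; see the SRM documentation for the exact properties your version supports.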
What is the value of having Atlas for provenance when NiFi already has data provenance built-in?
NiFi data provenance captures what is happening in NiFi to a very detailed level. It shows you the actual data flowing through. Atlas covers lineage on a data set level so it doesn’t contain the detailed records but rather shows you the end-to-end lineage. So you’ll see data lineage through your entire pipeline across NiFi, Hive, Kafka, and Spark.
Once a stream is processed, how can I consume this data with analytics or reporting tools from on-premises?
Depends on your pipeline. You could use NiFi to write to an on-premises data store where you already have your analytics tools connected. If you are writing data to the cloud, you can configure your analytics/reporting tools to access that data there.
What is the best option to serve trained ML models for streaming data? From within NiFi or Flink?
Both options are possible. From within a NiFi flow, you can call out to a trained model hosted in Cloudera Machine Learning (CML). And Flink lets you embed ML models directly in the stream.
Is NiFi good for complex transformations?
Depends on how complex 🙂 Generally, though, as complexity increases, Flink and Spark Streaming are a better fit.
Can you use NiFi for real-time as well as batch processing?
Yes. Both event-driven and batch processing modes are possible in NiFi.
What is the minimum number of nodes needed for a Data Flow cluster?
The number of nodes is configurable, but we have defaults for heavy and light duty clusters for both Flow Management and Streams Messaging. See details of this node sizing and layout in the documentation: Flow Management cluster layout and Streams Messaging cluster layout.
In the case of NiFi node failure, does the data in the processing of this node automatically recover on another NiFi node?
In CDP Data Hub, yes. The data is stored on EBS volumes, so if a NiFi node fails we automatically replace the instance and reattach its EBS volume to the new node, and processing picks up immediately after recovery.
Is there a GUI to track/monitor all the NiFi error flows?
Today you can do it with a ReportingTask sending information to your reporting tool of choice or a secondary database or Kafka. We are looking to release alerting and monitoring features in the next 6-12 months for public/private cloud that will work natively out of the box.
Does NiFi provide a notification or email facility when a possible bottleneck occurs or a threshold is reached in the workflow?
Yes, you can send email alerts based on failures in your NiFi flow. You can also send metrics and other information to external systems like Datadog or Prometheus.
Can I create alerts on Apache Kafka topics?
Yes, with Streams Messaging Manager (SMM) it’s easy to define alerts to manage the SLAs of your applications. SMM provides rich alert management features for the critical components of a Kafka cluster, including brokers, topics, consumers, and producers. You can set alerts, for example, on consumer lag or on data in/out rates when thresholds are exceeded.
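For intuition, consumer lag per partition is the latest (log-end) offset minus the group’s committed offset. The sketch below, plain Python rather than anything from SMM, shows a threshold check over such lags.

```python
# Illustrative consumer-lag alert check (not SMM's implementation):
# lag per partition = log-end offset - committed consumer offset.
def partition_lag(log_end_offset, committed_offset):
    return max(0, log_end_offset - committed_offset)

def lag_alerts(partitions, threshold):
    """Return partitions whose lag exceeds the alert threshold.

    `partitions` maps partition id -> (log_end_offset, committed_offset).
    """
    return {
        p: partition_lag(end, committed)
        for p, (end, committed) in partitions.items()
        if partition_lag(end, committed) > threshold
    }
```

In SMM you configure the threshold and notifier in the UI; the underlying idea is the same comparison per partition or per group.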
Can you expose alerts in Streams Messaging Manager? Can you expose these via email, for example?
Yes, you can send alerts via a notifier to an email address or via an HTTP endpoint to any monitoring system you may have that accepts HTTP requests.
Is there a central way to track authentication failures?
Yes, as an admin you can use an audit view in Ranger for all authorization requests. You can then track all allowed or denied requests of your Kafka clients across the enterprise.
How do you reroute failed messages?
If you want to handle errors, you would connect the error relationships to one process group that then handles your errors. You can apply corrections to these failed events and try to reprocess them. And if you’re working with Kafka, your events are always safe in Kafka, so you can always reprocess them.
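The pattern described above, routing failures to an error handler, correcting them, and feeding them back through processing, can be sketched in plain Python (this is illustrative only, not the NiFi API):

```python
# Route-and-retry sketch: failed events go to an error queue,
# get corrected, and are fed back through processing.
def process(event):
    """Toy processor: fails on events missing an 'id' field."""
    if "id" not in event:
        raise ValueError("missing id")
    return {**event, "processed": True}

def run_flow(events, fix):
    done, errors = [], []
    for event in events:
        try:
            done.append(process(event))
        except ValueError:
            errors.append(event)          # the 'error relationship'
    for event in errors:                  # reprocess after correction
        done.append(process(fix(event)))
    return done
```

In NiFi the same shape is expressed visually: the failure relationship of a processor feeds a correction flow whose output loops back into the main path.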
Where are parameters set in NiFi?
Parameters are configured in a parameter context, which holds a list of parameters and their values. You can then assign a parameter context to a specific process group.
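NiFi references parameters inside property values with the #{name} syntax. As a rough illustration of how such references resolve against a parameter context, here is a small Python sketch (not NiFi’s implementation):

```python
import re

# Sketch of resolving NiFi-style parameter references (#{name})
# in a property value against a parameter context.
def resolve(value, context):
    """Replace each #{param} reference with its value from the context."""
    return re.sub(r"#\{([^}]+)\}", lambda m: str(context[m.group(1)]), value)

context = {"kafka.brokers": "broker-1:9092", "topic": "events"}
resolved = resolve("topic-#{topic}", context)
```

Because the context is assigned at the process group level, the same flow definition can be promoted between environments just by swapping parameter contexts.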