
Productionizing Apache Nifi


Contributor

We have built a number of PoCs with NiFi, such as Kafka to S3, SQL Server to S3, and Kafka to Cassandra, and we plan to use NiFi for a few more use cases. The PoCs currently run on a single NiFi instance on one EC2 instance, but once we go live we will have terabytes of SQL Server data and multiple data sources, such as Kafka and SQL Server.

1. Can a single instance of NiFi handle huge volumes of data, such as exporting terabytes of SQL Server data to S3 in batches?

2. I think NiFi will sit on an edge node rather than inside the cluster. Please confirm?

3. Is any organization currently using NiFi in production?

4. What are the best practices for productionizing NiFi?

5. To handle large data volumes and data from multiple sources, do I need multiple instances of NiFi?

Thank you.

1 ACCEPTED SOLUTION


Re: Productionizing Apache Nifi

Master Guru

I recommend setting up a NiFi cluster, which spreads the load across multiple resources. This removes the single point of failure caused by having only one EC2 instance running NiFi.

Whether a single EC2 instance running NiFi can handle your dataflows depends a lot on your data and on what your specific dataflows look like. For example, are you doing a lot of CPU- or memory-intensive processing in your NiFi dataflows?

A good approach is to have NiFi instances sitting on edge systems that feed a central NiFi processing cluster.


4 REPLIES 4

Re: Productionizing Apache Nifi

Contributor

@mclark I believe it will be more CPU-intensive than memory-intensive, since we are using NiFi to export data from various sources to different sinks. So if we put NiFi on edge nodes:

1. I believe we need multiple edge nodes running a NiFi cluster, plus another cluster, perhaps running Storm or Spark, to receive and process the data?

2. What about pure data transport, i.e. transferring raw JSON data from Kafka to S3, or SQL Server data from around 50 tables to S3? How do we set up NiFi in that case?

Thank you


Re: Productionizing Apache Nifi

@BigDataRocks @mclark

You will need to plan your NiFi production cluster based on your volume requirements.

- If you are just looking to transfer a huge volume of data from a source to a sink, you need to ensure you have enough space available for the content repository. Also ensure that your content repository is set up on a separate disk from the flowfile and provenance repositories.

- From a productionizing perspective, it is also important to build error handling into your flows, so that your teams are notified when errors occur and every error is written to the logs.

- It will probably be better to run multiple instances of NiFi for data from multiple sources, because NiFi currently doesn't offer security at the level of individual flows. In the current security model, a flow administrator has access to all flows running on one instance. By running multiple instances of NiFi, you can control security for each flow.

- If you decide to use one instance of NiFi, you can use process groups to organize your dataflows.

- You should also think about setting up reporting tasks for disk usage and memory, such as MonitorDiskUsage and MonitorMemory, that will give you warnings at appropriate thresholds.

- For dataflows with significant processing requirements, you will need a cluster setup to distribute the load across different nodes. You can also increase the number of concurrent tasks for any processor that requires more processing power.
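
As a rough illustration of the repository and disk-usage points above, here is a minimal Python sketch that could be run outside NiFi as a pre-flight check. It assumes the repository locations are read from `nifi.properties` using the default property keys, and that the configured paths are absolute (the stock file uses paths relative to the NiFi home directory, which would need resolving first). The helper names `repos_sharing_content_disk` and `usage_warning` are made up for this example and are not part of NiFi.

```python
import os
import shutil

# Property keys as they appear in a default nifi.properties file.
REPO_KEYS = {
    "flowfile": "nifi.flowfile.repository.directory",
    "content": "nifi.content.repository.directory.default",
    "provenance": "nifi.provenance.repository.directory.default",
}


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def device_of(path):
    """Return the device id of the filesystem that holds `path`."""
    return os.stat(path).st_dev


def repos_sharing_content_disk(props, device_lookup=device_of):
    """List the repositories that sit on the same device as the content repository."""
    content_dev = device_lookup(props[REPO_KEYS["content"]])
    return [
        name
        for name in ("flowfile", "provenance")
        if device_lookup(props[REPO_KEYS[name]]) == content_dev
    ]


def usage_warning(path, threshold_percent, usage_lookup=shutil.disk_usage):
    """Return a warning string if the disk holding `path` is fuller than the threshold."""
    usage = usage_lookup(path)
    used_percent = 100.0 * usage.used / usage.total
    if used_percent > threshold_percent:
        return f"{path}: {used_percent:.1f}% used exceeds {threshold_percent}%"
    return None
```

In practice you would read the text of your own `nifi.properties`, pass the parsed result to `repos_sharing_content_disk`, and treat a non-empty list as a sign that the content repository shares a disk with another repository. The threshold passed to `usage_warning` is whatever you consider appropriate for your volumes; the `device_lookup` and `usage_lookup` parameters exist only so the logic can be exercised without touching real disks.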


Re: Productionizing Apache Nifi

Contributor

@Shishir Saxena Thanks for the answer. How is error handling built into NiFi flows? And what are the content and flowfile repositories, i.e. which data does NiFi store in each of them? Thank you
