How to distribute files across a NiFi cluster and process each file entirely on one node?

Explorer

I have a NiFi cluster, and I want files that I store in a directory on one node to be distributed across the cluster for processing.

To keep the question short:

  • Input files: roughly 20 files arrive in each batch
  • The data flow works fine when I run it on a single node (not distributed)
  • The output will be stored on HDFS, so that part is not a problem. I just want my input files distributed evenly across the NiFi nodes so the processing is fast.
  1. How can I get the input files split roughly evenly across the NiFi nodes, with each node running the entire flow for each of its files, one by one? What I mean is that one of my processors needs the entire file to be present on the node, not just parts of the file.
  2. Should I upload all my input files to the primary node? How would the files get distributed from there? A downside of this would be that all files then need to be redistributed between the nodes (this is doable, though; the files are not that big).
  3. Another idea would be to store the input files in a specific directory on each node in the cluster. But then the "import" script that delivers the files would need to know all the nodes in the cluster and handle node downtime as well...

Thanks in advance

1 ACCEPTED SOLUTION

Master Guru

There are two solutions that would work well here...

1) Have the import process distribute the files evenly to all the NiFi nodes; then each NiFi node doesn't have to worry about anything and just processes the files on its own local file system. I think this is what you meant in #3.
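
As a minimal sketch of what that import-side distribution could look like (the node names, directories, and use of scp below are illustrative assumptions, not part of the answer itself):

    #!/usr/bin/env python3
    # Hypothetical import-side distributor: copy each file of a batch to the
    # next NiFi node in round-robin order, so every node gets a roughly even
    # share to process from its own local file system.
    import subprocess
    from itertools import cycle
    from pathlib import Path

    NIFI_NODES = ["nifi-node-1", "nifi-node-2", "nifi-node-3"]  # assumed host names
    REMOTE_INPUT_DIR = "/data/nifi/input"                       # assumed per-node input directory
    LOCAL_BATCH_DIR = Path("/data/import/batch")                # assumed local staging directory

    def distribute(batch_dir: Path) -> None:
        """Copy each file in the batch to the next node in round-robin order."""
        nodes = cycle(NIFI_NODES)
        for path in sorted(batch_dir.iterdir()):
            if not path.is_file():
                continue
            node = next(nodes)
            # A GetFile/ListFile on each node would then pick the file up locally.
            subprocess.run(["scp", str(path), f"{node}:{REMOTE_INPUT_DIR}/"], check=True)

    if __name__ == "__main__":
        distribute(LOCAL_BATCH_DIR)

As the question already notes, a script like this has to keep its node list current and deal with nodes being down, which is the main drawback of this approach.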

2) Mount a shared network drive on all the nodes, upload the files to the shared drive, then use ListFile running on the primary node only to list the shared directory, followed by a Remote Process Group to distribute the listings to all the NiFi nodes, then a FetchFile on each node to retrieve the files from the listings.
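
Roughly, the List+Fetch flow in option 2 would be wired like this (the directory and port name are illustrative assumptions; FetchFile can be left at its default "File to Fetch" value, which is built from the path and filename attributes that ListFile writes):

    ListFile                       (Execution: Primary node; Input Directory: /mnt/shared/incoming  <- assumed mount point)
        |
        v
    Remote Process Group           (pointing back at this same cluster's URL)
        |
        v
    Input Port on the root canvas  (e.g. "distribute-listings" -- the name is an assumption)
        |                          the listings are now spread across all the nodes
        v
    FetchFile                      (runs on every node, pulls the actual file content from the shared drive)
        |
        v
    ... the rest of the existing flow (e.g. ConvertJSONToAvro, PutHDFS)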

More details on the List+Fetch pattern are here:

https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html

8 REPLIES

Explorer

Two more questions came to my mind:

  1. Is it better to put my input files into HDFS first, as shown in your link, instead of on a traditional shared network drive?
  2. Can I use Site-to-Site transfer in the following flow? Network latency might be an issue here. Is this option not viable at all?
  • GetFile on the primary NiFi cluster node -> picks up new files on the primary node
  • RPG to input ports -> pushes the flowfiles from the primary node to all other nodes in the cluster
  • After that, proceed with my original data flow, as the data is now distributed across the NiFi cluster

Thanks Bryan!

Master Guru

1) It doesn't really matter; it just needs to be a shared location that all nodes can access.

2) You can do this, and it might work well for small files and small numbers of files, but typically the whole point is to perform the "fetch" in parallel, whereas here all the files have to be fetched on the primary node (GetFile) and then all of their contents have to be redistributed to the cluster, instead of just the listings.

Explorer

One more thing that I just read here:

"Note: that would be an ideal case in terms of balancing but, for efficiency purpose, the Site-to-Site mechanism might send batch of flow files to the remote node. In the above example, with only 3 flow files, I would probably not end up with one flow file per node."

Is this a problem when I need the entire content of my files present in one flowfile?

For example, I have JSON-formatted files that will be converted to Avro using the NiFi ConvertJSONToAvro processor. If the JSON file gets split by the Site-to-Site mechanism, I would get more than one output Avro file for each JSON input file, right? Is it possible to merge the content into one single Avro file again, for example with the MergeContent processor?

For more information: I might need the entire Avro file in one big file to process it with a Python script. The Python script will export the Avro file to another scientific format. Thanks again!
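
As an aside, a minimal sketch of the kind of export script described here (assuming the fastavro library, and using CSV purely as a stand-in for the unnamed scientific format):

    # Hypothetical post-processing script: read one merged Avro container file
    # and export its records. fastavro and the CSV output are assumptions; the
    # post above does not say which library or target format is actually used.
    import csv
    from fastavro import reader

    def avro_to_csv(avro_path: str, csv_path: str) -> None:
        """Stream records out of an Avro container file into a CSV file."""
        with open(avro_path, "rb") as avro_file, open(csv_path, "w", newline="") as out:
            writer = None
            for record in reader(avro_file):
                if writer is None:
                    writer = csv.DictWriter(out, fieldnames=list(record.keys()))
                    writer.writeheader()
                writer.writerow(record)

    if __name__ == "__main__":
        avro_to_csv("merged.avro", "export.csv")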

Master Guru

Site-to-Site does not do anything to the contents of your flow files; if you have 3 flow files, then it transfers 3 flow files. That statement is saying that Site-to-Site is optimized for a continuous flow of large amounts of data, so if you run a test with only 3 flow files, it will probably send all 3 flow files to just one of the nodes in your cluster, because there wasn't enough data to reach the point where it would start sending to the other nodes.

avatar

"Remote Process Group to distribute the listings to all the NiFi nodes" -> how are these files distributed? You have not mentioned any processor that does this. We need the configuration for this step; please explain.

Master Guru

I believe it was mentioned - "Remote Process Group to distribute the listings to all the NiFi nodes, then a FetchFile for each node to retrieve the listings."
