Created 06-21-2017 08:49 AM
I have a NiFi cluster and want the files that I store in a directory on one node to be distributed across the cluster for processing.
To keep the question short:
Thanks in advance
Created 06-21-2017 01:47 PM
There are two solutions that would work well here...
1) Have the import process distribute the files evenly to all the NiFi nodes, then each NiFi node doesn't have to worry about anything and just processes the files on the local file system of that node. I think this is what you meant in #3.
2) Mount a shared network drive on all the nodes, upload the files to the shared drive, then use ListFile running on the primary node only to list the remote directory, followed by a Remote Process Group to distribute the listings to all the NiFi nodes, then a FetchFile on each node to retrieve the files named in the listings it received.
More details on the List+Fetch pattern are here:
https://community.hortonworks.com/articles/16120/how-do-i-distribute-data-across-a-nifi-cluster.html
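To make the List+Fetch idea concrete, here is a minimal Python sketch of the pattern itself, not of NiFi's internals. The function names, the temporary directory standing in for the shared drive, and the strict round-robin assignment are all assumptions made for illustration; a real Remote Process Group does not guarantee an even split.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def list_files(shared_dir: Path) -> list[Path]:
    """'ListFile' step: runs once (primary node) and returns paths only."""
    return sorted(p for p in shared_dir.iterdir() if p.is_file())


def distribute(listings: list[Path], node_count: int) -> list[list[Path]]:
    """Stand-in for the Remote Process Group: hand out listings round-robin.

    Assumption for illustration only; NiFi does not promise an even split.
    """
    return [listings[i::node_count] for i in range(node_count)]


def fetch_files(assigned: list[Path]) -> dict[str, bytes]:
    """'FetchFile' step: a node reads only the files assigned to it."""
    return {p.name: p.read_bytes() for p in assigned}


def run_list_fetch(shared_dir: Path, node_count: int) -> list[dict[str, bytes]]:
    """List once, distribute the listings, then fetch in parallel."""
    listings = list_files(shared_dir)
    assignments = distribute(listings, node_count)
    # One worker thread per simulated node.
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        return list(pool.map(fetch_files, assignments))


def demo() -> list[dict[str, bytes]]:
    """Create 7 small files in a temp 'shared drive' and spread them over 3 nodes."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)
        for i in range(7):
            (shared / f"file_{i}.json").write_text(f'{{"id": {i}}}')
        return run_list_fetch(shared, node_count=3)


if __name__ == "__main__":
    for node, fetched in enumerate(demo()):
        print(f"node {node}: {sorted(fetched)}")
```

The point of the split is that only the small listings pass through a single node, while the file contents are read by all nodes in parallel.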
Created 06-23-2017 09:02 AM
Two more questions came to my mind:
Thanks Bryan!
Created 06-23-2017 01:43 PM
1) Doesn't really matter, it just needs to be a shared location that all nodes can access.
2) You can do this, and it might work well for small files and small numbers of files, but typically the whole point is to perform the "fetch" in parallel, whereas here all the files have to be fetched on the primary node (GetFile) and then all of their contents have to be redistributed to the cluster, instead of just the listings.
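As a rough back-of-the-envelope model of that difference, the sketch below estimates how many bytes the primary node has to handle under each approach. It is only an illustration: the function name, the even split across nodes, and the 200 bytes per listing are assumptions, not measured NiFi behavior.

```python
def primary_node_bytes(
    file_sizes: list[int], node_count: int, listing_bytes: int = 200
) -> dict[str, int]:
    """Estimate the bytes handled by the primary node under both approaches.

    Assumes files end up evenly spread over the nodes and that a listing
    costs about `listing_bytes` to ship (hypothetical figure).
    """
    total = sum(file_sizes)
    # GetFile: the primary reads every file, then ships the share that
    # belongs to the other nodes.
    shipped_content = total * (node_count - 1) // node_count
    get_file = total + shipped_content
    # ListFile + FetchFile: the primary reads only its own share and ships
    # nothing but the small listings.
    list_fetch = total // node_count + listing_bytes * len(file_sizes)
    return {"GetFile + redistribute": get_file, "ListFile + FetchFile": list_fetch}


if __name__ == "__main__":
    # 30 files of 1 MB each on a 3-node cluster.
    for approach, handled in primary_node_bytes([1_000_000] * 30, 3).items():
        print(f"{approach}: {handled:,} bytes on the primary node")
```

Under these assumptions the gap grows with file size, which is why GetFile on the primary node is only reasonable for small inputs.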
Created 06-23-2017 09:25 AM
One more thing that I just read here:
"Note: that would be an ideal case in terms of balancing but, for efficiency purpose, the Site-to-Site mechanism might send batch of flow files to the remote node. In the above example, with only 3 flow files, I would probably not end up with one flow file per node."
Is this a problem when I need the entire content of my files present in one flowfile?
For example, I have JSON-formatted files that will be converted to Avro by using the NiFi ConvertJSONToAvro processor. If a JSON file is split by the Site-to-Site mechanism, I would get more than one output Avro file for each JSON input file, right? Is it possible to merge the content into one single Avro file again? For example, with the MergeContent processor.
For more information: I might need the entire Avro content in one big file to process it with a Python script. The Python script will export the Avro file to another scientific format. Thanks again!
Created 06-23-2017 01:46 PM
Site-To-Site does not do anything to the contents of your flow files; if you have 3 flow files, then it transfers 3 flow files. That statement is saying that site-to-site is optimized for a continuous flow of large amounts of data, so if you run a test with 3 flow files, it will probably send all 3 flow files to only one of the nodes in your cluster, because that wasn't enough data to reach the point where it would start sending to the other nodes.
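A toy model may help picture this. The sketch below is not NiFi's actual site-to-site algorithm; the function name, the batch size of 100, and the simple rotation between nodes are assumptions chosen only to show that whole flow files are sent in batches and are never split.

```python
from itertools import cycle


def send_in_batches(
    flow_files: list[str], nodes: list[str], batch_size: int
) -> dict[str, list[str]]:
    """Send whole flow files, `batch_size` at a time, rotating nodes per batch.

    Toy model only: the batch size and rotation are illustrative assumptions.
    """
    received: dict[str, list[str]] = {node: [] for node in nodes}
    targets = cycle(nodes)
    for start in range(0, len(flow_files), batch_size):
        # Each batch goes to a single node; flow files stay intact.
        received[next(targets)].extend(flow_files[start:start + batch_size])
    return received


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    small = send_in_batches([f"ff{i}" for i in range(3)], nodes, batch_size=100)
    large = send_in_batches([f"ff{i}" for i in range(300)], nodes, batch_size=100)
    print({node: len(files) for node, files in small.items()})
    print({node: len(files) for node, files in large.items()})
```

With only 3 flow files the single batch lands on one node, while a larger, continuous stream spreads out. In both cases every flow file arrives exactly once and unmodified.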
Created 09-27-2017 11:22 AM
"Remote Process Group to distribute the listings to all the NiFi nodes" -> How are these files distributed? You have not mentioned any process to do this. We need the configuration for this; please provide it.
Created 09-27-2017 01:12 PM
I believe it was mentioned - "Remote Process Group to distribute the listings to all the NiFi nodes, then a FetchFile for each node to retrieve the listings."