File Duplication in Multi-Node Cluster


I am working with a 3-node NiFi cluster. The flow is kicked off by a GenerateFlowFile processor running on the primary node, performs some NiFi processing, and then writes the files to the server, where I then run an ExecuteStreamCommand Python script on them. The problem I'm running into is that I can't figure out a way to ensure that the processors picking up the first output run on the same node as the processors that produced it.

  1. What is the best way to handle producing files that can be accessed by all nodes?
  2. Is there a way to specify the node that a process will run on? (Using "run on primary" is not working, as the primary node role cycles between nodes.)
1 ACCEPTED SOLUTION

Super Mentor

@TRSS_Cloudera 

Your use case is not completely clear to me.

Each node in a NiFi cluster executes its own copy of the dataflow against its own set of FlowFiles (FlowFiles are what NiFi components execute upon). NiFi components can be processors, controller services, reporting tasks, input/output ports, remote process groups (RPGs), etc. Each node maintains its own set of repositories. Two of those repositories (flowfile_repository and content_repository) hold the parts that make up a FlowFile.
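
For reference, each node's repository locations are set in its own nifi.properties file; a minimal sketch of the relevant entries (the paths shown are the shipped defaults and may differ in your install):

    # Per-node repository locations (defaults shown; each node keeps its own copies)
    nifi.flowfile.repository.directory=./flowfile_repository
    nifi.content.repository.directory.default=./content_repository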

In a NiFi cluster, one node will always be elected as the Cluster Coordinator and one as the Primary Node (sometimes a single node is elected to both roles). Which node holds either role can change at any time.

The GenerateFlowFile processor you have configured to execute on the "Primary Node" only will produce FlowFile(s) only on the currently elected primary node. Your description does not cover how your dataflow writes the files to the server on which you then run an ExecuteStreamCommand Python script.
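
For context, when ExecuteStreamCommand runs your command, it streams the FlowFile's content to the command's STDIN (unless "Ignore STDIN" is set) and captures the command's STDOUT as the content of the FlowFile routed to the "output stream" relationship. Because the command runs as a local process, it always executes on the node that currently holds the FlowFile. A minimal sketch of such a script (the upper-casing is only a placeholder transformation):

    #!/usr/bin/env python3
    # Minimal sketch of a script invoked by ExecuteStreamCommand.
    # NiFi streams the FlowFile content to this script's stdin; whatever
    # the script writes to stdout becomes the content of the FlowFile
    # routed to the "output stream" relationship.
    import sys

    def main():
        # Read the incoming FlowFile content from stdin as raw bytes.
        data = sys.stdin.buffer.read()

        # Placeholder transformation -- replace with your real processing.
        result = data.upper()

        # Write the processed bytes to stdout for NiFi to capture.
        sys.stdout.buffer.write(result)

    if __name__ == "__main__":
        main()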
 

  1. What is the best way to handle producing files that can be accessed by all nodes?
    Answer: Since each node operates on its own FlowFiles, one node will not have access to FlowFiles on the other nodes. A clearer use case as to why you would want every node processing the same FlowFile might be helpful here.
  2. Is there a way to specify the node that a process will run on? (Using "run on primary" is not working, as the primary node role cycles between nodes.)
    Answer: Only processors responsible for creating FlowFiles should ever be scheduled to execute on the "Primary Node". Any processor that accepts an inbound connection should always be scheduled to execute on all nodes. So if Node A is the current Primary Node and a FlowFile is produced by a processor configured for primary-node-only execution, that FlowFile will still be processed downstream in the dataflow even if a primary node change happens (see the sketch below).
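
To make that scheduling pattern concrete, here is a sketch of a typical layout (the processor names are only examples):

    GenerateFlowFile       -> Execution: Primary Node  (creates the FlowFile)
           |
           v
    UpdateAttribute        -> Execution: All Nodes     (has an inbound connection)
           |
           v
    ExecuteStreamCommand   -> Execution: All Nodes     (runs on whichever node holds the FlowFile)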

 

If you found this response helpful, please take a moment to log in and click "Accept as Solution" below this post.

Thank you,

Matt


2 REPLIES

Community Manager

@TRSS_Cloudera Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.  



Regards,

Vidya Sargur,
Community Manager

