We have set up a NiFi cluster with an NCM and two slave nodes. We created a simple flow in the NCM GUI to pull files from the local file system of the NCM and, after updating some attributes, load them into the local file system of one of the slave nodes. The GetFile processor is unable to recognize the input directory, which refers to a path in the file system of the NCM. Is it possible to set the 'Input Directory' property in GetFile so that it points to a specific path in the local file system of either the NCM or the slave nodes?
The purpose of the NCM is solely to coordinate the flows on the cluster nodes. The nodes then run the flows, so anything in those flows is evaluated strictly in the context of the node it runs on. In other words, the Input Directory in GetFile will always refer to the local filesystem on the 'slave' nodes, never on the NCM. You may want to consider using a file share mounted at the same path on both nodes for this.
One other thing to note: if you need a single point of ingest, you can schedule GetFile to run only on the primary node. This does mean, however, that all processing happens on that one node, which is probably undesirable. A better model is to run ListFile on the primary node against a shared directory, then use site-to-site back to the same cluster to load-balance the listings across the nodes. Each node then runs a FetchFile processor, which continues the flow by hydrating the flowfiles with the content of the files from the shared spooling directory, followed by whatever other processing is required.
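To make the list-then-fetch idea concrete, here is a rough conceptual sketch in plain Python. It is not NiFi code and does not use any NiFi API: the function names (`list_files`, `distribute`, `fetch_file`), the node names, and the round-robin distribution are all assumptions made up for illustration. A temporary directory stands in for the shared mount.

```python
import itertools
import os
import tempfile
from pathlib import Path


def list_files(shared_dir):
    """Primary-node step (hypothetical): emit one path per file, no content."""
    return sorted(str(p) for p in Path(shared_dir).iterdir() if p.is_file())


def distribute(listings, node_names):
    """Round-robin the listings across nodes.

    This is only a stand-in for the load balancing that site-to-site
    would do; the real distribution strategy may differ.
    """
    assignments = {name: [] for name in node_names}
    for path, name in zip(listings, itertools.cycle(node_names)):
        assignments[name].append(path)
    return assignments


def fetch_file(path):
    """Worker-node step (hypothetical): read the content from the shared directory."""
    with open(path, "rb") as handle:
        return handle.read()


if __name__ == "__main__":
    # A temporary directory plays the role of the shared spooling directory.
    with tempfile.TemporaryDirectory() as shared_dir:
        for i in range(4):
            Path(shared_dir, f"file{i}.txt").write_bytes(f"payload {i}".encode())

        work = distribute(list_files(shared_dir), ["node-1", "node-2"])
        for node, paths in work.items():
            for path in paths:
                print(node, os.path.basename(path), fetch_file(path))
```

The point of the split is that only the cheap listing step is pinned to one node, while the expensive content reads are spread over the cluster. It only works if every node can see the same directory at the same path.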
Thanks @Simon Elliston Ball for the input. I have a couple of queries regarding the second model (using ListFile + FetchFile) that you suggested:
You can think of the NCM as the command and control for all the connected nodes in your cluster. The NCM itself does not process any data or run any processors in your dataflow. When you added the GetFile processor to the canvas via the NCM UI, the NCM's job was to pass that request on to every node. Once started, the GetFile processor runs only on the nodes, and each node checks only its own local file system for files in the configured directory. Because of this division of responsibilities, the NCM's hardware requirements are much lighter than those of your nodes. The NCM never writes any data to the content, flowfile, or provenance repositories as your nodes do, so very little hard drive space is needed. Since it is not running any processors, the NCM's CPU needs are lower than those of your nodes. The NCM does still need a good-sized heap to retain the component state reported to it by the nodes via heartbeats. Since the resource requirements for the NCM are light, it is not uncommon to see a node installed on the same server that hosts the NCM. NiFi 1.0 (HDF 2.0) will introduce a new framework with many enhancements. One of these changes is zero-master clustering, which eliminates the need for an NCM.