Member since: 07-30-2019 | Posts: 3466 | Kudos Received: 1641 | Solutions: 1015
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 401 | 03-23-2026 05:44 AM |
| | 308 | 02-18-2026 09:59 AM |
| | 558 | 01-27-2026 12:46 PM |
| | 979 | 01-20-2026 05:42 AM |
| | 1291 | 01-13-2026 11:14 AM |
04-10-2019
06:13 PM
1 Kudo
@Kevin Lahey

1. Each NiFi node in a cluster runs its own copy of the flow.xml and processes its own set of FlowFiles. Nodes are unaware of what FlowFiles exist on other nodes in the cluster.

2. In much older versions of NiFi (Apache 0.x), there was no high availability at the control level within a cluster. A dedicated NiFi instance known as the NiFi Cluster Manager (NCM) was the only instance in the cluster that could be accessed, and all the nodes connected to it. If the NCM went down, the entire NiFi cluster was unreachable. As of Apache NiFi 1.x, the NCM no longer exists; the cluster relies on ZooKeeper to elect one of the cluster nodes to fill the roles of Cluster Coordinator and Primary Node. If a currently elected node goes down, a new node is elected to these roles. In this way, HA at the control level is provided. When you create any component (processor, controller service, reporting task, etc.), that component is replicated to all nodes in the cluster. So yes, the DistributedMapCacheServer controller service would be running on all nodes. If you then configured the DistributedMapCacheClient to use "localhost", each node would be reading from and writing to a different cache server. The DistributedMapCacheClient should instead be configured to point at one specific node rather than localhost (see the sketch below). As you can see, you have no HA in this type of setup, since you are dependent on the one node hosting the cache server always being up. Instead you should use one of the external cache options, like HBase, in order to have HA.

3. As explained above, there is no such thing as an NCM as of Apache NiFi 1.x.

4. Every component you add to the NiFi canvas runs within a single JVM on each NiFi node, so you cannot configure multiple components that bind to the same port. The first component will bind to the port, and when the other components are started they will throw an exception about the port already being in use. You can have as many clients (DistributedMapCacheClient) as you like, since they act as clients and do not bind to a port. Only the server binds to the port so it can listen for client requests.
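To illustrate point 2, here is a minimal sketch of the two controller services involved (the property names are the standard ones from Apache NiFi 1.x; the hostname is a placeholder and 4557 is the default port, so adjust both for your environment):

```
DistributedMapCacheServer         (replicated to every node, but only one is used)
  Port: 4557

DistributedMapCacheClientService  (identical on every node; all point at the SAME host)
  Server Hostname: nifi-node-1.example.com   <-- one specific node, never localhost
  Server Port: 4557
```

Because every node's client targets the same hostname, all nodes share a single cache, which is what detect-duplicate style flows require.

Hope this helps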
04-09-2019
02:22 PM
@Abhinav Joshi

*** Community Forum Tip: Try to avoid starting a new answer in response to an existing answer. Instead, use comments to respond to existing answers. There is no guaranteed order to different answers, which can make it hard to follow a discussion.

I would suggest searching the nifi/work directory for multiple versions of the update-attribute nar bundle; you may have multiple nars of different versions installed. The flow.xml.gz file contains the specific processor version for each component. When starting NiFi 1.9 using the flow.xml.gz from another NiFi version, the component versions will automatically be updated to the new version only if a single option exists. If you have both an updateAttribute-1.8.<custom> and an updateAttribute-1.9.0 version available and the flow.xml.gz references updateAttribute-1.8.0, it will not auto-update, because there are two options and NiFi does not know which should be used.

My guess is that your NiFi 1.8.0 contained both the standard 1.8.0 version of the updateAttribute processor and a custom version of it, and your flow contained updateAttribute components of each. You then upgraded to NiFi 1.9.0, which replaced the stock updateAttribute with 1.9.0, and the custom version of the UpdateAttribute processor was also carried over to your NiFi 1.9.0 install.
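If it helps, here is a hedged sketch of that search in Python. It assumes a default NiFi layout under /opt/nifi and the stock artifact name nifi-update-attribute-nar (a custom nar would carry its own artifact id), so adjust both for your install:

```python
import gzip
import re
from pathlib import Path

# Assumption: NiFi is installed under /opt/nifi with the default directory layout.
NIFI_HOME = Path("/opt/nifi")

# 1. Look for multiple unpacked versions of the update-attribute nar under work/nar.
for nar in sorted(NIFI_HOME.glob("work/nar/**/*update-attribute*")):
    print("found:", nar)

# 2. List the update-attribute bundle versions that flow.xml.gz actually references.
with gzip.open(NIFI_HOME / "conf" / "flow.xml.gz", "rt") as fh:
    flow = fh.read()

versions = set(
    re.findall(
        r"<artifact>nifi-update-attribute-nar</artifact>\s*<version>([^<]+)</version>",
        flow,
    )
)
print("bundle versions referenced in flow.xml.gz:", versions or "none")
```

If the first step prints more than one version while the second shows an older one, that matches the "two options, no auto-update" situation described above.

Thanks, Matt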
04-09-2019
01:48 PM
@Kevin Lahey

I completely agree with @Shu. It sounds like you have the ListS3 processor executing on all 4 nodes in your NiFi cluster. This results in each NiFi node listing the same filename, which means each node then tries to look up that filename in the distributed cache used by the DetectDuplicate processor. This creates a bit of a race condition between your nodes, where one or more nodes fail to find an entry in the cache before one of the nodes adds the new filename to it.

Your flow should be running the ListS3 processor with its success relationship feeding a FetchS3 processor, and the connection between those two processors should be configured to load balance the listed files across all nodes in the cluster, as sketched below.
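One common way to wire this up looks like the following ("Primary node" execution ensures only one node performs the listing; load-balanced connections require Apache NiFi 1.8 or newer):

```
ListS3
  Scheduling > Execution: Primary node     (one node performs the listing)
     |
     |  success relationship, connection configured with
     |  Load Balance Strategy: Round robin
     v
FetchS3Object                              (all nodes share the fetch work)
```

Thanks, Matt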
04-08-2019
08:15 PM
@Abhinav Joshi

A processor will appear with a dashed line around the perimeter of the processor box for a couple of reasons:

1. It is a ghost implementation: when loading the flow.xml.gz, a processor was encountered that uses a custom nar which does not exist in the current NiFi installation.
2. The user logged into the canvas does not have the required permissions to view the component.

Based on your description, it does not sound like scenario 2. The question is why NiFi 1.9 did not find a processor that used the same processor class. Since you upgraded from NiFi 1.8, the UpdateAttribute processor should have referenced the org.apache.nifi.processors.attributes.UpdateAttribute class for version 1.8.x. During an upgrade of NiFi, the same identical class would have been found, just with a newer version, and NiFi would have automatically switched to using that new version.

So this raises two questions:

1. Did you have a customized version of the UpdateAttribute processor running in NiFi 1.8?
2. Do you have multiple copies of the same processor class, but with different versions, loaded in your NiFi 1.9?

I would suggest inspecting the nifi-app.log from the startup immediately following the upgrade. If the UpdateAttribute processor was replaced by a "ghost" processor, you would see that logged in the nifi-app.log. This log should help show why NiFi chose to load a ghost processor.
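A quick, hedged way to do that inspection is a loose scan of the nifi-app.log. The exact log wording differs between NiFi versions, so this matches broadly rather than on one specific message, and the path is an assumption for a default install:

```python
from pathlib import Path

# Assumption: default log location for a NiFi install under /opt/nifi.
LOG = Path("/opt/nifi/logs/nifi-app.log")

for line in LOG.read_text(errors="replace").splitlines():
    lowered = line.lower()
    # Match loosely: the class name plus anything suggesting a bundle/ghost problem.
    if "updateattribute" in lowered and any(
        word in lowered for word in ("bundle", "ghost", "missing", "unable")
    ):
        print(line)
```

Thank you, Matt

If you found this answer addressed your question, please take a moment to log in and click the "ACCEPT" link.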
04-08-2019
01:37 PM
@Abhinav Joshi What was the reason given for why the UpdateAttribute processors were now invalid?
04-01-2019
03:56 PM
@Isha Tiwari

In order to merge FlowFiles that exist on multiple nodes in your cluster, you are going to need to move all FlowFiles to one node. Apache NiFi 1.9.x introduced a new "Load Balanced" configuration option on dataflow connections. One of the options for the configurable "Load Balance Strategy" is "Single node". Setting this strategy will route all queued FlowFiles to one node in the cluster. You could set this on the connection feeding your Merge processor.

In Apache NiFi 1.8 and older, you would need to use the PostHTTP processor (configured to send as FlowFile) to send all FlowFiles to a ListenHTTP processor running at one of your nodes' URLs (the processor will run on all nodes, but your PostHTTP will only be configured with the URL for one node). The problem with this solution is that if the server at the target URL goes down, your dataflow will stop working.
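For reference, the relevant setting lives in the connection's configuration dialog (a sketch, NiFi 1.9+):

```
Connection feeding the Merge processor
  Load Balance Strategy: Single node    -- all queued FlowFiles are routed to one node
```

Thank you, Matt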
03-29-2019
06:13 PM
@Isha Tiwari

Did you change your Max Bin Age setting to a value higher than 1 minute? Try setting it to 10 minutes. Is your NiFi a standalone instance or a NiFi cluster? Keep in mind that each node in a NiFi cluster runs its own copy of the flow.xml.gz and works on its own set of FlowFiles, so the merge processor can only bin and merge the FlowFiles local to each node.

Thanks, Matt
03-28-2019
05:05 PM
3 Kudos
@Isha Tiwari

The "Max" configuration properties do not govern how long a bin waits to be merged. The merge-based processors work as follows:

1. The processor executes based upon the configured Run Schedule.
2. At the exact time of execution, the merge processor looks at what FlowFiles exist on the inbound connection that have not already been allocated to a bin.
3. Those FlowFiles are then allocated to one or more bins. The max bin size and max number of records create a ceiling for how many FlowFiles can be allocated to a bin. If a bin has reached one of these max values, additional FlowFiles in the current execution start getting allocated to a new bin.
4. Once all FlowFiles from the current execution have been allocated to one or more bins (the thread does not keep checking for new FlowFiles arriving on the inbound connection; those would be handled by the next execution), the bins are evaluated to see if they are eligible to be merged. To be eligible, a bin must meet both minimum settings (size and number of records), or the max bin age must have been reached. In your case, a bin could be merged with only 20 records and 20 KB of size, or once it has existed for at least 1 minute.

If you find you are consistently merging small bins, changing the run schedule on your merge processor should help. This would allow more time between executions for FlowFiles to queue on the inbound connection.

IMPORTANT: Keep in mind that all FlowFiles allocated to bins are being held in heap memory (swapping does not occur with bins). Specifically, the FlowFile attributes/metadata is the portion of the FlowFile held in heap. Your max of 100,000 records could result in considerable heap pressure. Using two merge processors in series could achieve the same result with lower heap usage, as sketched below.

I use MergeContent as an example in the following article about connection queues: https://community.hortonworks.com/articles/184990/dissecting-the-nifi-connection-heap-usage-and-perf.html
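As a hedged illustration of the two-processors-in-series idea (the numbers are made up; only the shape matters), each stage bins a smaller batch, so far fewer FlowFile attributes sit in heap at any one time while the final output still reaches 100,000 records (1,000 x 100):

```
MergeContent #1                         MergeContent #2
  Minimum Number of Entries: 1000   -->   Minimum Number of Entries: 100
  Maximum Number of Entries: 1000         Maximum Number of Entries: 100
  Max Bin Age: 2 min                      Max Bin Age: 5 min
```

Thank you, Matt

If you found this answer addressed your question, please take a moment to log in and click the "ACCEPT" link.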
03-22-2019
08:58 PM
@Mario Tigua

The File Filter property in the ListFile processor does not support NiFi Expression Language. If you float your cursor over the question mark icon to the right of a property name, a pop-up window will tell you whether that property supports NiFi Expression Language.

This property expects a Java regular expression instead.
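For example (a hypothetical pattern), to list only CSV files whose names start with "data_", the File Filter would be a Java regular expression such as:

```
data_.*\.csv
```

Thank you, Matt

If you found this answer addressed your question, please take a moment to log in and click the "ACCEPT" link.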
03-22-2019
12:36 PM
@sri chaturvedi

The question here is: what is the use case for needing a sequence number across the entire cluster? Why not generate a sequential number per node to keep track of per-node batches, perhaps using the NiFi hostname in the sequence number identifier?

Maybe you can share some more details on the full use case for these sequence numbers: why you are generating them and what they are being used for.

If you use the DistributedMapCache, you could keep three different sequential-number cached values (each node has its own sequence number stored in a cache entry keyed by hostname). You could then build a flow that fetches all three values and adds them together for you on an hourly/daily/weekly schedule.
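A minimal sketch of the per-node keying, assuming the standard hostname() NiFi Expression Language function (hostname(true) returns the fully qualified name):

```
FetchDistributedMapCache / PutDistributedMapCache
  Cache Entry Identifier: sequence-${hostname(true)}   -- one counter per node
```

Thank you, Matt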