Member since: 07-30-2019
Posts: 3391
Kudos Received: 1618
Solutions: 999
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 245 | 11-05-2025 11:01 AM |
| | 477 | 10-20-2025 06:29 AM |
| | 617 | 10-10-2025 08:03 AM |
| | 400 | 10-08-2025 10:52 AM |
| | 456 | 10-08-2025 10:36 AM |
05-21-2024
12:57 AM
1 Kudo
How to Resolve SNI issue when upgrading to NiFi 2.0 https://medium.com/@chnzhoujun/how-to-resolve-sni-issue-when-upgrading-to-nifi-2-0-907e07d465c5#:~:text=Due%20to%20the%20upgrade%20to,400%3A%20Invalid%20SNI%20will%20occur
05-20-2024
02:18 PM
2 Kudos
@manishg How many CPU cores does each of your NiFi hosts have? A load average of 1 means you are using 100% of 1 CPU on average; 20 means you are using 100% of 20 cores on average, and so on. So let's say your node has 8 cores but your load average is higher than 8: your CPU is saturated and is being asked to perform more work than it can handle efficiently. This leads to long thread execution times and can interfere with timely heartbeats being sent by nodes or processed by the elected cluster coordinator.

Often this is triggered by too many concurrent tasks on high-CPU-usage processors, high FlowFile volume, etc. You can ultimately design a dataflow that simply needs more CPU than you have to achieve the throughput you need. Users commonly just keep configuring more and more concurrent tasks and set the Max Timer Driven thread pool far too high for the number of cores available on a node. This allows more threads to execute concurrently, but it just results in each thread taking longer to complete as its time is sliced on the CPU: thread 1 gets some time on CPU 1 and then goes into a timed wait as another thread gets some time; eventually thread 1 gets a bit more time. For millisecond-duration threads that is not a big deal, but for CPU-intensive processors it can cause issues. Let's say you have numerous CPU-intensive threads executing at the same time when the heartbeat is scheduled; the heartbeat thread is now waiting in line for time on the CPU.

Sometimes an alternate dataflow design that uses less CPU can be used. Sometimes you can add more nodes. Sometimes you need to move some dataflows to a different cluster. Sometimes you just need more CPU.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
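As a rough sketch of the check described above, you can compare load average against core count with Python's standard library (`os.getloadavg` is Unix-only; treating "5-minute load above core count" as saturation is my own simplification, not a NiFi rule):

```python
import os

def cpu_saturation_report():
    """Compare the 1/5/15-minute load averages against the core count.

    A load average persistently above the number of cores means threads
    are queuing for CPU time, which can delay NiFi heartbeat threads.
    """
    cores = os.cpu_count()
    load_1m, load_5m, load_15m = os.getloadavg()
    return {
        "cores": cores,
        "load_1m": load_1m,
        "load_5m": load_5m,
        "load_15m": load_15m,
        "saturated": load_5m > cores,  # simplistic heuristic
    }

report = cpu_saturation_report()
print(report)
```

On a healthy node you would expect `saturated` to be `False`; sustained `True` is the situation described above where heartbeat threads wait in line for CPU time.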
05-20-2024
02:01 PM
1 Kudo
Hello @hegdemahendra It is always very helpful if you include the exact version of Apache NiFi, Cloudera HDF, or Cloudera CFM being used. My guess here would be one or both of the following:

1. You have multiple FlowFiles, all pointing at the same content claims, queued in connections within your dataflow(s) on the canvas. As long as a FlowFile exists on the canvas it will exist in the flowfile_repository. Users should avoid leaving FlowFiles queued in connections in NiFi. Some users tend to allow FlowFiles to accumulate at stopped processor components rather than auto-terminate them. Even if a FlowFile does not have any content, its FlowFile attributes/metadata still consume disk space.
2. You are extracting content from your FlowFiles into FlowFile attributes, resulting in large FlowFile attributes/metadata being stored in the flowfile_repository. Dataflow designers should avoid extracting large amounts of FlowFile content into the FlowFile's attributes. Instead, try to build dataflows and use components that read from the FlowFile's content rather than from FlowFile attributes.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
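To illustrate the second point, here is a rough hypothetical sketch (the payload, attribute names, and `attribute_footprint` helper are all made up for illustration) of why copying content into an attribute bloats the flowfile_repository:

```python
import json

# Hypothetical illustration: the same 1 MB payload handled two ways.
payload = "x" * 1_000_000

# Good: content stays in the content repository; attributes hold only metadata.
flowfile_good = {"attributes": {"filename": "data.csv", "size": str(len(payload))}}

# Bad: content copied into an attribute, so it is rewritten to the
# flowfile_repository with every repository update of this FlowFile.
flowfile_bad = {"attributes": {"filename": "data.csv", "payload": payload}}

def attribute_footprint(flowfile):
    """Approximate bytes the FlowFile's attributes would occupy on disk."""
    return len(json.dumps(flowfile["attributes"]).encode())

print(attribute_footprint(flowfile_good))  # a few dozen bytes of metadata
print(attribute_footprint(flowfile_bad))   # over a megabyte of metadata
```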
05-20-2024
01:38 PM
@galt @RAGHUY Let me add some correction/clarity to the accepted solution, which said: "Export and Modify Flow Configuration: Export the NiFi flow configuration, typically in XML format. This can be done via the NiFi UI or by utilizing NiFi's REST API. Then, manually adjust the XML to change the ID of the connection to the desired value."

It is not clear here what is being done. The only way to export a flow configuration from NiFi in XML format is by generating a NiFi template (deprecated, and removed in Apache NiFi 2.x versions). Even if you were to generate a template and export it via the NiFi UI or NiFi's REST API, modifying it would not change what is on the canvas. If you were to modify the connection component UUID in all places in the template and upload that template back into NiFi, you would need to drop the template onto the canvas, which would result in every component in that template getting a new UUID. So this does not work. Newer versions of NiFi (1.18+) support flow definitions, which are in JSON format, but the same issue persists when using flow definitions in this manner.

In a scenario like the one described in this post, where a user removed a connection by mistake and then re-created it, the best option is to restore/revert the previous flow. Whenever a change is made to the canvas, NiFi automatically archives the current flow.xml.gz (legacy) and flow.json.gz (current) files into an archive sub-directory and generates new flow.xml.gz/flow.json.gz files. The best and safest approach is to shut down all nodes in your NiFi cluster, navigate to the NiFi conf directory, and swap the current flow.xml.gz/flow.json.gz files with the archived ones that still contain the connection with the original needed ID.

When the above is not possible (maybe the change went unnoticed for too long and all archived versions have the new connection UUID), you need to manually modify the flow.xml.gz/flow.json.gz files. Shut down all your NiFi nodes to avoid any changes being made on the canvas while performing the following steps.

Option 1:
1. Make a backup of the current flow.xml.gz and flow.json.gz.
2. Search each file for the original UUID to make sure it does not exist.
3. On one node, manually modify the flow.xml.gz and flow.json.gz files by locating the current bad UUID and replacing it with the original needed UUID.
4. Copy the modified flow.xml.gz and flow.json.gz files to all nodes in the cluster, replacing the original files. This is possible since all nodes run the same version of the flow.

Option 2:
1.-3. Same as Option 1.
4. Start NiFi only on the node where you modified the flow.xml.gz and flow.json.gz files.
5. On all other nodes, while they are still stopped, remove or rename the flow.xml.gz and flow.json.gz files.
6. Start all the remaining nodes. Since they do not have a flow.xml.gz or flow.json.gz to load, they will inherit the flow from the cluster as they join.

NOTE: The flow.xml.gz was replaced by the newer flow.json.gz format starting with Apache NiFi 1.16. When NiFi 1.16 or newer is started with only a flow.xml.gz file, it will load from flow.xml.gz and then generate the new flow.json.gz format. Apache NiFi 1.16+ will load only from the flow.json.gz on startup when that file exists, but will still write out both the flow.xml.gz and flow.json.gz formats anytime a change is made to the canvas. With Apache NiFi 2.x, the flow.xml.gz format goes away.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
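A minimal sketch of the manual-edit step in Option 1, assuming hypothetical paths and UUIDs (`FLOW`, `BAD_UUID`, and `GOOD_UUID` are placeholders you would substitute; NiFi must be fully stopped on every node before touching these files):

```python
import gzip
import shutil

# Hypothetical values for illustration only.
FLOW = "conf/flow.json.gz"
BAD_UUID = "aaaaaaaa-1111-2222-3333-444444444444"   # current (wrong) connection id
GOOD_UUID = "bbbbbbbb-5555-6666-7777-888888888888"  # original id to restore

def replace_uuid(path, old, new):
    """Back up a gzipped flow file, then swap one UUID for another."""
    shutil.copy(path, path + ".bak")            # step 1: backup first
    with gzip.open(path, "rt", encoding="utf-8") as f:
        text = f.read()
    if new in text:                             # step 2: original id must be absent
        raise RuntimeError(f"{new} already present in {path}; aborting")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(text.replace(old, new))         # step 3: replace every occurrence
```

You would run the same replacement against both flow.xml.gz and flow.json.gz, then copy the modified files to the other nodes as described in step 4.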
05-20-2024
12:36 PM
@SAMSAL This is not a new problem, but rather something that has existed with NiFi on Windows for a very long time. You'll need to avoid using spaces in directory names, or wrap that directory name in quotes to avoid the issue. NIFI-200 - Bootstrap loader doesn't handle directories with spaces in it on Windows Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
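A quick illustration of why the unquoted space breaks things: naive word-splitting turns one path into two arguments, so the loader sees a path that stops at "C:\Program". The install path below is hypothetical:

```python
# Hypothetical NiFi install path containing a space, the situation that
# trips up the bootstrap loader on Windows.
nifi_dir = r"C:\Program Files\nifi-1.25.0"

# Fragile: an unquoted command line is split on whitespace, so the
# path fractures into two tokens at the space.
fragile = f"{nifi_dir}\\bin\\nifi.cmd start".split()

# Safe: keep the path as a single argument (equivalent to wrapping it
# in quotes on the command line, or passing an argument list to the OS).
safe = [f"{nifi_dir}\\bin\\nifi.cmd", "start"]

print(fragile)  # three tokens: the path is broken at the space
print(safe)     # two tokens: path intact, then the "start" argument
```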
05-07-2024
01:30 AM
1 Kudo
Thanks @MattWho for the quick response. The short version of "why it doesn't work" is that NiFi keeps working and recognizes the addition of the NARs, but no "1.24" versions of components are available in the Processor selection window. I promptly tried the "M1" download workaround and found this as soon as I wget'ed the NAR (text version of the log below):

INFO Found ./extensions/nifi-standard-shared-nar-2.0.0-M1.nar in auto-load directory
INFO Starting load process for 1 NARs...
INFO Creating class loaders for 1 NARs...
WARN Unable to resolve required dependency 'nifi-standard-services-api-nar'. Skipping NAR '/usr/local/lib/nifi-2.0.0-M2/./work/nar/extensions/nifi-standard-shared-nar-2.0.0-M1.nar-unpacked'
INFO Successfully created class loaders for 0 NARs, 1 were skipped
INFO Finished NAR loading process!

so, since it asks for nifi-standard-services-api, I'll download it also (text version):

INFO Found ./extensions/nifi-standard-services-api-nar-2.0.0-M1.nar in auto-load directory
INFO Starting load process for 1 NARs...
INFO Including 1 previously skipped bundle(s)
INFO Creating class loaders for 2 NARs...
WARN While loading 'org.apache.nifi:nifi-standard-services-api-nar:2.0.0-M1' unable to locate exact NAR dependency 'org.apache.nifi:nifi-jetty-bundle:2.0.0-M1'. Only found one possible match 'org.apache.nifi:nifi-jetty-bundle:2.0.0-M2'. Continuing...
INFO Loaded NAR file: /usr/local/lib/nifi-2.0.0-M2/./work/nar/extensions/nifi-standard-services-api-nar-2.0.0-M1.nar-unpacked as class loader org.apache.nifi.nar.NarClassLoader[./work/nar/extensions/nifi-standard-services-api-nar-2.0.0-M1.nar-unpacked]
INFO Loaded NAR file: /usr/local/lib/nifi-2.0.0-M2/./work/nar/extensions/nifi-standard-shared-nar-2.0.0-M1.nar-unpacked as class loader org.apache.nifi.nar.NarClassLoader[./work/nar/extensions/nifi-standard-shared-nar-2.0.0-M1.nar-unpacked]
INFO Successfully created class loaders for 2 NARs, 0 were skipped

EDIT: I also downloaded the jetty package, but:

INFO Starting load process for 1 NARs...
ERROR Found a Jetty NAR, will not auto-load /usr/local/lib/nifi-2.0.0-M2/./extensions/nifi-jetty-bundle-2.0.0-M1.nar
INFO No NARs were unpacked, nothing to do

so, it looks like it's acknowledging the NARs I provide, but the new components still do not appear after reloading the interface. You've been very helpful (and quick) providing the causes and several workarounds (I'll go the "build your URL on the fly" way), so thank you a lot and my double "accept as solution" click 🙂
05-03-2024
08:41 AM
2 Kudos
@manishg The elected cluster coordinator (elected via ZooKeeper) is responsible for receiving and processing heartbeats from the other nodes in the cluster. It handles the connecting, reconnecting, and manual disconnecting of NiFi nodes. The cluster coordinator is also responsible for replicating user requests to all nodes in the cluster and getting confirmation from those nodes that the request was completed successfully.

Assume a 3-node cluster with the following:
node1 - elected cluster coordinator
node2 - elected primary node
node3

Role of the Cluster Coordinator:
A user can access the NiFi cluster via any of the 3 nodes' URLs. So let's say a user logs into node3's UI. When that user interacts with node3's UI, the request is proxied to the currently elected cluster coordinator node, which in turn replicates the request to all 3 nodes (examples: add a processor, configure a processor, empty a queue, etc.). If one of the nodes were to fail to complete the request, that node would get disconnected. It may attempt to auto-reconnect later. (In newer versions of NiFi, a connecting node can inherit the cluster's flow and replace its local flow, but only if doing so would not result in data loss.)

Role of the Primary Node:
The elected primary node is responsible for scheduling the execution of any processor on the canvas that is configured for "primary node" only execution. This is configured in a processor's configuration "Scheduling" tab. Primary-node-scheduled processors display a "P" in the upper left corner. NOTE: Only processors with no inbound connections should ever be set to "primary node" execution. Doing so on processors with inbound connections can lead to FlowFiles becoming stuck in those connections when the elected primary node changes.

Not all protocols are "cluster" friendly, so primary node execution helps dataflow designers work around that limitation while still benefiting from a multi-node cluster. NiFi has numerous "List<XYZ>" and "Fetch<XYZ>" type processors, typically used to handle non-cluster-friendly protocols. I'll use ListFile and FetchFile as an example. Let's say our 3-node cluster has the same network directory mounted on every node. If I were to add the ListFile processor, leave it configured with "all nodes" execution, and configure it to list files on that shared mount, all three nodes in the NiFi cluster would produce FlowFiles for all the files listed (so you'd have files in triplicate). Now, if I were to configure my ListFile with "primary node" execution, ListFile would only get scheduled to execute on the currently elected primary node (these processors also record cluster state in ZooKeeper, so a change of elected primary node does not result in a re-listing of the same files).

To prevent overloading the primary node, the list-based processors do not retrieve the source content. ListFile only creates a 0-byte FlowFile with attributes/metadata about the source file. The list-based processor is then connected downstream to its corresponding fetch processor; FetchFile, for example, would then use the metadata from the 0-byte FlowFile to fetch the content and add it to the FlowFile. On the connection between ListFile and FetchFile you would configure cluster load balancing (for example, basic round robin; a connection with load balancing configured also renders a bit differently on the canvas). What happens on this connection is that all the 0-byte FlowFiles are redistributed round-robin style to all connected nodes. Then, on each node, FetchFile fetches the content for that node's subset of FlowFiles. This reduces the need to transmit content over the network between nodes and reduces disk I/O on the primary node, since it is not fetching all the content. If you search the Apache NiFi documentation you will see many list-and-fetch combination processors.

Any source processor (one with no inbound connection) could be configured for primary node only execution, but only schedule a source processor as primary node execution if required. Doing so on processors like ConsumeKafka, which use cluster-friendly protocols, would just impact performance.

Hope this answers your question on the difference between the Cluster Coordinator and Primary Node roles in a NiFi cluster. Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
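The list/fetch round-robin distribution described above can be sketched roughly as follows (node names, file paths, and the dict-based "FlowFile" shape are all hypothetical, for illustration only):

```python
from itertools import cycle

nodes = ["node1", "node2", "node3"]

def list_files(paths):
    """Primary-node-only step: one 0-byte, metadata-only FlowFile per file."""
    return [{"filename": p, "content": b""} for p in paths]

def round_robin(flowfiles, nodes):
    """Load-balanced connection: deal the FlowFiles out across the cluster."""
    assignment = {n: [] for n in nodes}
    for ff, node in zip(flowfiles, cycle(nodes)):
        assignment[node].append(ff)
    return assignment

# The elected primary node lists 7 files from the shared mount...
flowfiles = list_files([f"/mnt/shared/file{i}.csv" for i in range(7)])

# ...and the load-balanced connection spreads them so each node's
# FetchFile only retrieves content for its own subset.
assignment = round_robin(flowfiles, nodes)
for node, subset in assignment.items():
    print(node, [ff["filename"] for ff in subset])
```

Each node ends up with roughly a third of the listings, which is why the primary node's disk I/O stays low even though it performed the entire listing.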
05-03-2024
07:24 AM
1 Kudo
@manishg I recommend starting a new community question with your detailed query. This community question is 5.5 years old for a 5.5 year old version of NiFi. A lot has evolved and changed since then. Make sure with your query you include details about your NiFi version. Thank you, Matt
04-29-2024
05:05 AM
@SAMSAL Without being indexed, I can't think of any other way to parse the provenance data.
04-26-2024
06:28 AM
@AlexisRub Not sure how to answer that for you. Typically, production users who have access to a corporately managed LDAP/AD would use that with their NiFi. This provides better security, as corporate IT can manage the addition of new users and the removal of users no longer with the organization. If you set up the ldap-user-group-provider in NiFi's authorizers.xml along with the ldap-provider in login-identity-providers.xml, you'll have a proper production setup.

Let's say a new person joins the company and is added to AD. The ldap-user-group-provider (depending on filters) could automatically pull that new user identity into NiFi, allowing your NiFi admin to easily set up access policies for them. And with the ldap-provider, that user could then authenticate to your NiFi (successful authentication does not mean they would have authorized access).

Even better, this opens the ability to use LDAP/AD-managed groups for authorization. Let's say you have an AD group named nifiadmins. You could sync this group and its members to NiFi via the ldap-user-group-provider and set up local authorization policies using that group identity. Later, some user is added to or removed from the AD "nifiadmins" group. When NiFi syncs with LDAP/AD via the ldap-user-group-provider (default is every 30 minutes), that user is added or removed as a known member of that group and gains or loses those authorizations without any manual action within NiFi.

This is the most common setup for production end users with established LDAP/AD groups for the different teams that will access NiFi. Different teams can then be authorized access to only specific process groups and actions. I set up a local LDAP that creates a bunch of fake users and groups that I can manage for testing purposes, but that is not something I would do in a production setup. I would leave the corporate management of users to those responsible for that access control.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
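Conceptually, each ldap-user-group-provider sync interval boils down to a membership diff like this (a hypothetical sketch; the user names and `sync_group` helper are made up, not NiFi API):

```python
# Diff the group membership reported by LDAP/AD against what NiFi
# currently knows, so authorizations follow the directory automatically.
def sync_group(nifi_members, ldap_members):
    """Return (to_add, to_remove) for one sync interval."""
    to_add = ldap_members - nifi_members      # joined the AD group
    to_remove = nifi_members - ldap_members   # left the AD group (or company)
    return to_add, to_remove

nifi = {"alice", "bob"}
ldap = {"alice", "carol"}   # bob left the company, carol joined
added, removed = sync_group(nifi, ldap)
print(added, removed)       # carol gains the group's authorizations, bob loses them
```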