Member since
07-30-2019
3471
Posts
1642
Kudos Received
1020
Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| 168 | 06-03-2026 06:06 PM | |
| 461 | 05-06-2026 09:16 AM | |
| 832 | 05-04-2026 05:20 AM | |
| 499 | 05-01-2026 10:15 AM | |
| 627 | 03-23-2026 05:44 AM |
05-20-2024
02:01 PM
1 Kudo
Hello @hegdemahendra Always very helpful if you include the exact version of Apache NiFI, Cloudera HDF, or Cloudera CFM being used. My guess here would be one or both of the following: You have multiple FlowFiles all pointing at the same content claims queued in connections within your dataflow(s) on the canvas. As long as a FlowFile exists on the canvas it will exist in flowfile_repository. Users should avoid leaving FlowFiles queued in connection on NiFi. Some users tend to allow FlowFile to accumulate at stopped processor components rather then auto-terminate them. Even if a FlowFile does not have any content its FlowFile attributes/metadata still consume disk space. You are extracting content from your FlowFiles into FlowFile attributes resulting in large FlowFile attribute/metadata being stored in the flowfile_repository. Dataflow designers should avoid extracting large amounts flowfile content in to the FlowFile's attributes. Instead try to build dataflows and utilize components that read content from the FlowFile's content instead of from FlowFile attributes. Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
... View more
05-20-2024
01:38 PM
@galt @RAGHUY Let me add some correction/clarity to the accepted solution. Export and Modify Flow Configuration: Export the NiFi flow configuration, typically in XML format. This can be done via the NiFi UI or by utilizing NiFi's REST API. Then, manually adjust the XML to change the ID of the connection to the desired value. It is not clear here what is being done. The only way to export a flow configuration from NiFi in XML format is via generating a NiFi template (deprecated and removed in Apache NIFi 2.x versions). Even if you were to generate a template and export is via NiFi UI or NiFi's rest-api, modifying it will not change what is on the canvas. If you were to modify the connection component UUID in all places in the template. Upon upload of that template back in to NiFI, you would need to drop the template on the the canvas which would result in every component in that template getting a new UUID. So this does not work. In newer version of NiFi 1.18+ NiFi supports newer flow definitions which are in json format. but same issue persists here when using flow definitions in this manor. In a scenario like the one described in this post where user removed a connection by mistake and then re-created it, the best option is to restore/revert the previous flow. Whenever a change is made to the canvas, NIFi auto archives the current flow.xml.gz (legacy) and flow.json.gz (current) file in to an archive sub-directory and generates a new flow.xml.gz/flow.json.gz file. Best and safest approach approach is to shutdown all nodes in your NiFi cluster. Navigate to the NiFi conf directory and swap current flow.xml.gz/flow.json.gz files with the archived flow.xml.gz/flow.json.gz files still containing the connection with original needed ID. When above is not possible (maybe change went unnoticed for to long and all archive version have new connection UUID), you need to manually modify the flow.xml.gz/flow.json.gz files. Shutdown all your NiFi nodes to avoid any changes being made on canvas while performing following steps. Option 1: Make backup of current flow.xml.gz and flow.json.gz Search each file for original UUID to make sure it does not exist. On one node manually modify the flow.xml.gz and flow.json.gz files by locating the current bad UUID and replacing it with the original needed UUID. Copy the modified flow.xml.gz and flow.json.gz files to all nodes in the cluster replacing original files. this is possible since all nodes run same version of flow. Option 2: same as option 1 same as option 1 same as option 1 Start NiFi only on the node where you modified the flow.xml.gz and flow.json.gz files. On all other nodes still stopped, remove or rename the flow.xml.gz and flow.json.gz files. Start all the remaining nodes. since they do not have a flow.xml.gz or flow.json.gs to load, they will inherit the flow from the cluster as they join the cluster. NOTE: The flow.xml.gz was replaced by the newer flow.json.gz format starting with Apache NiFi 1.16. When NiFi is 1.16 or newer is started with and only has a flow.xml.gz file, it will load from flow.xml.gz and then generate the new flow.json.gz format. Apache NiFi 1.16+ will load only from the flow.json.gz on startup when that file exists, but will still write out both the flow.xml.gz and flow.json.gz formats anytime a change is made to the canvas. With Apache NiFi 2.x+ version the flow.xml.gz format will go away. Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
... View more
05-20-2024
12:36 PM
@SAMSAL This is not a new problem, but rather something that has existed with NiFi on Windows fro a very long time. You'll need to avoid using space in directory names or warp that directory name in quotes to avoid the issue. NIFI-200 - Bootstrap loader doesn't handle directories with spaces in it on Windows Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
... View more
05-07-2024
01:30 AM
1 Kudo
Thanks @MattWho for the quick response. The short version of "why does it doesn't work" means is that nifi keeps working, recognizes the addition of the NARs but no "1.24" versions of components are available in the Processor selection window. I promptly tried the "M1" download workaround and found this as soon as I wget'ed the NAR: text version of the above: INFO Found ./extensions/nifi-standard-shared-nar-2.0.0-M1.nar in auto-load directory
INFO Starting load process for 1 NARs...
INFO Creating class loaders for 1 NARs...
WARN Unable to resolve required dependency 'nifi-standard-services-api-nar'. Skipping NAR '/usr/local/lib/nifi-2.0.0-M2/./work/nar/extensions/nifi-standard-shared-nar-2.0.0-M1.nar-unpacked'
INFO Successfully created class loaders for 0 NARs, 1 were skipped
INFO Finished NAR loading process! so, since it asks for nifi-standard-shared, I'll download it also and: text version: INFO Found ./extensions/nifi-standard-services-api-nar-2.0.0-M1.nar in auto-load directory
INFO Starting load process for 1 NARs...
INFO Including 1 previously skipped bundle(s)
INFO Creating class loaders for 2 NARs...
WARN While loading 'org.apache.nifi:nifi-standard-services-api-nar:2.0.0-M1' unable to locate exact NAR dependency 'org.apache.nifi:nifi-jetty-bundle:2.0.0-M1'. Only found one possible match 'org.apache.nifi:nifi-jetty-bundle:2.0.0-M2'. Continuing...
INFO Loaded NAR file: /usr/local/lib/nifi-2.0.0-M2/./work/nar/extensions/nifi-standard-services-api-nar-2.0.0-M1.nar-unpacked as class loader org.apache.nifi.nar.NarClassLoader[./work/nar/extensions/nifi-standard-services-api-nar-2.0.0-M1.nar-unpacked]
INFO Loaded NAR file: /usr/local/lib/nifi-2.0.0-M2/./work/nar/extensions/nifi-standard-shared-nar-2.0.0-M1.nar-unpacked as class loader org.apache.nifi.nar.NarClassLoader[./work/nar/extensions/nifi-standard-shared-nar-2.0.0-M1.nar-unpacked]
INFO Successfully created class loaders for 2 NARs, 0 were skipped EDIT: I also downloaded the jetty package, but: INFO Starting load process for 1 NARs...
ERROR Found a Jetty NAR, will not auto-load /usr/local/lib/nifi-2.0.0-M2/./extensions/nifi-jetty-bundle-2.0.0-M1.nar
INFO No NARs were unpacked, nothing to do so, it looks like it's acknowledging the NARs I provide, but after reloading the interface: You've been very helpful (and quick) providing the causes and several workarounds (I'll go the "build your URL on the fly" way), so thank you a lot and my double "accept as solution" click 🙂
... View more
05-03-2024
08:41 AM
2 Kudos
@manishg The elected cluster coordinator (elected by Zookeeper) is responsible for receiving and processing heartbeats from other nodes in the cluster. It handles the connecting, reconnecting, and manual disconnecting of NiFi nodes. The Cluster coordinator is also responsible for replicating user request to all nodes in the cluster and get confirmation from those nodes that the request was completed successfully. Assume a 3 node cluster with following: node1 - elected cluster coordinator node2 - elected primary node node3 Role of Cluster Coordinator: A user can access the NiFi cluster via any of the 3 node's URL. So lets say a user logs in node3's UI. When that user interacts with node 3 UI that request is proxied to the currently elected cluster coordinator node that in turn replicates the request all 3 nodes (example: add a processor, configure a processor, empty a queue, etc...). If one of the nodes were to fail to complete the request, that node would get disconnected. I may attempt to auto-reconnect later (In newer version of NiFi a connecting node can inherit the clusters flow and replace it local flow only if doing so would not result in dataloss. Role of the Primary Node: The elected primary node is responsible for scheduling the execution of any NiF component processor on the canvas that is configured for primary node only. This is configured in a processor's configuration "scheduling" tab: Primary node scheduled processors with display a "P" in the upper left corner as seen above. NOTE: ONLY processors with no inbound connections should ever be set to "primary node" execution. Doing so on processor with inbound connection can lead to FlowFiles becoming stuck in those connection when the elected primary node changes. Not all protocols are "cluster" friendly, so primary node execution helps dataflow designers work around that limitation while still benefiting from having a multi-node cluster. NiFi has numerous "List<XYZ>" and "Fetch<XYZ>" type processor typically used to handle non cluster friendly protocols. I'll use ListFile and FetchFile as an example. Let say our 3 node cluster as the sane network directory mounted to every node. If I was to add the ListFile processor and leave it configured with "all nodes" execution and configure it to list files on that shared mount. All three nodes in the NiFi cluster would produce FlowFiles for all the files listed (so you have files in triplicate). Now if i were to configure my ListFile with "primary node" execution, the listFile would only get scheduled to execute on the currently elected primary node (these processor also record cluster state in ZK so that if a elected primary node changes it doe snot result in a re-listing of the same files again). To prevent overloading the primary node, the list based processors do not retrieve the source content. It only creates a 0 byte FlowFile with attributes/metadata about the source file. So the List based processor would then be connected downstream to its corresponding FetchFile processor. The FetchFile for example would the use the metadata from the 0 byte FlowFile to fetch the content and add it to the FlowFile. On the connection between ListFile and FetchFile you would configure cluster load balancing. Here you can see I selected basic round robin. You'll notice a connection with load balancing configured will also render a bit different: What happen on this connection is that all the 0 bytes FlowFiles will be redistributed in round robin style to all connected nodes. Then on each node the FetchFile will get each nodes subset of FlowFiles content. This reduce need to transmit content of network between nodes and reduces disk IO on primary node since it is not fetching all the content. If you search the Apache NiFi documentation you will see many list and fetch combination type processors. But any source processor (one with no inbound connection) could be configured for primary node only. But only schedule a source processor as primary node execution if required. Doing so on processors like ConsumeKafka for example that uses cluster friendly protocols would just impact performance. Hope this answers your question only what the difference is between Cluster Coordinator and Primary Node roles in a NiFi cluster. Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
... View more
04-29-2024
05:05 AM
@SAMSAL Without being indexed, I can't think of any other way to parse the provenance data.
... View more
04-26-2024
06:28 AM
@AlexisRub Not sure how to answer that for you. Typically production users who have access to a corporately managed LDAP/AD would use that with their NiFi. This provide better security as corporate can mange that adding of new users or removal of users no longer with the organization. If you also setup the ldap-user-group-provider in NiFi authorizers.xml along with setting of the ldap-provider in the login-identity-providers.xml you'll have a proper production setup. Let's say a new person joins the company and is added to the AD. the ldap-user-group-provider (depending on filters) could automatically pull in that new user identity to NiFi allowing your NiFi admin to setup access policies for them easily. And with the ldap-provider that user could then authenticate to your NiFi (successful authentication does not mean they would have authorized access). Even better is this opens the ability to use ldap/AD managed groups for authorization. Let's say you have AD group named nifiadmins. You could sync this group and its members to NiFi via the ldap-user-group-provider and set up local authorization policies using that group identity. So later some user is added or removed from the AD "nifiadmins" group. When NiFi syncs with ldap/AD via ldap -user-group-provider (default is every 30 mins), that user would be added or removed as a known member of that group and would gain or lose authorizations without needing any manual action within NiFi to make that happen. This is most common setup fro production end users with established ldap/AD groups for different teams that will access NiFi. Different teams can then be authorized access to only specific process groups and actions. I setup a local ldap which creates a bunch of fake users and groups that i can manage for testing purposes., but not something I would do in a production setup. I would leave the corporate management of user to those responsible for that access control. Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
... View more
04-26-2024
06:02 AM
@s198 Back Pressure thresholds are configured on NiFi connections between processors. There are two types of back pressure thresholds 1. Object Threshold - Back pressure is applied once the Number of FlowFiles reaches or exceeds the setting (default is 10,000 FlowFiles). Applied per node and not across all nodes in a NiFi cluster. 2. Size Threshold - Back pressure is applied once the total data size of queued FlowFiles reaches or exceeds the setting (default is 1 GB). Applied per node and not across all nodes in a NiFi cluster. When Back pressure is being applied on a connection, it prevents the immediate processor that feeds data into that connection form being scheduled to execute until the back pressure is no longer being applied. Since back pressure is a soft limit, this explains you two different scenarios: 1. 20 FlowFiles being transferred to connection feeding your mergeContent processor. Initially that connection is empty so no back pressure is applied. The preceding processor that starts adding FlowFiles to that connection until the "Size Threshold" of 1 GB was reached and thus back pressure is then applied preventing the preceding processor from being scheduled and processing the remaining 6 files. The max bin age set on your mergeContent processor then forces the bin containing the first 14 FlowFiles to merge after 5 minutes thus removing the back pressure that allowed nect 6 files to be processed by upstream processor. 2. The connection between the FetchHDFS and PutSFTP processor has no back pressure being applied (neither object threshold or size threshold has been reached or exceeded), so the FetchHDFS is scheduled to execute. The execution resulted in a single FlowFile larger then the 1 GB size threshold, so back pressure would be applied as soon as that 100 GB file was queued. As soon as the putSFTP successfully executed and moved the FlowFile to one of it's downstream relationships, the FetchHDFS would have been allowed to get scheduled again. There are also processor that do execute on batches of files in a single execution. The list and split based processors like listFile and splitContent are good examples. It is possible that the listFile processor performs a listing execution containing in excess of 10,000 object threshold. Since no back pressure is being applied that execution will be successful and list create all 10,000+ FlowFiles that get transferred to the downstream connection. Back pressure will then be applied until the number of FlowFiles drops back below the threshold. That means as soon as it drops to 9,999 back pressure would be lifted and the listFile processor would be allowed to execute. In your mergeContent example you did the proper edit to object size threshold to allow more FlowFiles to queue in the upstream connection to your mergeContent. If you left the downstream connection containing the "merged" relationship with default size threshold, back pressure would have been applied as soon as the merged FlowFile was added to the connection since its merged size exceeded the 1 GB default size threshold. PRO TIP: You mentioned that your daily merge size may vary from 10 GB to 300GB for your mergeContent. How to handle this in the most efficient way depends really on the number of FlowFiles and no so much on the size of the FlowFiles. Only thing to keep in in mind with size thresholds is the content_repository size limitations. The total disk usage by the content repository is not equal to the size of the actively queued FlowFiles on the canvas due to the fact the content is immutable once created and how NiFi stores FlowFile's content in claims. NiFi holds FlowFile attributes/metadata in NiFi's heap memory for better performance (swapping thresholds exist to help prevent Out of Memory issues but impact performance when swapping is happening). NiFi sets object threshold at 10,000 because swapping does not happen at that default size. When merging batches of FlowFiles in very large number you can get better performance from two MergeContent processors in series instead of just one. To help you understand above more, I recommend reading the following two articles: https://community.cloudera.com/t5/Community-Articles/Dissecting-the-NiFi-quot-connection-quot-Heap-usage-and/ta-p/248166 https://community.cloudera.com/t5/Community-Articles/Understanding-how-NiFi-s-Content-Repository-Archiving-works/ta-p/249418 Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
... View more
04-23-2024
07:56 AM
Thank you, @MattWho for providing timely responses and quick solutions to the queries. You are really helping the community grow. Hats off to you. Appreciate it
... View more
04-22-2024
12:58 PM
@MattWho You will see this behavior even if you don't use docker only nifi standard binaries. I validated this copying those dirs below from 1.25.0 installation to 2.0.0-M2 installation both without using docker. conf/ content_repository/ database_repository/ flowfile_repository/ provenance_repository/ state/ Thank you again
... View more