Member since: 07-30-2019
Posts: 3421
Kudos Received: 1624
Solutions: 1010
05-21-2024
11:33 AM
@Racketmojster So you do want to ingest all the files within a specific parent directory path. Use case details are important here; I was not sure if perhaps your script fetched the files when passed a folder date string. There are certainly challenges with using the MergeContent processor which may require modifications to your script to make it work. MergeContent provides numerous Merge Formats for how you want this processor to merge the content from multiple source FlowFiles:
- The default is binary concatenation, which just appends the bytes of one FlowFile to the end of the previous FlowFile's content. You could specify a delimiter to keep track of the filename of each section of bytes and to mark where one file's bytes start and end. This would require your script to parse the binary concatenated content, split the files by delimiter, and obtain each file's name from the delimiter. This is rather messy.
- You could use the ZIP or TAR format to merge the files into a ZIP or TAR file that retains the directory structure and filenames. This format would require that your script untar or unzip the content in order to process the individual files within the bundle (a sketch of that follows below). This is the less messy option.
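If you go the TAR route, the unpacking side of the script could look like this minimal sketch, assuming ExecuteStreamCommand passes the merged TAR content to the script on stdin (the per-file handling here is a placeholder):

```python
import io
import sys
import tarfile

# ExecuteStreamCommand pipes the incoming FlowFile content (the merged TAR) to stdin.
bundle = io.BytesIO(sys.stdin.buffer.read())

with tarfile.open(fileobj=bundle, mode="r") as tar:
    for member in tar.getmembers():
        if not member.isfile():
            continue
        data = tar.extractfile(member).read()
        # member.name retains the original relative path and filename, so
        # per-file logic can branch on it here. Diagnostics go to stderr
        # because stdout becomes the output FlowFile content.
        print(f"{member.name}: {len(data)} bytes", file=sys.stderr)
```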
Once you have decided on a merge format, you need to configure MergeContent so that all the files from within a unique parent directory (unique date directory) are merged with one another. This is where the "Correlation Attribute Name" property is used: all source FlowFiles whose value for the configured FlowFile attribute is identical are placed in the same MergeContent bin. For example, you might use "absolute.path" in this property, but you'll need to make sure this works in your use case, since I don't know what your complete source directory tree looks like. You could also use an UpdateAttribute processor before MergeContent to create a new FlowFile attribute containing just the date string extracted from "absolute.path", and then use that new FlowFile attribute as your correlation attribute.

Next, you need to make sure MergeContent does not merge a bin before all expected FlowFiles have been added to it. The way the MergeContent processor works is that when executed, it reads FlowFiles from the inbound connection and allocates them to bins. After allocating a FlowFile to a bin, it checks whether that bin is eligible to be merged. Since NiFi by default executes processors as fast as possible (milliseconds count here), it is possible that at the very moment it looks at the inbound connection, not all FlowFiles from a specific directory are present yet. Whether a bin should be merged is controlled by the "Minimum Number of Entries", "Minimum Group Size", and "Max Bin Age" properties. If both minimum settings are satisfied, the bin is merged. If both minimums have not been met, but the max bin age (the age of the bin since the first FlowFile was allocated to it) has been reached or exceeded, the bin is merged anyway. With the defaults you can see how easily a bin may get merged before all necessary FlowFiles have been added to it.

The "Maximum number of Bins" is also very important here. If you have 5 bins (the default) and 20 source dated directories, it becomes possible that FlowFiles get allocated by the correlation attribute to 5 different bins, and then another FlowFile is processed that belongs to none of those 5 bins (a 6th dated directory). What happens in this scenario is that MergeContent forces the merge of the oldest bin, regardless of minimums or max bin age, in order to free a bin for the next FlowFile. If all your source dated directories contain the same number of files, setting the minimum number of entries is easy, but that is probably not the case. So what a user would typically do is set the minimum number of entries to a value larger than any source directory would ever have, to prevent a bin from merging until the max bin age is reached. This introduces latency in your dataflow equal to that configured max bin age. I hope this helps you understand the MergeContent configuration options better within the context of your specific dataflow needs.

That said, the above does not seem like an efficient dataflow to me. ListFile lists files from the local filesystem, then FetchFile fetches the content for those listed files and adds it to the FlowFiles. Then, if you use MergeContent to create a tar or zip, your script needs to untar or unzip that bundle somewhere to process the files. So you are effectively reading content from a directory (which means writing content to NiFi's content repository), then tar/zipping it (another write to NiFi's content repository), and then your script untars/unzips the files (another write somewhere local) in order to process them. That is why in my original response I suggested avoiding FetchFile and using only ListFile to get the unique dated source directory, passing that absolute path to your script so it processes the files directly out of the source path. That reduces latency and a lot of disk I/O.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
05-21-2024
10:38 AM
@udayAle Please start a new community question for your unrelated follow-up question above. Responses to an unrelated question will confuse other community members who may be having similar problems. You can use @<username> to notify specific people about new community questions. Thank you, Matt
05-20-2024
02:18 PM
2 Kudos
@manishg How many CPU cores does each of your NiFi hosts have? A load average of 1 means you are using 100% of 1 CPU core on average; 20 means you are using 100% of 20 cores on average; etc. So let's say your node has 8 cores but your load average is higher than 8: this means your CPU is saturated and being asked to perform more work than can be handled efficiently. This leads to long thread execution times and can interfere with timely heartbeats being sent by nodes or processed by the elected cluster coordinator. Oftentimes this is triggered by too many concurrent tasks on high-CPU-usage processors, high FlowFile volume, etc. You can ultimately design a dataflow that simply needs more CPU than you have to work at the throughput you need.

Users commonly just start configuring more and more concurrent tasks and set the Max Timer Driven thread pool way too high for the number of cores available on a node. This allows more threads to execute concurrently, but just results in each thread taking longer to complete as its time is sliced on the CPU: thread 1 gets some time on CPU 1 and then goes into a wait state as another thread gets some time; eventually thread 1 will get a bit more time. For millisecond threads that is not a big deal, but for CPU-intensive processors it can cause issues. Let's say you have numerous CPU-intensive threads executing at the same time and the heartbeat is scheduled; the scheduled heartbeat thread is now waiting in line for time on the CPU. Sometimes an alternate dataflow design that uses less CPU can be used. Sometimes you can add more nodes. Sometimes you need to move some dataflows to a different cluster. Sometimes you just need more CPU.
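As a quick sanity check (a minimal sketch, not NiFi tooling), you can compare the load averages against the core count on a host:

```python
import os

cores = os.cpu_count()
load_1m, load_5m, load_15m = os.getloadavg()  # POSIX hosts only

# A sustained load average above the core count means the CPU is saturated
# and threads (including cluster heartbeats) are queuing for CPU time.
for label, load in (("1m", load_1m), ("5m", load_5m), ("15m", load_15m)):
    status = "saturated" if load > cores else "ok"
    print(f"{label}: load {load:.2f} across {cores} cores -> {status}")
```

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt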
05-20-2024
02:01 PM
1 Kudo
Hello @hegdemahendra It is always very helpful if you include the exact version of Apache NiFi, Cloudera HDF, or Cloudera CFM being used. My guess here would be one or both of the following:
1. You have multiple FlowFiles, all pointing at the same content claims, queued in connections within your dataflow(s) on the canvas. As long as a FlowFile exists on the canvas, it will exist in the flowfile_repository. Users should avoid leaving FlowFiles queued in connections in NiFi; some users tend to allow FlowFiles to accumulate at stopped processor components rather than auto-terminate them. Even if a FlowFile does not have any content, its FlowFile attributes/metadata still consume disk space.
2. You are extracting content from your FlowFiles into FlowFile attributes, resulting in large FlowFile attributes/metadata being stored in the flowfile_repository. Dataflow designers should avoid extracting large amounts of FlowFile content into the FlowFile's attributes. Instead, try to build dataflows and utilize components that read from the FlowFile's content instead of from FlowFile attributes.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
05-20-2024
01:38 PM
@galt @RAGHUY Let me add some correction/clarity to the accepted solution, which said: "Export and Modify Flow Configuration: Export the NiFi flow configuration, typically in XML format. This can be done via the NiFi UI or by utilizing NiFi's REST API. Then, manually adjust the XML to change the ID of the connection to the desired value."

It is not clear here what is being done. The only way to export a flow configuration from NiFi in XML format is by generating a NiFi template (deprecated, and removed in Apache NiFi 2.x versions). Even if you were to generate a template and export it via the NiFi UI or NiFi's REST API, modifying it would not change what is on the canvas. If you were to modify the connection component UUID in all places in the template and upload that template back into NiFi, you would need to drop the template on the canvas, which would result in every component in that template getting a new UUID. So this does not work. Newer versions of NiFi (1.18+) support flow definitions, which are in JSON format, but the same issue persists when using flow definitions in this manner.

In a scenario like the one described in this post, where a user removed a connection by mistake and then re-created it, the best option is to restore/revert the previous flow. Whenever a change is made to the canvas, NiFi auto-archives the current flow.xml.gz (legacy) and flow.json.gz (current) files into an archive sub-directory and generates new flow.xml.gz/flow.json.gz files. The best and safest approach is to shut down all nodes in your NiFi cluster, navigate to the NiFi conf directory, and swap the current flow.xml.gz/flow.json.gz files with the archived flow.xml.gz/flow.json.gz files still containing the connection with the original needed ID.

When the above is not possible (maybe the change went unnoticed for too long and all archived versions have the new connection UUID), you need to manually modify the flow.xml.gz/flow.json.gz files. Shut down all your NiFi nodes to avoid any changes being made on the canvas while performing the following steps.

Option 1:
1. Make a backup of the current flow.xml.gz and flow.json.gz.
2. Search each file for the original UUID to make sure it does not exist.
3. On one node, manually modify the flow.xml.gz and flow.json.gz files by locating the current bad UUID and replacing it with the original needed UUID (a sketch of this search-and-replace follows below, after the note).
4. Copy the modified flow.xml.gz and flow.json.gz files to all nodes in the cluster, replacing the original files. This is possible since all nodes run the same version of the flow.

Option 2:
1. Same as Option 1.
2. Same as Option 1.
3. Same as Option 1.
4. Start NiFi only on the node where you modified the flow.xml.gz and flow.json.gz files.
5. On all other nodes, still stopped, remove or rename the flow.xml.gz and flow.json.gz files.
6. Start all the remaining nodes. Since they do not have a flow.xml.gz or flow.json.gz to load, they will inherit the flow from the cluster as they join it.

NOTE: The flow.xml.gz was replaced by the newer flow.json.gz format starting with Apache NiFi 1.16. When NiFi 1.16 or newer is started and has only a flow.xml.gz file, it will load from flow.xml.gz and then generate the new flow.json.gz format. Apache NiFi 1.16+ will load only from the flow.json.gz on startup when that file exists, but will still write out both the flow.xml.gz and flow.json.gz formats anytime a change is made to the canvas. With Apache NiFi 2.x versions, the flow.xml.gz format goes away.
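Steps 1 through 3 could be done with something like this minimal Python sketch for flow.json.gz (the UUID values are placeholders to substitute; repeat the same for flow.xml.gz):

```python
import gzip
import shutil

BAD_UUID = "replace-with-current-bad-uuid"       # placeholder: UUID of the re-created connection
GOOD_UUID = "replace-with-original-needed-uuid"  # placeholder: the original connection UUID

# Step 1: back up the current file before touching it.
shutil.copy("flow.json.gz", "flow.json.gz.bak")

with gzip.open("flow.json.gz", "rt", encoding="utf-8") as f:
    flow = f.read()

# Step 2: make sure the original UUID does not already exist in the flow.
if GOOD_UUID in flow:
    raise SystemExit("Original UUID already present in flow; aborting.")

# Step 3: swap the bad UUID for the original needed UUID and write it back.
with gzip.open("flow.json.gz", "wt", encoding="utf-8") as f:
    f.write(flow.replace(BAD_UUID, GOOD_UUID))
```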
Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
05-20-2024
12:51 PM
@jpconver2 @RAGHUY No 1.x version of NiFi supports rolling upgrades. With the major NiFi 2.x release, NiFi added rolling upgrade support as part of NIFI-12016 - Improve leniency for bundle compatibility to allow for rolling upgrades. That Apache NiFi Jira does a great job of explaining why this was historically not implemented in the NiFi 1.x branch. This new feature improvement is included in NiFi 2.0.0-M1 (NiFi 2.0.0 milestone 1) and newer. Please help our community grow. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
05-20-2024
12:36 PM
@SAMSAL This is not a new problem, but rather something that has existed with NiFi on Windows for a very long time. You'll need to avoid using spaces in directory names, or wrap the directory name in quotes to avoid the issue. See NIFI-200 - Bootstrap loader doesn't handle directories with spaces in it on Windows. Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt
05-20-2024
12:28 PM
1 Kudo
@Lorenzo Based on the log output shared, the SPNEGO-based authentication was successful and you have an authorization problem for your SPNEGO-authenticated user. NiFi authorization is case sensitive, so the user identity returned via the kerberos-provider login provider is likely not the exact same user identity string returned via SPNEGO-based Kerberos authentication: "myuserad" is a different user identity than "myaduser", which is different than "MyAduser", which is different than "myaduser@domain.com", etc. NiFi provides identity mapping properties which can be used to manipulate the user identity returned by the different user authentication methods before the final manipulated user identity is passed over to the NiFi authorizer to check for proper authorization(s). These are added to the nifi.properties file (see Identity Mapping Properties). NOTE: keep in mind that mapping patterns are checked against the user identity output during authentication in alphanumeric order of the property names. The first pattern (regex) to match has its value and transform applied, at which point no additional mapping patterns are evaluated. So as your pattern regular expressions get more generic, the farther down the alphanumeric list they need to be.
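As an illustration only (your patterns will differ), a mapping in nifi.properties that strips the Kerberos realm and lowercases the result might look like the following; the ".kerb" suffix is an arbitrary label I made up:

```
nifi.security.identity.mapping.pattern.kerb=^(.*?)@(.*?)$
nifi.security.identity.mapping.value.kerb=$1
nifi.security.identity.mapping.transform.kerb=LOWER
```

This would map "MyAduser@DOMAIN.COM" to "myaduser". Note that an identity without an "@" would not match this pattern and would pass through unchanged, so you may need an additional, more generic pattern (named so it sorts later alphanumerically) to normalize those identities as well.

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt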
05-20-2024
12:15 PM
@Racketmojster A NiFi FlowFile consists of FlowFile content (the physical bytes of the source content) and FlowFile attributes (metadata about the content and the FlowFile). You did not share how you are retrieving the parent folder: "My nifi setup is to retrieve the date folders inside the parent directory of Logfolder". I assume maybe you tried to use GetFile or FetchFile? It is also not very clear what you are trying to do from this statement: "date folders as a whole is passed to the python script". A detailed use case might help here.

NiFi works with content, and directories are not content; they are tracked in NiFi more as attributes/metadata related to the content ingested. If you look at the FlowFile attributes on the FlowFile created for logs.txt, you should see an "absolute.path" attribute that holds the full path to the logs.txt file. If that is all you need, and you have no need to actually get the content of the logs.txt file itself into NiFi, you could use just the ListFile processor to get only the metadata/attributes about the found file and pass the value from "absolute.path" to your ExecuteStreamCommand script. If each of your date folders contains multiple files, you would need to design a dataflow that accounts for that and eliminates duplicates, so you only execute your ExecuteStreamCommand script once per date folder. For example, use the ReplaceText processor (Always Replace strategy) to write the "absolute.path" value, or the date portion of the full path, to the content of the FlowFile, then use the DetectDuplicate processor to purge all duplicates before your ExecuteStreamCommand processor.
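The script side of that could be as simple as this minimal sketch, assuming ExecuteStreamCommand is configured to pass the "absolute.path" value as a command argument (the "*.txt" filter and the line counting are placeholders for your real processing):

```python
import sys
from pathlib import Path

# The dated directory path arrives as the first command argument,
# e.g. ExecuteStreamCommand passing the value of "absolute.path".
folder = Path(sys.argv[1])

for log_file in sorted(folder.glob("*.txt")):
    # Process each file directly out of the source directory; the file
    # content never has to pass through NiFi's content repository.
    line_count = len(log_file.read_text().splitlines())
    print(f"{log_file.name}: {line_count} lines")
```

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt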
05-06-2024
10:40 AM
@jonay__reyes or @DeepakDonde The issue you are encountering is caused by a code change making InvokeHTTP encode URLs automatically. This issue is triggered if your URL is already encoded: the URL-encoding change will convert all '%' to '%25'. Workarounds/solutions:
- Remove your URL encoding and allow the processor to do that encoding.
- If your URL is not already URL encoded and happens to contain '%', you can do the following:
  - If using Apache NiFi 1.25 versions:
    - Upgrade to NiFi 1.26, which contains fix NIFI-12842 (now released).
    - Try adding the Apache NiFi 1.26 version of the NiFi standard NAR to your 1.25 install.
    - Downgrade to Apache NiFi 1.24.
  - If using Apache NiFi 2.0.0-M2:
    - Wait for the upcoming release of NiFi 2.0.0-M3, which will contain the fix, and upgrade.
    - Downgrade to Apache NiFi 2.0.0-M1.
    - Try adding the Apache NiFi 2.0.0-M1 standard NAR to your 2.0.0-M2 install and switch to using the older 2.0.0-M1 InvokeHTTP processor.
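To see why an already-encoded URL breaks, here is a small illustration (the path is made up) of what encoding an encoded URL a second time does to '%':

```python
from urllib.parse import quote

raw = "/api/items/a b"    # unencoded path containing a space
encoded = quote(raw)      # '/api/items/a%20b' (correct single encoding)
double = quote(encoded)   # '/api/items/a%2520b' (the '%' became '%25')
print(encoded)
print(double)
```

Please help our community thrive. If you found any of the suggestions/solutions provided helped you with solving your issue or answering your question, please take a moment to login and click "Accept as Solution" on one or more of them that helped. Thank you, Matt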