Member since: 07-30-2019
Posts: 3406
Kudos Received: 1622
Solutions: 1008

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 311 | 12-17-2025 05:55 AM |
| | 372 | 12-15-2025 01:29 PM |
| | 351 | 12-15-2025 06:50 AM |
| | 339 | 12-05-2025 08:25 AM |
| | 589 | 12-03-2025 10:21 AM |
10-19-2021
01:18 PM
@AA24 NiFi was designed for always-on dataflows, which is why the NiFi processor components support the "Timer Driven" and "Cron Driven" scheduling strategies. That being said, the ability to tell a processor to "Run Once" does exist within NiFi. You can do this manually in the UI by right-clicking on the processor and selecting "Run Once" from the pop-up context menu.

The next thing to keep in mind is that anything you can do via the UI, you can also do via a curl command. So it is possible to build a dataflow that triggers the "run once" REST API call against the processor you want to use to fetch from the appropriate DB. You cannot execute "run once" against a process group (PG), nor would I recommend doing so. You only want to trigger the processor responsible for ingesting your data and leave all the other processors running all the time so they process whatever data they have queued at any time.

First you need to create your trigger flow: for example, a GetFile processor to consume the trigger file, followed by a RouteOnContent processor that sends the FlowFile either to an InvokeHTTP configured to invoke run-once on your Oracle-configured processor or to an InvokeHTTP configured to invoke run-once on your MySQL-configured processor. Using your browser's developer tools is an easy way to capture the REST API calls that are made when you perform the same actions manually via the UI (a rough sketch of such a call is included at the end of this reply).

If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post. Thank you, Matt
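As a rough, unverified sketch of what that "run once" call can look like: the processor UUID, host/port, bearer token, and revision version below are all placeholders, and the exact payload should be confirmed by capturing the request from the UI with your browser's developer tools.

```bash
# Hypothetical example: ask a single processor to run once via the NiFi REST API.
# <processor-uuid>, the host, the token, and the revision version are placeholders.
curl -X PUT "https://nifi-host:8443/nifi-api/processors/<processor-uuid>/run-status" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <token>" \
  -d '{"revision":{"version":0},"state":"RUN_ONCE","disconnectedNodeAcknowledged":false}'
```

The same pattern applies when the call is made from an InvokeHTTP processor instead of curl: PUT method, JSON content type, and the run-status payload as the FlowFile content.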
10-19-2021
12:04 PM
@DayDream The ExecuteStreamCommand processor executes a system-level command and not something native within NiFi, so its impact on CPU depends completely on what the command being called is doing. You mention that the ExecuteStreamCommand is just executing a cp command and that the issue happens when you are dealing with a large file. The first thing I would look into is the disk I/O of the source and destination directories the file is being copied from and to.

You also mention that the PutFile is writing out a large FlowFile to disk. This means the processor is reading FlowFile content from the NiFi content_repository and then writing it to some target folder location. I would once again look at the disk I/O of both locations while this is happening. The CPU usage may be high simply because these threads run for a long time waiting on disk I/O.

NiFi uses CPU for its core-level functions, and on top of that you configure an additional thread pool that is used by the NiFi components you add to the canvas. This resource pool is configured via NiFi UI --> Global Menu (upper right corner of UI) --> Controller Settings:
- The "Event Driven" thread pool is experimental and deprecated and is used by processors configured with the event-driven scheduling strategy. Stay away from this scheduling strategy.
- The "Timer Driven" thread pool is used by controller services, reporting tasks, processors, etc. Processors use it when configured with the "Timer Driven" or "Cron Driven" scheduling strategies. This pool is what is available for the NiFi controller to hand out to all processors requesting time to execute.

Setting this value arbitrarily high will simply lead to many NiFi components getting threads to execute but then spending excessive time in CPU wait as the limited cores are time-sliced across all active threads. The general rule of thumb is to set the pool to 2 to 4 times the number of available cores on a single NiFi host/node. So for your 8-core server, you would want this between 16 and 32. This does not mean you can't set it higher, but you should only do so in small increments while monitoring CPU usage over an extended period of time. If you have 5 nodes, this setting is per node, so you would have a thread pool of 16 - 32 on each NiFi host/node.

Another thing you may want to start looking at is the GC stats for your JVM. Is GC (young and old) running very often? Is it taking a long time to run? All GC is a stop-the-world event, so the JVM is simply paused while this is going on, which can also impact how long a thread is "running". You can get some interesting details about your running NiFi using the built-in NiFi diagnostics tool: <path to NiFi>/bin/nifi.sh diagnostics --verbose <path/filename where output should be written>

For a NiFi node to remain connected to the cluster, it must successfully send a heartbeat to the elected cluster coordinator at least 1 out of every 8 scheduled heartbeat intervals. Let's say the heartbeat interval is configured in the nifi.properties file as 5 secs; then the elected CC must successfully process at least 1 heartbeat every 40 secs or that node gets disconnected for lack of heartbeat. The node will initiate a reconnection once a heartbeat is received after having been disconnected for the above reason. Configuring a larger heartbeat interval helps avoid this disconnect/reconnect cycle by allowing more time before a heartbeat is considered lost. This allows more time if the node is going through a long GC pause or the CPU is so saturated it can't get a thread to create a heartbeat (see the nifi.properties snippet at the end of this reply).

I also recommend reading through this community article: https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-high/ta-p/244999

If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post. Thank you, Matt
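For reference, the heartbeat interval referred to above lives in nifi.properties; the value below is only illustrative, not a recommendation:

```properties
# Illustrative only: with an 8-interval tolerance, 5 sec means a node is
# disconnected after roughly 8 x 5 sec = 40 sec without a successfully
# processed heartbeat; raising it to 10 sec stretches that window to ~80 sec.
nifi.cluster.protocol.heartbeat.interval=5 sec
```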
10-19-2021
11:32 AM
@vikrant_kumar24 The ExecuteScript processor has been part of Apache NiFi for over 6 years. It has had many improvements and bug fixes over those years, just like many other well-used components. I'd be reluctant to call it "experimental" any longer, regardless of what the embedded Apache NiFi docs say. The only thing to note here is that the ExecuteScript processor does not really execute a "Python" script engine. It executes "Jython" instead, which is a Java implementation of Python. Jython is not 100% compatible with Python, so you must test your script thoroughly. Thanks, Matt
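If it helps as orientation, a minimal ExecuteScript body in Jython usually follows the pattern below; the attribute name is made up for the example, and `session`, `REL_SUCCESS`, and `log` are bindings the processor provides to the script:

```python
# Minimal illustrative Jython script for ExecuteScript.
# 'session', 'REL_SUCCESS', and 'log' come from the processor's script bindings.
flowFile = session.get()
if flowFile is not None:
    # trivial example of work: stamp an attribute on the FlowFile
    flowFile = session.putAttribute(flowFile, 'processed.by', 'jython')
    log.info('Processed FlowFile {}'.format(flowFile.getAttribute('filename')))
    session.transfer(flowFile, REL_SUCCESS)
```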
10-12-2021
07:24 AM
1 Kudo
@CodeLa Giving a detailed response on such a use case would be very difficult in the community. It would take considerable effort and time and would require you to provide a lot more detail, including sample source JSON files, schemas, etc. Cloudera offers professional services to its customers to help them with their use case solutions. If you have a support contract with Cloudera, please reach out to your account owner about this service. At a very high level, I would suggest you take a look at the PutDatabaseRecord processor and perhaps configure it to use one of the JSON record readers: JsonPathReader or JsonTreeReader. The processor would also need a DBCPConnectionPool controller service for connecting to your MySQL DB (a rough sketch of the relevant settings follows below). If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post. Thank you, Matt
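As a loose starting point only (property display names can differ between NiFi versions, and the table name is a placeholder), the pieces generally fit together like this:

```
# Illustrative PutDatabaseRecord configuration sketch
Record Reader                        -> JsonTreeReader (with a schema describing your source JSON)
Database Connection Pooling Service  -> DBCPConnectionPool (MySQL JDBC URL, driver class, credentials)
Statement Type                       -> INSERT
Table Name                           -> your_target_table   # placeholder
```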
10-08-2021
11:30 AM
1 Kudo
@Ankit13 The PutFile is going to execute based on its configured run schedule (Timer Driven execution) or cron schedule (Cron Driven execution). If you are adding an attribute to the FlowFile that you are using to evaluate a boolean true or false, the best approach is to add a RouteOnAttribute processor between your FetchFile and PutFile processors to redirect the FlowFiles where your condition does not resolve to "true". In this way you selectively decide which FlowFiles are passed on for the PutFile to act upon. As for the RouteOnAttribute, it has an "unmatched" relationship, and each dynamic property you add becomes a new relationship that can be associated with a different connection. You can use NiFi Expression Language (NEL) [1] to construct a boolean statement to evaluate your routing condition (a small example follows below). [1] https://nifi.apache.org/docs/nifi-docs/html/expression-language-guide.html If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post. Thank you, Matt
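For illustration only, with a made-up attribute name of `copy.allowed`, a dynamic property on RouteOnAttribute could look like this:

```
# Hypothetical RouteOnAttribute dynamic property
Property name  : allowed
Property value : ${copy.allowed:equals('true')}
# FlowFiles for which the expression evaluates to true route to the new "allowed"
# relationship (connect that to PutFile); everything else routes to "unmatched".
```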
10-08-2021
11:06 AM
@Ankit13 How do you know that no more files will be put in the source directory after the NiFi flow processing starts? To me it sounds like the PutFile should execute at the default 0 secs run schedule (as fast as it can) and you should instead control this dataflow at the beginning, where you consume the data. For example: in a 24-hour window, data is written to the source directory between 00:00:00 and 16:00:00, and you want to start writing that data to the target directory at 17:00. So you instead set up a cron schedule on a ListFile processor so it lists the files at 17:00 and 17:01, and then have a FetchFile and PutFile running all the time so they immediately consume the content for the listed files and write it to the target directory. Your ListFile then does not execute again until the same time the next day, or whatever your cron is (a sketch of such a cron expression follows below). This way the files are all listed at the same time and the PutFile can execute for as long as needed to write all those files to the target directory. Hope this helps, Matt
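For illustration, NiFi's Cron Driven scheduling uses a Quartz-style expression with a leading seconds field; expressions like the following would be one way to match the 17:00/17:01 example (treat them as a sketch and verify against your NiFi version's docs):

```
# Run Schedule for ListFile with Cron Driven scheduling
# (Quartz-style fields: sec min hour day-of-month month day-of-week)
0 0 17 * * ?      # once per day at 17:00:00
0 0,1 17 * * ?    # at 17:00:00 and again at 17:01:00
```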
10-07-2021
02:40 PM
@CodeLa @SAMSAL I want to point out that tracking timestamps will not always guarantee that NiFi consumes every file from the input directory; it depends on how the files are being placed in that directory. The ListFile processor looks at the last modified timestamp on each file. It lists all files newer than the last recorded timestamp stored in NiFi's state manager from the previous processor execution. On the first run there is no state, so everything currently present is listed. Now consider the scenarios below, which can prevent the above from listing all files:

1. The mechanism writing the files to that input directory does not update the last modified timestamp on a file once it is done writing to it. Say file 1 starts being written at 12:00:01.000 and file 2 starts being written at 12:00:01.300. File 2 completes first, is consumed by ListFile, and the stored state is updated to reflect 12:00:01.300. Now file 1 completes, but it is never consumed by ListFile since its last modified timestamp is older than file 2's. If you are in such a scenario, ListFile offers a different "Listing Strategy" called "Tracking Entities", which also tracks filenames in a cache service and so can still list files with an older timestamp.

2. ListFile may list the same file more than once. Say you tell ListFile to list files from /nifi/myfiles/. The mechanism writing these files updates the last modified timestamp as the file is being written, but does not use a ".<filename>" (dot-rename) approach (meaning the file is initially a hidden file until the write completes and is then renamed and made visible; the default ListFile config ignores hidden files). So when ListFile runs, it sees the file with the newer last modified timestamp and lists it. On the next execution it sees the same file again because its last modified timestamp was updated while it was still being written. If you are in such a scenario, you want to make use of the "Minimum File Age" property. This property tells ListFile to ignore any file whose last modified timestamp, compared to the current time, is not at least the configured amount of time old (in other words, the timestamp has not changed for that amount of time). The configured time is arbitrary: whatever length is needed for you to be confident the file write was complete.

Something else to consider applies if both of the following are true: 1. you are using a multi-node NiFi cluster, and 2. the configured directory you are listing from is mounted on every node. Since every node in a NiFi cluster executes the same dataflow, you want to avoid every node listing the same files. In this scenario you would change the "Execution" configuration from "All nodes" to "Primary node" on the ListFile and change "Input Directory Location" from "local" to "remote". Then you will want to set the "Load Balance Strategy" to "Round robin" on the connection between ListFile and FetchFile (a short summary sketch of these settings follows at the end of this reply). NOTE: never set the Execution on any processor that has an inbound connection to "Primary node". ONLY processors with no inbound connection should be considered for this execution configuration.

I know this is a lot to digest, but it is very important to be aware of to ensure success. If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post. Thank you, Matt
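To pull the above together, a rough configuration sketch for the shared-mount, multi-node case might look like this (property display names can vary slightly between NiFi versions, and the 30 sec age is only an example value):

```
# Illustrative ListFile settings
Listing Strategy          -> Tracking Entities   # if writers do not reliably bump last-modified
Minimum File Age          -> 30 sec              # example only; long enough to trust writes are done
Input Directory Location  -> Remote              # shared mount, not node-local storage
Execution (Scheduling)    -> Primary node        # ListFile only; never processors with inbound connections

# On the ListFile -> FetchFile connection
Load Balance Strategy     -> Round robin
```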
10-07-2021
02:06 PM
@Ronman I don't know anything about the cetic/helm-nifi image. I assume you are talking about this: https://github.com/cetic/helm-nifi

I started parsing through what is in the above GitHub repo, and the authorizers.xml that it builds looks poorly done. Can you share what is in the authorizers.xml file on your NiFi host? The image appears to create numerous providers that are not actually used, from what I can tell. It looks like it creates:
1. file-user-group-provider
2. ldap-user-group-provider (only if LDAP is enabled)
3. composite-configurable-user-group-provider (only if LDAP is enabled)
4. file-access-policy-provider (always points at the file-user-group-provider, which means 2 and 3 would never get used even if they were created)
5. managed-authorizer (points at the file-access-policy-provider)
6. file-provider (only if LDAP is enabled; this is a legacy provider and I am not sure why anyone would still use it. It can reference any of the above user-group providers)

So seeing what is actually written to that file would be helpful here (a minimal example of what a working file-based authorizers.xml looks like is sketched at the end of this reply). Also, on startup the authorizers.xml is responsible for seeding some initial policies for the admin user in the users.xml and authorizations.xml files. This includes the initial set of policies for the root PG. That seeding will not happen if, upon first launch of NiFi, there was no flow.xml.gz yet and thus no root PG UUID to seed policies against. So you may want to rename your existing authorizations.xml file and restart your NiFi so that a new one is generated, since you have a flow.xml.gz now, and see if that gives you the policies you need to start editing the canvas. But even if the above works, I still think you have an issue within your authorizers.xml configuration.

If you found this response assisted with your query, please take a moment to login and click on "Accept as Solution" below this post. Thank you, Matt
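For comparison only, here is a minimal sketch of a purely file-based authorizers.xml of the kind a default secured NiFi install starts from; the identities are placeholders and your own file will differ, especially with LDAP in the mix:

```xml
<!-- Minimal illustrative file-based authorizers.xml; identities are placeholders -->
<authorizers>
    <userGroupProvider>
        <identifier>file-user-group-provider</identifier>
        <class>org.apache.nifi.authorization.FileUserGroupProvider</class>
        <property name="Users File">./conf/users.xml</property>
        <property name="Initial User Identity 1">CN=admin, OU=NIFI</property>
    </userGroupProvider>
    <accessPolicyProvider>
        <identifier>file-access-policy-provider</identifier>
        <class>org.apache.nifi.authorization.FileAccessPolicyProvider</class>
        <property name="User Group Provider">file-user-group-provider</property>
        <property name="Authorizations File">./conf/authorizations.xml</property>
        <property name="Initial Admin Identity">CN=admin, OU=NIFI</property>
        <property name="Node Identity 1">CN=nifi-node-1, OU=NIFI</property>
    </accessPolicyProvider>
    <authorizer>
        <identifier>managed-authorizer</identifier>
        <class>org.apache.nifi.authorization.StandardManagedAuthorizer</class>
        <property name="Access Policy Provider">file-access-policy-provider</property>
    </authorizer>
</authorizers>
```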
10-07-2021
12:34 PM
@SAMSAL If it works initially and then begins to fail, I would start by taking a look at what is being reported on the Azure side at the time of each exception. Is there more detail in the nifi-app.log that goes along with the exception you shared? Cloudera does not have any official support releases built off of the Apache NiFi 1.14 release yet. I only see one change reported in the Apache NiFi Jira related to this reporting task that happened in a release later than the 1.11.x you reported no issues with: https://issues.apache.org/jira/browse/NIFI-6977 Nothing in that Jira, which went into Apache NiFi 1.12, leads me to think it would result in the issue you are seeing.

I also see you mentioned a couple of property groups from the nifi.properties file:
nifi.web.https.(host, port, port.forwarding, ciphersuites.include, ciphersuites.exclude, network.interface*)? These properties are specific to NiFi's web interface and would not affect the functionality of this reporting task.
nifi.remote.input.(host, secure, socket.port, http.enabled, http.transaction.ttl)? These properties are used for NiFi's Site-To-Site capability. They also would not be utilized by this particular reporting task.

Do you see the same issues using Apache NiFi 1.13.2? Thank you, Matt
10-07-2021
05:54 AM
@Ronman Please share the version of Apache NiFi you have installed. Thanks, Matt