About MattWho

MattWho · ‎09-12-2018

# Processor Run Duration

Some processors support configuring a run duration. This setting tells a processor to keep using the same task to work on as many FlowFiles (or batches of FlowFiles) from an incoming queue as it can. This is ideal for processors where the individual tasks complete very quickly and the volume of FlowFiles is large.

In the above example, the exact same feed of FlowFiles was passed to both of these processors, which are configured to perform the same attribute update. Both processed the same number of FlowFiles in the past 5 minutes; however, the processor configured with a run duration consumed less overall CPU time to do so.

Not all processors support setting a run duration. The nature of the processor function, the methods being used, and/or the client library used may not support this capability. You will not be able to set a run duration on such processors.

How this works:

1. The processor has a thread assigned to its task.
2. The processor grabs the highest-priority FlowFile (or batch of FlowFiles) from the “active queue” of the incoming connection.
3. If processing of the FlowFile(s) does not exceed the configured run duration, another FlowFile (or FlowFile batch) is pulled from the active queue.
4. This process continues, all under that same thread, until the run duration has been reached or the “active queue” is empty.
5. At that time the session is completed and all outbound FlowFiles are committed at once to the appropriate relationship.

Since no FlowFiles are committed until the entire run completes, some latency is introduced on the FlowFiles. Your configured run duration dictates how much latency will occur at a minimum. If the execution of the processor against a FlowFile takes longer than the configured run duration, there is no added benefit to adjusting this configuration.

What this means for heap usage:

Since the processor is only working on incoming FlowFiles in the “active queue”, there is no added heap pressure on the input side (FlowFiles in the “active queue” are already in heap space). The FlowFiles being generated (if any, depending on processor function) are all held in heap until the final commit. This may introduce some additional heap pressure versus not using a run duration, since all of those newly generated FlowFiles will be held in heap until they are committed to an output relationship at the end of the run duration.
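The "How this works" steps above can be sketched as a small simulation. This is only an illustration under stated assumptions, not NiFi's actual scheduler code: `run_task`, `tasks_needed`, and the fixed per-FlowFile cost `cost_ms` are hypothetical names invented for the example.

```python
from collections import deque


def run_task(active_queue, process, run_duration_ms, cost_ms):
    """Simulate one task: keep pulling FlowFiles until the run duration is
    reached or the active queue is empty, then commit everything at once."""
    elapsed_ms = 0
    pending = []  # outbound FlowFiles held in memory until the final commit
    while active_queue:
        flowfile = active_queue.popleft()
        pending.append(process(flowfile))
        elapsed_ms += cost_ms  # assumed fixed processing cost per FlowFile
        if elapsed_ms >= run_duration_ms:
            break
    return pending  # the single commit at the end of the run


def tasks_needed(flowfile_count, run_duration_ms, cost_ms):
    """Count how many tasks it takes to drain a queue of flowfile_count items."""
    queue = deque(range(flowfile_count))
    tasks = 0
    committed = []
    while queue:
        committed.extend(run_task(queue, lambda ff: ff, run_duration_ms, cost_ms))
        tasks += 1
    return tasks, len(committed)


print(tasks_needed(1000, 0, 1))   # (1000, 1000): one FlowFile per task
print(tasks_needed(1000, 25, 1))  # (40, 1000): 25 FlowFiles per task
print(tasks_needed(10, 25, 50))   # (10, 10): cost exceeds run duration, no benefit
```

The `pending` list is where the extra heap pressure described above would come from: nothing is released downstream until the task returns.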

MattWho · ‎08-06-2018

@mojgan ghasemi

I recommend starting a new question for this. This question was originally about TailFile and splitting files. It is best to keep one question per HCC post.

Thank you, Matt

MattWho · ‎08-02-2018

Have you ever noticed some lingering old rolled log files in your NiFi logs directory that never seem to get deleted? This is a by-product of how logback works, depending on how you have it configured.

Let's take a look at a default logback.xml configuration from NiFi:

    <appender name="APP_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
            <maxFileSize>100MB</maxFileSize>
            <maxHistory>30</maxHistory>
        </rollingPolicy>
        <immediateFlush>true</immediateFlush>
        <encoder>
            <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
        </encoder>
    </appender>

The above app log configuration will log to a file named nifi-app.log. Once that file either reaches 100 MB in size or crosses the top of the hour, it will be rolled. You may end up with numerous log files within a single hour if there is an excessive amount of logging occurring in your NiFi.

A "maxHistory" of 30 means that the logger will only keep 30 hours (HH) of rolled logs. But that is not the full story of how logback works. This setting controls not only the number of hours to keep but also the maximum age of logs that are evaluated for deletion. So any log files left around that are more than 30 hours old are ignored when the deletion thread runs.

This naturally raises the question: how did these files get left behind in the first place? Typically this occurs when a file passes, say, 30 hours in age while the application is stopped. When the application is restarted, those older files end up getting ignored. While the application is continuously running, this works as one would normally expect.

To simply clean up these older rolled log files, you could run a touch command on them so that their system file timestamp updates and they are no longer more than 30 hours old. They will then be considered within the 30-hour window and be deleted once the "maxHistory" count reaches 30.

However, the above is not a permanent solution. I recommend instead controlling file deletion with the "totalSizeCap" setting (commented out by default in the NiFi logback.xml). It offers a couple of advantages:

1. The "%i" option in the fileNamePattern says to create sequentially numbered log files every "maxFileSize" (100 MB) within each hour. This helps prevent any one log from getting too large, but has the downside that "maxHistory" does not count these files individually. So "maxHistory" set to 15 is 15 hours of logs even if each hour contains 2000 100 MB log files. You can see that under heavy logging you can end up using a lot of log space.
2. "totalSizeCap" will start deleting old rolled log files as long as the log file date is less than the "maxHistory" age. So let's say we want to retain up to 100 GB of log history. We would set "maxHistory" to some very large value like 8760 (~1 year of hours) and set "totalSizeCap" to 100GB. This works provided you hit 100 GB before you hit 8,760 hours.

Here is an example configuration:

    <appender name="APP_FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
            <fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
            <maxFileSize>10MB</maxFileSize>
            <maxHistory>8760</maxHistory>
            <totalSizeCap>100GB</totalSizeCap>
            <cleanHistoryOnStart>true</cleanHistoryOnStart>
        </rollingPolicy>
        <immediateFlush>true</immediateFlush>
        <encoder>
            <pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
        </encoder>
    </appender>

Of course, there is always a chance you could hit 8,760 hours' worth of logs before reaching 100 GB of generated app logs, so you may need to tailor these settings based on the app log sizes being generated by your particular running NiFi.
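As a rough aid for the one-time touch workaround described above, the following sketch lists rolled app logs whose file timestamp is older than the "maxHistory" window and can then touch them. It assumes, as the article describes, that the file's system modification time is what matters; `find_stale_rolled_logs` and `touch_files` are hypothetical helper names, and the glob simply follows the `nifi-app_%d{yyyy-MM-dd_HH}.%i.log` naming pattern.

```python
import time
from pathlib import Path


def find_stale_rolled_logs(log_dir, max_history_hours, now=None):
    """Return rolled nifi-app logs whose modification time is older than
    max_history_hours, oldest name first."""
    now = time.time() if now is None else now
    cutoff = now - max_history_hours * 3600
    stale = []
    # Matches rolled names such as nifi-app_2018-08-02_13.0.log,
    # but not the active nifi-app.log (no underscore after "nifi-app").
    for path in sorted(Path(log_dir).glob("nifi-app_*.log")):
        if path.stat().st_mtime < cutoff:
            stale.append(path)
    return stale


def touch_files(paths):
    """Update each file's timestamp to the current time, like `touch`."""
    for path in paths:
        path.touch()
```

For example, `touch_files(find_stale_rolled_logs("/path/to/nifi/logs", 30))` would refresh the stale files, where the directory shown is a placeholder for your own NiFi logs directory. Review the returned list before touching anything.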

MattWho · ‎08-01-2018

@Romain Guay

I am not sure I am following your comments completely. Keep in mind that this article was written against Apache NiFi 0.x versions. The look of the UI and some of the configuration/capabilities relevant to RPGs have changed as of Apache NiFi 1.x.

When you say "source NiFi", are you referring to the NiFi instance with the RPG or the NiFi instance with an input or output port?

Keep in mind the following:

1. The NiFi with the RPG on the canvas is always acting as the client. It will establish the connection to the target instance/cluster.
2. An RPG added to the canvas of a NiFi cluster runs on every node in that cluster with no regard for any other node in the cluster.
3. An RPG regularly connects to the target NiFi cluster to retrieve S2S details, which include the number of nodes, load on nodes, available remote input/output ports, etc. (Even if the URL provided in the RPG is that of a single node in the target cluster, the details collected will be for all nodes in the target cluster.)
4. A node distribution strategy is calculated based on the details collected.

When actually sending FlowFiles to a remote input port on a target NiFi instance/cluster, the number of FlowFiles sent is based on the port properties configured in the RPG. So it may be the case that those settings are at their defaults, so FlowFiles are not load-balanced very well.

When actually retrieving FlowFiles from a remote output port on a target NiFi instance/cluster, the RPG will round-robin the nodes in the target NiFi, pulling FlowFiles from the remote output port based on the port configuration properties in the RPG. So it may be that one source node has an RPG that runs before the others, connects, and is allocated all FlowFiles on the target remote output port before any other node in the source NiFi cluster runs. There are some limitations in load-balancing using such a get/pull setup.

For more info on configuring your remote ports via the RPG, see the following article: https://community.hortonworks.com/content/kbentry/109629/how-to-achieve-better-load-balancing-using-nifis-s.html

*** The above article is based on Apache NiFi 1.2+ versions of the RPG.

Thanks, Matt

MattWho · ‎07-25-2018

@Mohammad Soori

*** Forum tip: Please try to avoid responding to an answer by starting a new answer. Instead use "add comment" to respond to an existing answer. There is no guaranteed order to different answers, which can make following a response thread difficult, especially when multiple people are trying to assist you.

Based on what you are showing me, your flow is working as designed. Since you have added all three outgoing relationships to the outgoing connection of the SplitText processor, you would end up with duplication.

The "original" relationship is basically a passthrough for the incoming FlowFiles to SplitText. This relationship is often auto-terminated unless you need to keep the original un-split FlowFiles for something else in your flow. In that case the "original" relationship would be routed within its own outbound connection and not in the same connection as "splits".

The fact that SplitText is not really splitting your source FlowFiles (4 in and 4 out) tells me that the 4 source FlowFiles created do not contain any line returns on which to split the text. So the question is: what does the content of one of these ~115 byte FlowFiles look like?

I also do not recommend routing the "failure" relationship along with "success" or "original" in the same connection. Should a failure occur, how would you easily separate what failed from what was successful?

Thank you, Matt
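To illustrate the point about line returns, here is a rough stand-in for line-based splitting. It is not SplitText's implementation; `split_by_line_count` and its `line_split_count` parameter are hypothetical names, and the sample content is made up. Content with no line endings comes back as a single piece, which matches the "4 in and 4 out" behaviour described above.

```python
def split_by_line_count(content, line_split_count=1):
    """Group the lines of content into chunks of line_split_count lines."""
    lines = content.splitlines(keepends=True)
    return [
        "".join(lines[i:i + line_split_count])
        for i in range(0, len(lines), line_split_count)
    ]


# Made-up sample content for illustration only.
no_newlines = "id=1,name=a;id=2,name=b;id=3,name=c"
with_newlines = "id=1,name=a\nid=2,name=b\nid=3,name=c\n"

print(len(split_by_line_count(no_newlines)))    # 1: nothing to split on
print(len(split_by_line_count(with_newlines)))  # 3: one piece per line
```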

MattWho · ‎07-17-2018

@Saikrishna Tarapareddy

The only NiFi configuration file you can edit that will take effect without requiring a NiFi restart is the logback.xml file.

As for what is an acceptable search base, it is best to test your search base on the command line using ldapsearch. If it doesn't work there, it will not work in NiFi either.

Thank you, Matt

If you found this answer addressed your original question, please take a moment to log in and click "Accept" below the answer.

MattWho · ‎07-16-2018

@Tommy

*** Forum tip: Please try to avoid responding to an answer by starting a new answer. Instead use "add comment" to respond to an existing answer. There is no guaranteed order to different answers, which can make following a response thread difficult, especially when multiple people are trying to assist you. Also use the @username when replying to make sure the user gets notified about your response.

You need to answer the question: what links these two FlowFiles to one another?

Since you are evaluating FlowFiles in pairs, what if you get two files of type A? How would the processor know which of the two type A files that already arrived a type B file belongs with?

If this is not a concern, you could use a simple wait/notify flow as described here to accomplish this: https://gist.github.com/ijokarumawak/375915c45071c7cbfddd34d5032c8e90

Thanks, Matt

MattWho · ‎07-16-2018

@mark juchems

The ConsumeAzureEventHub processor was developed in the Apache community. From your description I did not realize it was growing nonstop. It sounds like it was written in such a way that it gets a thread upon initial execution and never releases that thread. If that is the case, it will continue to produce FlowFiles to the output queue regardless of the configured back pressure thresholds.

My suggestion would be to open an Apache Jira against that processor explaining the issue it is having and sharing your processor configuration.

Thank you, Matt

MattWho · ‎07-16-2018

@Mark Lin @mark juchems

The configurable back pressure thresholds (object and size) on a connection are soft limits. So a back pressure object threshold of 10,000 (default) means that the NiFi controller will not schedule the feeding processor to run if the object count has reached or exceeded 10,000 queued FlowFiles.

So let's say there are 9,999 queued objects. NiFi would allow the preceding processor to get scheduled. When that processor executes its code, it will execute with no regard for destination queue sizes. That means if the execution of that processor thread results in 1,000,000 FlowFiles being produced in a single execution, all 1,000,000 FlowFiles will be added to that downstream connection queue. Now that the queue has 1,009,999 FlowFiles queued, the preceding processor will not be scheduled again until that queue drops below 10,000 again.

The same soft limit concept applies to the back pressure size threshold setting on a connection.

Thank you, Matt

When an "Answer" addresses/solves your question, please select "Accept" beneath that answer. This encourages user participation in this forum.
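The soft-limit behaviour described above can be modelled in a few lines. This is a toy sketch, not NiFi code: the `Connection` class and `try_run` function are hypothetical names, and the numbers are the ones from the example in the post.

```python
class Connection:
    """Toy model of a connection with a back pressure object threshold."""

    def __init__(self, object_threshold=10_000):
        self.object_threshold = object_threshold
        self.queued = 0

    def backpressure_applied(self):
        return self.queued >= self.object_threshold


def try_run(connection, flowfiles_per_execution):
    """The threshold is checked before the run starts, never during it."""
    if connection.backpressure_applied():
        return False  # the feeding processor is not scheduled
    # Once running, output is added with no regard for the queue size.
    connection.queued += flowfiles_per_execution
    return True


conn = Connection()
conn.queued = 9_999
print(try_run(conn, 1_000_000))  # True: 9,999 is below the threshold
print(conn.queued)               # 1009999
print(try_run(conn, 1_000_000))  # False: blocked until the queue drops below 10,000
```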

MattWho · ‎07-13-2018

@Nikhil

What I was getting at was that the authentication methods are different here.

I am assuming your users who access the NiFi UI via the load balancer are using a user/password authentication method? That method results in a token being issued to the authenticated user, which is then passed by the client in every subsequent request to the NiFi API.

With Site-To-Site, there are no tokens involved in the authentication process, since certificate authentication occurs via two-way TLS in every single REST API call.

Admittedly, I know nothing about your specific LB or how it is configured, so these are just suggested things to consider.

I also want to let you know that you must be running an older HDF version. Newer versions support editing the URL string without needing to recreate the RPG.

Thank you, Matt

Last Visited	‎07-09-2026 11:37 AM

Member Since	‎07-30-2019 10:41 AM
Posts	3,472
Kudos received	1638

Cloudera Community

Re: ListenNetFlow processor does not decode Cisco ...

Re: Can we detect who did a particular operation i...

Re: How to invoke a url in nifi which is protected...

Re: Retry impacts scheduler

Re: 503 error while copying/versioning big process...

Understanding NiFi processor's "Run Duration" func...

Re: Nifi TailFile Processor does not detect vast i...

Understanding how the logback.xml configuration in...

Re: NiFi - Understanding how to use Process Groups...

Re: Nifi TailFile Processor does not detect vast i...

Re: Do we need to restart after a change to LDAP s...

Re: How to use Notify/Wait in PutFile in two diffe...

Re: nifi back pressure threshholds

Re: nifi back pressure threshholds

Re: Unable to autheticate to NIFI API using loadba...