Member since: 07-30-2019 | Posts: 3466 | Kudos Received: 1641 | Solutions: 1015
12-07-2018
01:12 PM
@Nicolas Osorio You are correct. I wrote this article some time ago. The less secure aes128 is not going to be accepted by newer versions of browsers and NiFi. Switching to the more secure aes256 will resolve the issue.
11-16-2018
01:06 PM
Article content updated to reflect new provenance implementation recommendation and change in JVM Garbage Collector recommendation.
11-13-2018
06:50 PM
@Félicien Catherin Your observations are correct. The prioritizer only works against the FlowFiles currently in the "Active" queue. Because this active queue resides in JVM heap, the reordering based on priority comes at very little cost to performance. Re-evaluating all swapped FlowFiles each time a new FlowFile enters a connection would hurt throughput performance.

- The first question I would ask is why the queue is so large.
- Increasing the configured swap threshold in the nifi.properties file will allow more FlowFiles to be held in the active queue, but that comes at the expense of higher heap usage by your NiFi.
- Keep in mind that the best throughput will always be obtained when no prioritizers are defined on a connection.
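As a rough sketch of raising that swap threshold, the snippet below edits a scratch copy of the file (the path, the 20000 default, and the new value are assumptions to verify against your own install):

```shell
# Demo against a scratch file standing in for conf/nifi.properties.
conf=nifi.properties.demo
printf 'nifi.queue.swap.threshold=20000\n' > "$conf"   # assumed default value
# Raise the threshold so more FlowFiles stay in the heap-resident active queue
# (at the cost of higher heap usage on this node):
sed -i 's/^nifi.queue.swap.threshold=.*/nifi.queue.swap.threshold=40000/' "$conf"
grep 'nifi.queue.swap.threshold' "$conf"
```

A NiFi restart is required for nifi.properties changes to take effect, and in a cluster the edit must be made on every node.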
11-12-2018
06:08 PM
3 Kudos
NiFi Restricted components are those processors, controller services, or reporting tasks that have the ability to run user-defined code or access/alter localhost filesystem data.

The NiFi User Guide explains this as follows:

-----------------------------------------
Restricted components will be marked with an icon next to their name. These are components that can be used to execute arbitrary unsanitized code provided by the operator through the NiFi REST API/UI or can be used to obtain or alter data on the NiFi host system using the NiFi OS credentials. These components could be used by an otherwise authorized NiFi user to go beyond the intended use of the application, escalate privilege, or could expose data about the internals of the NiFi process or the host system. All of these capabilities should be considered privileged, and admins should be aware of these capabilities and explicitly enable them for a subset of trusted users. Before a user is allowed to create and modify restricted components they must be granted access.
------------------------------------------

Users can only be restricted from adding such components if NiFi is secured. Users of an unsecured NiFi will always have access to all components.

Prior to HDF 3.2 or Apache NiFi 1.6, all restricted components were covered by a single authorization policy:

| Ranger Policy (Base policies) | NiFi Policies (Hamburger menu) | Ranger permissions description |
|---|---|---|
| /restricted-components | Access restricted components | Read/View - N/A. Write/Modify - Gives granted users the ability to add components to the canvas that are tagged as "restricted". |

It was decided that lumping all components into one policy was not ideal, so NIFI-4885 was created so that users' access to restricted components would be based on the level of restricted access being granted. The sub-policies are:

- read-filesystem
- read-distributed-filesystem
- write-filesystem
- write-distributed-filesystem
- execute-code
- access-keytab
- export-nifi-details

In order to avoid backward compatibility issues when users upgrade to HDF 3.2+ or Apache NiFi 1.6.0+, the "Access restricted components" base policy still exists and defaults to "regardless of restrictions". In the NiFi global "Access Policies" UI, this is the default policy. In Ranger, it is still associated with just the "/restricted-components" policy. The new sub-policies are depicted as follows in the Ranger and NiFi UIs:

| Ranger Policy (Base policies) | NiFi Policies (Hamburger menu) | Ranger permissions description |
|---|---|---|
| /restricted-components/read-filesystem | Access restricted components - Sub policy: Requiring 'read filesystem' | Read/View - N/A. Write/Modify - Allows users to create/modify restricted components requiring read filesystem. |
| /restricted-components/read-distributed-filesystem | Access restricted components - Sub policy: Requiring 'read distributed filesystem' | Read/View - N/A. Write/Modify - Allows users to create/modify restricted components requiring read distributed filesystem. |
| /restricted-components/write-filesystem | Access restricted components - Sub policy: Requiring 'write filesystem' | Read/View - N/A. Write/Modify - Allows users to create/modify restricted components requiring write filesystem. |
| /restricted-components/write-distributed-filesystem | Access restricted components - Sub policy: Requiring 'write distributed filesystem' | Read/View - N/A. Write/Modify - Allows users to create/modify restricted components requiring write distributed filesystem. |
| /restricted-components/execute-code | Access restricted components - Sub policy: Requiring 'execute code' | Read/View - N/A. Write/Modify - Allows users to create/modify restricted components requiring execute code. |
| /restricted-components/access-keytab | Access restricted components - Sub policy: Requiring 'access keytab' | Read/View - N/A. Write/Modify - Allows users to create/modify restricted components requiring access keytab. |
| /restricted-components/export-nifi-details | Access restricted components - Sub policy: Requiring 'export nifi details' | Read/View - N/A. Write/Modify - Allows users to create/modify restricted components requiring export nifi details. |

Below is a list of restricted components for each of the above sub-policies (current as of CFM 2.1.1 and Apache NiFi 1.13):

read-filesystem:

| NiFi component | Component type | Access provisions |
|---|---|---|
| FetchFile | Processor | Provides operator the ability to read from any file that NiFi has access to. |
| GetFile | Processor | Provides operator the ability to read from any file that NiFi has access to. |
| TailFile | Processor | Provides operator the ability to read from any file that NiFi has access to. |

read-distributed-filesystem (added NiFi 1.13):

| NiFi component | Component type | Access provisions |
|---|---|---|
| FetchHDFS | Processor | Provides operator the ability to retrieve any file that NiFi has access to in HDFS or the local filesystem. |
| FetchParquet | Processor | Provides operator the ability to retrieve any file that NiFi has access to in HDFS or the local filesystem. |
| GetHDFS | Processor | Provides operator the ability to retrieve any file that NiFi has access to in HDFS or the local filesystem. |
| GetHDFSSequenceFile | Processor | Provides operator the ability to retrieve any file that NiFi has access to in HDFS or the local filesystem. |
| MoveHDFS | Processor | Provides operator the ability to retrieve any file that NiFi has access to in HDFS or the local filesystem. |

write-filesystem:

| NiFi component | Component type | Access provisions |
|---|---|---|
| FetchFile | Processor | Provides operator the ability to delete any file that NiFi has access to. |
| GetFile | Processor | Provides operator the ability to delete any file that NiFi has access to. |
| PutFile | Processor | Provides operator the ability to write to any file that NiFi has access to. |

write-distributed-filesystem (added NiFi 1.13):

| NiFi component | Component type | Access provisions |
|---|---|---|
| DeleteHDFS | Processor | Provides operator the ability to delete any file that NiFi has access to in HDFS or the local filesystem. |
| GetHDFS | Processor | Provides operator the ability to delete any file that NiFi has access to in HDFS or the local filesystem. |
| GetHDFSSequenceFile | Processor | Provides operator the ability to delete any file that NiFi has access to in HDFS or the local filesystem. |
| MoveHDFS | Processor | Provides operator the ability to delete any file that NiFi has access to in HDFS or the local filesystem. |
| PutHDFS | Processor | Provides operator the ability to delete any file that NiFi has access to in HDFS or the local filesystem. |
| PutParquet | Processor | Provides operator the ability to write any file that NiFi has access to in HDFS or the local filesystem. |

execute-code:

| NiFi component | Component type | Access provisions |
|---|---|---|
| ScriptedReportingTask | Reporting Task | Provides operator the ability to execute arbitrary code assuming all permissions that NiFi has. |
| ScriptedLookupService | Controller Service | Provides operator the ability to execute arbitrary code assuming all permissions that NiFi has. |
| ScriptedReader | Controller Service | Provides operator the ability to execute arbitrary code assuming all permissions that NiFi has. |
| ScriptedRecordSetWriter | Controller Service | Provides operator the ability to execute arbitrary code assuming all permissions that NiFi has. |
| ExecuteFlumeSink | Processor | Provides operator the ability to execute arbitrary Flume configurations assuming all permissions that NiFi has. |
| ExecuteFlumeSource | Processor | Provides operator the ability to execute arbitrary Flume configurations assuming all permissions that NiFi has. |
| ExecuteGroovyScript | Processor | Provides operator the ability to execute arbitrary code assuming all permissions that NiFi has. |
| ExecuteProcess | Processor | Provides operator the ability to execute arbitrary code assuming all permissions that NiFi has. |
| ExecuteScript | Processor | Provides operator the ability to execute arbitrary code assuming all permissions that NiFi has. |
| ExecuteStreamCommand | Processor | Provides operator the ability to execute arbitrary code assuming all permissions that NiFi has. |
| InvokeScriptedProcessor | Processor | Provides operator the ability to execute arbitrary code assuming all permissions that NiFi has. |

access-keytab:

| NiFi component | Component type | Access provisions |
|---|---|---|
| KeytabCredentialsService | Controller Service | Allows user to define a keytab and principal that can then be used by other components. |

export-nifi-details:

| NiFi component | Component type | Access provisions |
|---|---|---|
| SiteToSiteBulletinReportingTask | Reporting Task | Provides operator the ability to send sensitive details contained in bulletin events to any external system. |
| SiteToSiteProvenanceReportingTask | Reporting Task | Provides operator the ability to send sensitive details contained in provenance events to any external system. |

***Note: Some components may be found under multiple sub-policies above. In order for a user to utilize such a component, they must be granted access to every sub-policy required by that component.

Exceptions in HDF 3.2 and Apache NiFi 1.7 and 1.8: In order to use the following components, users must have full access to all restricted components policies:

| NiFi component | Component type | Access provisions |
|---|---|---|
| PutORC | Processor | This component requires access to restricted components regardless of restriction. Apache Jira: NIFI-5815 |

A full breakdown of all other NiFi policies can be found here: NiFi Ranger based policy descriptions - Cloudera Community
10-31-2018
01:07 PM
@sri chaturvedi That specific property controls how long NiFi will potentially hold on to content claims that have been successfully moved to archive. It works in conjunction with the "nifi.content.repository.archive.max.usage.percentage=50%" property.

Neither of these properties will result in the clean-up/removal of any content claims that have not been moved to archive. If the disk usage of the content repository disk has exceeded 50%, the "archive" sub-directories in the content repo should all be empty.

This does not mean that the active content claims will not continue to increase in number until the content repo disk is 100% full.

Nothing about how the content repository works will have any impact on NiFi UI performance. A full content repository will, however, impact the overall performance of your NiFi dataflows.

Hope this answers your question.
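For reference, the two interacting properties look like this in nifi.properties (the values shown are the commonly shipped defaults; verify against your own install):

```properties
# How long archived claims are retained before deletion; whichever of the two
# limits below is reached first triggers clean-up of archived claims only:
nifi.content.repository.archive.max.retention.period=12 hours
# Archived claims are purged once content repo disk usage exceeds this value:
nifi.content.repository.archive.max.usage.percentage=50%
```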
10-31-2018
12:55 PM
7 Kudos
NiFi's content repository will hold on to a claim until no active FlowFiles anywhere on the canvas reference that claim. This can result in very old or very large claims being left in the content repo using up space. Something as small as a single 0-byte FlowFile sitting in some queue may end up preventing a multi-gigabyte content claim from being removed. The intent of this article is to help users understand how to find those active FlowFiles and clear them from your dataflow.

There is no simple UI feature or NiFi rest-api endpoint users can use to return information linking FlowFiles to existing content claims; however, all is not lost. Users can add an additional indexed field to your provenance configuration so that it starts indexing the "ContentClaimIdentifier" associated with each FlowFile event generated. This gives you the ability to use provenance to identify all the FlowFiles associated with a specific claim in the content repo.

To add this, simply add "ContentClaimIdentifier" to the list of existing indexed fields via the nifi.properties file. Here is the specific property you will be editing:

nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID, Relationship, ContentClaimIdentifier

NOTE: A restart of NiFi is needed before indexing of this new field will begin. NiFi will not go back and re-index existing events; it will only start adding this new indexed field to events created from this point forward.

You can then search your content repo for any content claim files that are of concern (for example, any very old or very large claims). Simply copy the claim number (the filename of the claim) and search for it via a provenance query.
Once you have your list of FlowFile events all tied to the same claim, you then need to look at the lineage of each of those FlowFiles to see if any of them made it to a DROP event (a DROP event means the FlowFile is no longer anywhere in your dataflows). Lineage can be displayed by clicking the "show lineage" icon to the right of any event in the list.

Once you have found one or more with no DROP event, get the details of the last event by right-clicking on it. From those details you can collect the Component ID of the processor that produced that event.

Go back to your main canvas and search on that component ID. Your FlowFile will be located in one of the outbound relationships of that component.

In my example, the FlowFile was found in the "success" relationship of an UpdateAttribute processor. You will also notice that this connection contains many FlowFiles. If I only want to purge this one FlowFile, I need to insert a RouteOnAttribute processor into this flow, configured to route only the FlowFile with a specific FlowFile UUID (which I could also get from my provenance event details above).

The added RouteOnAttribute processor property would be as simple as:

Property: purge
Value: ${uuid:equals('8297d10c-f6ca-4843-9593-320e5b265dd6')}

"purge" becomes a new relationship which you can then auto-terminate. The "unmatched" relationship will be routed on in your flow and will contain every other FlowFile from this connection.

*** Please feel free to post comments to this article if you have questions or see anything that needs to be added/clarified.

Thank you, Matt
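To surface candidate claims in the first place, a quick filesystem scan of the content repository works. The sketch below runs against a scratch directory standing in for the repo (the path, sizes, and claim filenames are made up for the demo):

```shell
# Scratch stand-in for NiFi's content_repository directory.
repo=./content_repository.demo
mkdir -p "$repo/0"
truncate -s 2M "$repo/0/1538000000-1"   # fake "large" claim for the demo
truncate -s 1K "$repo/0/1538000000-2"   # fake small claim
# List claims over a size threshold, largest first. Against a real repo you
# would raise the threshold (e.g. -size +1G) and/or add -mtime +7 for old claims.
find "$repo" -type f -size +1M -printf '%s %p\n' | sort -rn
```

The filename printed (the claim identifier) is what you would then paste into the provenance query described above.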
09-12-2018
12:50 PM
4 Kudos
# Max Timer Driven Thread Count and Max Event Driven Thread Count:

Out of the box, NiFi sets the Max Timer Driven Thread Count relatively low to support operating on the simplest of hardware. This default setting can limit the performance of very large, high-volume dataflows that must perform a lot of concurrent processing. General guidance for setting this value is 2 - 4 times the number of cores available on the hardware on which the NiFi service is running. With a NiFi cluster where each server has different hardware (not recommended), this should be set to the highest value supportable by the server with the fewest cores.

NOTE: Remember that all configurations you apply within the NiFi UI are applied to every node in a NiFi cluster. None of the settings apply as a total to the cluster itself.

NOTE: The cluster UI can be used to see how the total active threads are being used per node. Closely monitoring system CPU usage over time on each of your cluster nodes will help you identify regular or routine spikes in usage. This information will help you decide whether you can increase the "Maximum Timer Driven Thread Count" setting even higher. Just arbitrarily setting this value higher can lead to threads spending excessive time in CPU wait and not really doing any work. This can show up as long task times reported in processors.

*** The Event Driven scheduling strategy is considered experimental, so we do not recommend using it at all. Users should only configure their NiFi processors to use one of the timer-based scheduling strategies (Timer Driven or CRON Driven).

# Assigning Concurrent Tasks to processor components:

Concurrent task settings on processors should always start at the default of 1 and only be incremented slowly as needed. Assigning too many concurrent tasks to every processor can have an effect on other dataflows/processors.
Because of how the above works, it may appear to a user that they get better performance out of a processor by simply setting a high number of concurrent tasks. What they are really doing is stacking up more requests in that large queue so the processor gets more opportunities to grab one of the available threads from the resource pool. What often happens is that users with a processor running only 1 concurrent task are affected (simply because of the size of the request queue), so those users increase their concurrent tasks as well. Before you know it, the request queue is so large that no one is benefiting from assigning additional concurrent tasks.

In addition, you may have processors that inherently have long-running tasks. Assigning these processors lots of concurrent tasks can mean a substantial chunk of that thread pool is in use for an extended amount of time. This then limits the number of available threads from the pool for the remaining tasks in the queue.
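The 2 - 4x cores rule of thumb above is easy to compute on each node. A minimal sketch (assuming a Linux host, since `nproc` is Linux-specific):

```shell
# Suggested starting range for Max Timer Driven Thread Count on this node,
# per the 2-4x cores guidance above.
cores=$(nproc)
low=$((cores * 2))
high=$((cores * 4))
echo "cores=$cores suggested Max Timer Driven Thread Count: $low-$high"
```

In a cluster, run this on the node with the fewest cores, since the UI setting applies to every node.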
09-12-2018
12:30 PM
7 Kudos
# Processor Run Duration:

Some processors support configuring a run duration. This setting tells a processor to continue to use the same task to work on as many FlowFiles (or batches of FlowFiles) from an incoming queue as it can within that duration. This is ideal for processors where the individual tasks themselves complete very fast and the volume of FlowFiles is large.

In the example above, the exact same feed of FlowFiles was passed to two processors configured to perform the same attribute update. Both processed the same number of FlowFiles in the past 5 minutes; however, the processor configured with a run duration consumed less overall CPU time to do so.

Not all processors support setting a run duration. The nature of the processor function, the methods being used, and/or the client library used may not support this capability. You will not be able to set a run duration on such processors.

How this works:
1. The processor has a thread assigned to its task.
2. The processor grabs the highest-priority FlowFile (or batch of FlowFiles) from the "active queue" of the incoming connection.
3. If processing of the FlowFile(s) has not exceeded the configured run duration, another FlowFile (or FlowFile batch) is pulled from the active queue.
4. This process continues, all under that same thread, until the run duration has been reached or the "active queue" is empty. At that time the session is completed and all outbound FlowFiles are committed at once to the appropriate relationship.

Since no FlowFiles are committed until the entire run completes, some latency is introduced on the FlowFiles. Your configured run duration dictates the minimum latency that will occur. If the execution of the processor against a single FlowFile takes longer than the configured run duration, there is no added benefit to adjusting this configuration.

What this means for heap usage: Since the processor is only handling incoming FlowFiles from the "active queue", there is no added heap pressure there (FlowFiles in the "active queue" are already in heap space). The FlowFiles being generated (if any, depending on processor function) are all held in heap until the final commit. This may introduce some additional heap pressure versus not using a run duration, since all those new FlowFiles will be held in heap until they are committed to an output relationship at the end of the run duration.
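The loop described in the steps above can be sketched as a toy shell model (an illustration only; the queue contents, run duration, and "commit" are stand-ins, not real NiFi behavior):

```shell
# Toy model: keep pulling from the queue until the run duration elapses or
# the queue is empty, then "commit" the whole batch at once.
run_duration_ms=25                    # arbitrary demo value
start=$(date +%s%N)
batch=""
for flowfile in f1 f2 f3 f4 f5; do    # stand-in for the active queue
  batch="$batch $flowfile"
  elapsed_ms=$(( ( $(date +%s%N) - start ) / 1000000 ))
  [ "$elapsed_ms" -ge "$run_duration_ms" ] && break
done
echo "committed:$batch"               # nothing is visible downstream until here
```

The single commit at the end is exactly why run duration trades a little latency for lower per-FlowFile overhead.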
08-02-2018
06:05 PM
3 Kudos
Have you ever noticed some lingering old rolled log files in your NiFi logs directory that never seem to get deleted? This is a by-product of how logback works, depending on how you have it configured.

Let's take a look at a default logback.xml configuration from NiFi:

<appender name="APP_FILE">
<file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app.log</file> <rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
<!--
For daily rollover, use 'app_%d.log'.
For hourly rollover, use 'app_%d{yyyy-MM-dd_HH}.log'.
To GZIP rolled files, replace '.log' with '.log.gz'.
To ZIP rolled files, replace '.log' with '.log.zip'.
-->
<fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
<maxFileSize>100MB</maxFileSize>
<!-- Control the maximum number of log archive files kept and asynchronously delete older files -->
<maxHistory>30</maxHistory>
<!-- optional setting for keeping 10GB total of log files
<totalSizeCap>10GB</totalSizeCap>
-->
</rollingPolicy>
<immediateFlush>true</immediateFlush>
<encoder>
<pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
</encoder>
</appender>

The above app log configuration will log to a file named nifi-app.log. Once that file reaches either 100 MB in size or crests the top of the hour, it will be rolled. You may end up with numerous log files within a single hour if there is an excessive amount of logging occurring in your NiFi.

A "maxHistory" of 30 means that the logger will only keep 30 hours (HH) of rolled logs. But that is not the full story of how logback works: maxHistory not only controls the number of hours to keep, but also the maximum age of logs to evaluate for deletion. So log files left around that are more than 30 hours old are ignored when the deletion thread runs.

This naturally raises the question of how these files got left behind in the first place. Typically this occurs if files pass the 30-hour age mark while the application is stopped. When the application is restarted, those older files end up getting ignored. While the application is continuously running, this works as one would normally expect.

To simply clean up these older rolled log files, you could run a touch command on them so their filesystem timestamps update and they are no longer more than 30 hours old. They will then be considered within the 30-hour window and be deleted once the "maxHistory" count reaches 30.

However, the above is not a permanent solution. I recommend instead controlling file deletion via the "totalSizeCap" setting (commented out by default in the NiFi logback.xml). It offers a couple of advantages:

1. The "%i" option in the fileNamePattern says to create sequentially numbered log files every "maxFileSize" (100MB) within each hour. This helps prevent any one log file from getting too large, but has the downside that these files are not individually counted by "maxHistory". So "maxHistory" set to 15 means 15 hours of logs, even if each hour contains 2000 100MB log files. You can see that under heavy logging you can end up using a lot of log space.

2. "totalSizeCap" will start deleting old rolled log files as long as the log file date is within the "maxHistory" age. So let's say we want to retain up to 100 GB of log history. We would set "maxHistory" to some very large value like 8760 (~1 year of hours) and set "totalSizeCap" to 100GB, provided you hit 100 GB before you hit 8,760 hours.

Here is an example configuration:

<appender name="APP_FILE">
<file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app.log</file>
<rollingPolicy class="ch.qos.logback.core.rolling.SizeAndTimeBasedRollingPolicy">
<!--
For daily rollover, use 'app_%d.log'.
For hourly rollover, use 'app_%d{yyyy-MM-dd_HH}.log'.
To GZIP rolled files, replace '.log' with '.log.gz'.
To ZIP rolled files, replace '.log' with '.log.zip'.
-->
<fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
<maxFileSize>10MB</maxFileSize>
<!-- keep up to 8,760 hours worth of log files -->
<maxHistory>8760</maxHistory>
<!-- optional setting for keeping 100 GB total of log files -->
<totalSizeCap>100GB</totalSizeCap>
<!-- archive removal will be executed on appender start up -->
<cleanHistoryOnStart>true</cleanHistoryOnStart>
</rollingPolicy>
<immediateFlush>true</immediateFlush>
<encoder>
<pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
</encoder>
</appender>

Of course, there is always a chance you could hit 8,760 hours' worth of logs before reaching 100 GB of generated app logs, so you may need to tailor these settings based on the app log sizes being generated by your particular running NiFi.
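The touch workaround described earlier can be scripted. The demo below uses a scratch directory (the log directory path and filename are stand-ins; point the `find` at your real logs directory):

```shell
logdir=./nifi-logs.demo               # stand-in for NiFi's logs directory
mkdir -p "$logdir"
# Simulate a rolled log orphaned while NiFi was stopped (older than 30 hours):
touch -d '40 hours ago' "$logdir/nifi-app_2018-08-01_10.0.log"
# Refresh timestamps so logback's next cleanup pass sees the files as inside
# the maxHistory window and removes them through the normal deletion path:
find "$logdir" -name 'nifi-app_*.log' -exec touch {} +
find "$logdir" -name 'nifi-app_*.log' -mmin -60   # now "recent" again
```

Note `touch -d` is a GNU coreutils option; BSD/macOS touch uses a different syntax.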
08-01-2018
04:54 PM
@Romain Guay I am not sure I am following your comments completely. Keep in mind that this article was written against Apache NiFi 0.x versions. The look of the UI and some of the configuration/capabilities relevant to RPGs have changed as of Apache NiFi 1.x.

When you say "source NiFi", are you referring to the NiFi instance with the RPG or the NiFi instance with an input or output port?

Keep in mind the following:
1. The NiFi with the RPG on the canvas is always acting as the client. It will establish the connection to the target instance/cluster.
2. An RPG added to the canvas of a NiFi cluster is running on every node in that cluster with no regard for any other node in the cluster.
3. An RPG regularly connects to the target NiFi cluster to retrieve S2S details, which include the number of nodes, load on nodes, available remote input/output ports, etc. (Even if the URL provided in the RPG is of a single node in the target cluster, the details collected will be for all nodes in the target cluster.)
4. A node distribution strategy is calculated based on the details collected.

During the actual sending of FlowFiles to a target NiFi instance/cluster remote input port, the number of FlowFiles sent is based on the configured port properties in the RPG. It may be the case that those settings are at their defaults, so FlowFiles are not load-balanced very well.

During the actual retrieving of FlowFiles from a target NiFi instance/cluster remote output port, the RPG will round-robin the nodes in the target NiFi, pulling FlowFiles from the remote output port based on the port configuration properties in the RPG. So it may be that one source node has an RPG that runs before the others, connects, and is allocated all FlowFiles on the target remote output port before any other node in the source NiFi cluster runs. There are some limitations in load-balancing using such a get/pull setup.

For more info on configuring your remote ports via the RPG, see the following article: https://community.hortonworks.com/content/kbentry/109629/how-to-achieve-better-load-balancing-using-nifis-s.html
*** The above article is based on Apache NiFi 1.2+ versions of the RPG.

Thanks, Matt