Member since: 07-30-2019
Posts: 3406
Kudos Received: 1621
Solutions: 1006
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 84 | 12-17-2025 05:55 AM |
| | 145 | 12-15-2025 01:29 PM |
| | 99 | 12-15-2025 06:50 AM |
| | 224 | 12-05-2025 08:25 AM |
| | 380 | 12-03-2025 10:21 AM |
04-18-2018
02:04 PM
3 Kudos
@Rahoul A
The rules that govern log rotation and retention are all configured in the NiFi logback.xml file.
- If you find that log files are not being deleted, verify that the user running the NiFi service has enough open file handles allocated to it. Not having enough file handles can result in failed log rotations and clean-up.
- While logged in as the user who owns the NiFi process, execute the following command: ulimit -a
- The recommended starting value is 50000. Depending on your NiFi dataflow and data volumes, this value may need to be even larger, for example 999999. (A minimal example of raising the limit persistently is sketched below.)
- Thanks, Matt
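As a minimal sketch (assuming the NiFi service runs as a local user named "nifi" and your OS honors /etc/security/limits.conf), the open file limit could be raised persistently like this:
# /etc/security/limits.conf -- hypothetical entries; adjust the user name and values to your environment
nifi  soft  nofile  50000
nifi  hard  nofile  50000
# After logging back in as that user, confirm the new limit with:
ulimit -n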
04-17-2018
01:07 PM
1 Kudo
@Rahoul A Unfortunately, you can only have one client writing/appending to the same file in HDFS at a time. The nature of this append capability in HDFS does not mesh well with the NiFi architecture of concurrent, parallel operations across multiple nodes. NiFi nodes each run their own copy of the dataflow and work on their own unique set of FlowFiles. While NiFi nodes do communicate health and status heartbeats to the elected cluster coordinator, dataflow-specific information, such as which node is currently appending to a specific filename in the same target HDFS cluster, is not shared. From a performance design aspect, it makes sense not to do this.
- So, aside from the above work-around which reduces the likelihood of conflict, you can also:
1. After whatever preprocessing you perform on the data in NiFi before pushing to HDFS, route all data to a dedicated node in your cluster (with a failover node; think PostHTTP with its failure relationship feeding another PostHTTP) for the final step of appending to your target HDFS.
2. Install a standalone edge instance of NiFi that simply receives the processed data from your NiFi cluster and writes/appends it to HDFS.
- Thanks, Matt
04-17-2018
12:53 PM
1 Kudo
@sri chaturvedi While the above doc is intended to set you on the right path in terms of deploying a well-implemented NiFi setup, it will not help with your dataflow design, implementation, or hardware limitations. You have to monitor your systems while your dataflow is running for things like:
1. CPU utilization (If CPU utilization is always low, consider increasing the "Max Timer Driven Thread" pool allocated for your NiFi dataflow components. Maybe add an extra concurrent task here and there in your flow where there are bottlenecks, and experiment with processor run duration.)
2. Disk performance (specifically the crucial NiFi repository and log disks)
3. Memory performance (Monitor garbage collection: are there frequent collections resulting in considerable stop-the-world impact on your dataflow? If so, you may need to look at your dataflow design and look for ways to reduce heap usage.)
4. Network performance.
A few example monitoring commands are sketched below. Thanks, Matt
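Purely as an illustration of the kind of monitoring described above (assuming a Linux host with the usual sysstat and JDK tools installed; replace <nifi-pid> with the PID of the NiFi Java process):
top -H -p <nifi-pid>           # per-thread CPU usage of the NiFi JVM
iostat -x 5                    # per-disk utilization and wait times (watch the NiFi repository and log disks)
jstat -gcutil <nifi-pid> 5000  # garbage collection activity sampled every 5 seconds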
04-16-2018
06:59 PM
@Josh Nicholson NiFi uses logback for logging of component classes in NiFi. It is possible to add a logger that will turn off all logging for a specific class, but not just for a specific error.
- First you need the complete Java class name. Using simply "o.a.r.audit.provider.BaseAuditHandler" will not work. We need to figure out what the "o", "a", and "r" really are. To get NiFi to print out the entire class name, you are going to want to edit the following section in the logback.xml: <appender name="APP_FILE">
<file>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app.log</file>
<rollingPolicy>
<!--
For daily rollover, use 'app_%d.log'.
For hourly rollover, use 'app_%d{yyyy-MM-dd_HH}.log'.
To GZIP rolled files, replace '.log' with '.log.gz'.
To ZIP rolled files, replace '.log' with '.log.zip'.
-->
<fileNamePattern>${org.apache.nifi.bootstrap.config.log.dir}/nifi-app_%d{yyyy-MM-dd_HH}.%i.log</fileNamePattern>
<maxFileSize>100MB</maxFileSize>
<!-- keep 30 log files worth of history -->
<maxHistory>30</maxHistory>
<!-- optional setting for keeping 10GB total of log files
<totalSizeCap>10GB</totalSizeCap>
-->
</rollingPolicy>
<immediateFlush>true</immediateFlush>
<encoder>
<pattern>%date %level [%thread] %logger{40} %msg%n</pattern>
</encoder>
</appender>
The third line from the bottom is the "pattern" line. Change "{40}" to "{140}" and save the change.
- Tail the nifi-app.log and wait for the next occurrence of the above error; the log line will now show the full class name.
- *** The nice thing about the logback.xml file is that it can be edited while NiFi is running, and the changes will take effect 30 seconds to 2 minutes after you save the file.
- Next we will want to edit the logback.xml file again to add a new logger. Look for the following line: <logger name="org.apache.nifi" level="INFO"/>
- Then after this line add a new line similar to the above, except using the full class name you obtained: <logger name="org.apache.ranger.audit.provider.BaseAuditHandler" level="OFF"/>
- Make sure you set the level to "OFF" (all caps). Save the file and you are done. Both edits are summarized in the snippet below.
- Thank you, Matt
- **** If you found this answer addressed your question, please take a moment to login and click "accept"
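Putting both edits together, the relevant logback.xml fragments would look roughly like this (the Ranger class name is just the one from this example; substitute whatever full class name your log prints):
<!-- Step 1: widen the logger name in the APP_FILE appender pattern so the full class name is printed -->
<pattern>%date %level [%thread] %logger{140} %msg%n</pattern>
<!-- Step 2: once the full class name is known, silence that class with a dedicated logger -->
<logger name="org.apache.nifi" level="INFO"/>
<logger name="org.apache.ranger.audit.provider.BaseAuditHandler" level="OFF"/>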
04-16-2018
04:50 PM
1 Kudo
@dhieru singh This issue is likely being caused by https://issues.apache.org/jira/browse/NIFI-3389. The above bug was addressed starting in Apache NiFi 1.2.0 and HDF 2.1.2. The bug occurs when a NiFi attribute being written to the FlowFile repository is larger than 64 KB. This can usually be mapped back to a specific dataflow the end-user has designed that uses a NiFi processor capable of extracting actual content into a FlowFile attribute, such as ExtractText or EvaluateJsonPath. Users must upgrade to a newer version of NiFi with this fix or redesign their dataflow so they are not creating attributes larger than 64 KB. Thank you, Matt
04-13-2018
05:21 PM
@Abdou B. HDF is never running exactly the same version of Apache NiFi as you would find in the open community. Each HDF release is based off an Apache release version as the baseline, with many bug fixes and/or enhancements added on top. So you may find Apache bugs that were fixed in Apache NiFi 1.6 which are already fixed in the HDF 3.1 release. Matt
04-13-2018
03:08 PM
1 Kudo
@Abdou B. Not sure I follow. We are talking about whitelisting configuration needed for NiFi and not Ambari. The specific nifi.properties property that is used to add a whitelist of allowed HTTP headers is found in the NiFi admin guide: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#web-properties (an illustrative entry is shown below). Thanks, Matt
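For illustration only (the exact property name can vary by NiFi/HDF version, so treat the admin guide above as authoritative, and the host value here is a made-up example), a host header whitelist entry in nifi.properties looks something like:
nifi.web.proxy.host=proxy.example.com:443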
04-13-2018
12:25 PM
@r r Your use case raises a lot of questions. Is this a one-time move of a single folder? If so, NiFi may not be the fastest solution here. NiFi is designed primarily for efficient, continuously running dataflows. NiFi has guaranteed-delivery mechanisms and data tracking through built-in provenance. NiFi is also data agnostic; it accomplishes this by wrapping the bits of content in a FlowFile. There will be some overhead associated with these things.
- What does "a bunch of files" mean? Are you talking a few hundred, thousand, or million files?
- It is good to start by asking yourself "how would you do this without using NiFi?".
- NiFi could be installed locally on the system where the files exist and use the ListFile and FetchFile processors to consume the files, then push those files to the target system.
- NiFi could reside on the target system and retrieve the files from the source system via ListSFTP and FetchSFTP processors (just examples).
- As far as making sure all files are moved, NiFi list-based processors maintain state on which files have been listed. As long as NiFi has access to all the files in the source folder, it will list all of them. The corresponding fetch processor will then pull the content for each of those files (it has a failure relationship that can be routed for retry in the event a file fails to be fetched). Do you have a known count of source files?
- What do you mean by visible? Visible to whom? This is where it becomes a little more tricky. Perhaps you could use NiFi to tar all the files together, then move that tar to the new location, and after successfully writing that tar or zip, use NiFi to execute a script on the target to unpack it.
- Thank you, Matt
- If you found this answer addressed your question, please take a moment to login and click "accept" on the answer.
04-12-2018
06:20 PM
13 Kudos
Short Description:
A NiFi connection is where FlowFiles are temporarily held between two connected NiFi processor components. Each connection that contains queued FlowFiles has a footprint in the JVM heap. This article will break down a connection to show how NiFi manages the FlowFiles queued in that connection and how that affects heap and performance.
Article:
First let me share the 10,000 foot view and then I will discuss each aspect of the following image:
*** NiFi FlowFiles consist of FlowFile content and FlowFile attributes/metadata. FlowFile content is never held in a connection's heap space. Only the FlowFile attributes/metadata are placed in heap by a connection.
The "Connection Queue":
The connection queue is where all FlowFiles queued in the connection are held. To understand how these queued FlowFiles affect performance and heap usage, let's start by focusing on the "Connection Queue" dissection at the bottom of the above image. The overall size of a connection is controlled by the "Back Pressure Object Threshold" and "Back Pressure Data Size Threshold" settings the user defines per connection.
Back Pressure Object Threshold and Back Pressure Data Size Threshold:
The "Back Pressure Object Threshold" defaults to 10,000 FlowFiles. The "Back Pressure Data Size Threshold" defaults to 1 GB. Both of these settings are soft limits, meaning they can be exceeded. As an example, let's assume the default settings above and a connection that already contains 9,500 FlowFiles. Since the connection has not reached or exceeded the object threshold yet, the processor feeding that connection is allowed to run. If that feeding processor produces 2,000 FlowFiles when it executes, the connection grows to 11,500 queued FlowFiles. The feeding processor is then not allowed to execute again until the queue drops back below the configured threshold. The same holds true for the Data Size threshold; data size is based on the cumulative reported size of the content associated with each queued FlowFile.
Now that we know how the overall size of the "connection queue" is controlled, let's break it down into its parts:
1. ACTIVE queue: FlowFiles entering a connection are initially placed in the active queue. FlowFiles continue to be placed into this queue until it reaches the globally configured NiFi swap threshold. All FlowFiles in the active queue are held in heap memory. The processor consuming FlowFiles from this connection always pulls from the active queue. The size of the active queue per connection is controlled by the following property in the nifi.properties file:
nifi.queue.swap.threshold=20000
Increasing the swap threshold increases the potential heap footprint of every single connection in your dataflow(s).
2. SWAP queue: Based on the above default setting, once a connection reaches 20,000 FlowFiles, new FlowFiles entering the connection are placed in the swap queue. The swap queue is also held in heap and is hard coded to 10,000 FlowFiles max. If space is freed in the active queue and no swap files exist, FlowFiles in the swap queue are moved directly to the active queue.
3. SWAP FILES: Each time the swap queue reaches 10,000 FlowFiles, a swap file containing those FlowFiles is written to disk. At that point new FlowFiles are again written to the swap queue. Many swap files can be created. Using the image above, where the connection contains 80,000 FlowFiles, there would be 30,000 FlowFiles in heap and 5 swap files (this arithmetic is worked out at the end of this article). Once the active queue has freed 10,000 FlowFiles, the oldest swap file is moved into the active queue; this repeats until all swap files are gone. Because swap files must be written to and read from disk, having a lot of swap files produced across your dataflow will affect the throughput performance of your dataflow(s).
4. IN-FLIGHT queue: Unlike the above three, the in-flight queue only exists when the processor consuming from this connection is running. The consuming processor pulls FlowFiles from the active queue and places them in the in-flight queue until processing has successfully completed and those FlowFiles have been committed to an outbound connection of the consuming processor. This in-flight queue is also held in heap. Some processors work on one FlowFile at a time, others work on batches of FlowFiles, and some have the potential of working on every single FlowFile in an incoming connection queue. In the last case, this could mean high heap usage while those FlowFiles are being processed. The example above is one of those potential cases, using the MergeContent processor. The MergeContent processor places FlowFiles from the active queue into virtual bins. How many bins there are and what makes a bin eligible for merge is governed by the processor configuration. What is important to understand is that it is possible for every FlowFile in the connection to make its way into the "in-flight queue". In the image example, if the MergeContent were running, all 80,000 queued FlowFiles would likely be pulled into heap via the in-flight queue.
Takeaways from this article:
1. Control heap usage by limiting the size of connection queues when possible. (Of course, if your intention is to merge 40,000 FlowFiles, there must be 40,000 FlowFiles in the incoming connection. However, you could have two MergeContent processors in series, each merging smaller bundles, with the same end result and less overall heap usage.)
2. With the default back pressure object threshold settings, there will be no swap files produced on most connections (remember, soft limits), which results in better throughput performance.
3. The default swap threshold of 20,000 is a good balance of active queue size and performance in most cases. For smaller flows you may be able to push this higher, and for extremely large flows you may want to set it lower. Just understand it is a trade-off of heap usage for performance. But if you run out of heap, there will be zero performance.
Thank you, Matt
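Working through the 80,000-FlowFile example from the swap files section above, using the default swap threshold of 20,000 and the hard-coded 10,000-FlowFile swap queue/swap file size:
active queue (heap):   20,000 FlowFiles
swap queue (heap):     10,000 FlowFiles
swapped to disk:       80,000 - 30,000 = 50,000 FlowFiles
swap files on disk:    50,000 / 10,000 = 5 files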
04-12-2018
02:20 PM
2 Kudos
@Bharadwaj Bhimavarapu
Processors within the body of a dataflow should NOT be configured to use the "Primary node" only execution strategy. The only processors that should be scheduled to run on "Primary node" only are data-ingest type processors that do not use cluster-friendly protocols. The most common non-cluster-friendly ingest processors have "List<type>" names (ListSFTP, ListHDFS, ListFTP, ListFile, ...).
- When a node is no longer elected as the primary node, it will stop scheduling only those processors set for "Primary node" only execution. All other processors will continue to execute. The newly elected primary node will begin executing its "Primary node" only scheduled processors. These processors are generally designed to record cluster-wide state about where the previous primary node's execution left off, so the same processor executing on the new primary node picks up where the other left off.
- This is why it is important that any processor that takes an incoming connection from another processor is not scheduled for "Primary node" only execution. If the primary node changes, you still want the original primary node to continue processing the data queued downstream of its "Primary node" only ingest processors.
- There is no way to designate a specific node in a NiFi cluster as the primary node. It is important to make sure that any one of your nodes is capable of executing the primary node processors at any time.
- ZooKeeper is responsible for electing both the primary node and the cluster coordinator in a NiFi cluster. If your GC cycles are affecting the ability of your nodes to communicate with ZK in a timely manner, this may explain the constant election changes by ZK in your cluster. My suggestion would be to adjust the ZK timeouts in NiFi (the defaults are only 3 secs, which is far from ideal in a production environment). The following properties can be found in the nifi.properties file:
nifi.zookeeper.session.timeout=60 secs
nifi.zookeeper.connect.timeout=60 secs
*** If using Ambari to manage your HDF cluster, make the above changes via the NiFi configs in Ambari.
- Thanks, Matt
- If you found this answer addressed your initial question, please take a moment to login and click "accept" on the answer.