Member since: 07-30-2019
Posts: 3406
Kudos Received: 1623
Solutions: 1008
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 322 | 12-17-2025 05:55 AM |
|  | 383 | 12-15-2025 01:29 PM |
|  | 367 | 12-15-2025 06:50 AM |
|  | 358 | 12-05-2025 08:25 AM |
|  | 600 | 12-03-2025 10:21 AM |
12-13-2022
06:35 AM
@MaarufB You must have a lot of logging enabled if you expect multiple 10MB app.log files per minute. Was NiFi ever rolling files? Check your NiFi app.log for any Out of Memory (OOM) exceptions. It does not matter which class is throwing the OOM(s); once the NiFi process is having memory issues, it impacts everything within that service. If this is the case, you'll need to make changes to your dataflow(s) or increase the NiFi heap memory. Secondly, check to make sure you have sufficient file handles for your NiFi process user. For example, if your NiFi service is owned by the "nifi" user, make sure the open file limit is set to a very large value for this user (999999). A restart of the NiFi service is required before the change to file handles will be applied. If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
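On most Linux distributions, the per-user open file limit can be raised in /etc/security/limits.conf (a minimal sketch only; the "nifi" user name and the 999999 value are taken from the example above, adjust them to your environment):

```
# /etc/security/limits.conf
# Raise the open-file (nofile) limits for the user that owns the NiFi process.
nifi  soft  nofile  999999
nifi  hard  nofile  999999
```

After editing, restart the NiFi service so the new limits are picked up; running `ulimit -n` as the service user should then report the new value.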
12-12-2022
06:01 AM
@Onkar_Gagre The Max Timer Driven Thread pool setting is applied to each node individually. NiFi nodes configured as a cluster are expected to be running on the same hardware configuration. The guidance of 2 to 4 times the number of cores as a starting point is based on the cores of a single node in the cluster, not on the cumulative cores across all NiFi cluster nodes. You can only reduce wait time by reducing load on the CPU. In most cases, threads given out to NiFi processors execute for only milliseconds. But some processors operating against the content can take several seconds or much longer, depending on the function of the processor and/or the size of the content. When the CPU is saturated, these threads will take even longer to complete as the CPU is giving time to each active thread. Since only 8 threads at a time per node can actually execute concurrently, a thread only gets a short slice of time before yielding to another. The pauses in between are the CPU wait time as queued threads wait for their turn to execute. So reducing the Max Timer Driven Thread count (a restart is required for a reduction to be applied) would reduce the maximum threads sent to the CPU concurrently, which would reduce CPU wait time. Of course, this means less concurrency in NiFi. Sometimes you can reduce CPU through different flow designs, which is a much bigger discussion than can be handled efficiently via the community forums. Other times, your dataflow simply needs more CPU to handle the volumes and rates you are looking to achieve. CPU and Disk I/O are the biggest causes of slowed data processing. If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
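The per-node sizing guidance above can be expressed as a quick calculation (a sketch only; the 2x-4x multipliers come from the starting-point recommendation in this thread, and the function name is illustrative):

```python
def timer_driven_thread_pool_range(cores_per_node: int) -> tuple:
    """Starting-point guidance: size the Max Timer Driven Thread pool at
    2x to 4x the core count of a SINGLE node, not the whole cluster."""
    return 2 * cores_per_node, 4 * cores_per_node

# An 8-core node gives a starting range of 16 to 32 threads,
# regardless of how many nodes are in the cluster.
print(timer_driven_thread_pool_range(8))  # (16, 32)
```

From that starting point, adjust upward only in small increments while watching CPU utilization, as described above.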
12-09-2022
02:00 PM
@F_Amini @Green_ is absolutely correct here. You should be careful when increasing concurrent tasks, as blindly increasing them everywhere can have the opposite effect on throughput. I recommend setting the concurrent tasks back to 1, or maybe 2, on all the processors where you have adjusted away from the default of 1 concurrent task. Then take a look at the processor further downstream in your dataflow that has a red (backpressure) input connection but black (no backpressure) outbound connections. This processor, as @Green_ mentioned, is the one causing all your upstream backlog. You'll want to monitor your CPU usage as you make small incremental adjustments to this processor's concurrent tasks until you see the upstream backlog start to come down. If while monitoring CPU you see it spike pretty consistently at 100% usage across all your cores, then your dataflow has pretty much reached the max throughput it can handle for your specific dataflow design. At this point you need to look at other options, like setting up a NiFi cluster where this workload can be spread across multiple servers, or designing your dataflow differently with different processors to accomplish the same use case in a way that may have a lesser impact on CPU (not always a possibility). Thanks, Matt
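The bottleneck-hunting heuristic described above can be sketched in a few lines (illustrative only; the processor names and the boolean flow model are hypothetical, not a NiFi API):

```python
def find_bottleneck(processors):
    """Return the first processor whose input connection is backpressured
    (red) while its outbound connections are not (black) -- the heuristic
    from the advice above for locating the processor causing the backlog."""
    for name, (input_backpressured, outputs_backpressured) in processors.items():
        if input_backpressured and not outputs_backpressured:
            return name
    return None

# Hypothetical flow state: {processor: (input backpressured?, outputs backpressured?)}
flow = {
    "ConsumeKafka": (False, True),
    "QueryRecord": (True, False),   # red input, clear outputs -> bottleneck
    "PutHDFS": (False, False),
}
print(find_bottleneck(flow))  # QueryRecord
```

Once identified, that is the processor whose concurrent tasks you tune in small increments while watching CPU.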
12-09-2022
01:28 PM
@Onkar_Gagre Let's take a look at concurrent tasks here. You have an 8 core machine. You have a ConsumeKafka configured with 8 concurrent tasks and 4 nodes. I hope this means your Kafka topic has 32 partitions, because that processor creates a consumer group with the 8 consumers from each node as part of that consumer group. Kafka will only assign one consumer from a consumer group to a partition. So having more consumers than partitions gains you nothing, but can cause performance issues caused by rebalancing. Then you have a QueryRecord with 40 concurrent tasks per node. Each allocated thread across your entire dataflow needs time on the CPU. So just between these two processors alone, you are scheduling up to 48 concurrent threads that must be handled by only 8 cores. Based on your description of data volume, it sounds like there is a lot of CPU wait when you enable this processor, as each thread only gets a fraction of time on the CPU and thus takes longer to complete its task. It sounds like you need more cores to handle your dataflow, and this is not necessarily an issue specific to the use of the QueryRecord processor. While you may be scheduling concurrent tasks too high for your system on the QueryRecord processor, the scheduled threads come from the Max Timer Driven Thread pool set in your NiFi. The default is 10, and I assume you increased this to accommodate the concurrent tasks you have been assigning to your individual processors. The general starting recommendation for the Max Timer Driven Thread pool setting is 2 to 4 times the number of cores on your node. So with an 8 core machine that recommendation would be 16 - 32. The decision/ability to set that even higher is all about your dataflow behavior along with your data volumes. It requires you to monitor CPU usage and adjust the pool size in small increments. Once CPU is maxed, there is not much more to gain short of adding more CPU.
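The consumer/partition arithmetic above can be made concrete (a sketch; the function name is illustrative, but the one-consumer-per-partition rule is standard Kafka consumer-group behavior):

```python
def kafka_consumer_utilization(nodes, tasks_per_node, partitions):
    """Kafka assigns at most one consumer from a consumer group to each
    partition, so consumers beyond the partition count simply sit idle
    (and add rebalance overhead). Returns (total, active, idle)."""
    consumers = nodes * tasks_per_node
    active = min(consumers, partitions)
    return consumers, active, consumers - active

# 4 nodes x 8 concurrent tasks against a 32-partition topic: all busy.
print(kafka_consumer_utilization(4, 8, 32))  # (32, 32, 0)
# Same flow against a 16-partition topic: half the consumers are idle.
print(kafka_consumer_utilization(4, 8, 16))  # (32, 16, 16)
```

This is why the concurrent tasks on ConsumeKafka should be sized from the topic's partition count, not from available CPU alone.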
If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
12-09-2022
01:09 PM
@Techie123 Can you provide more detail around your requirement that "the FFs order is also important"? My initial thought here would be a two-phase merge. In the first merge you utilize a correlation FlowFile attribute you create on each FlowFile based off the employee ID extracted from the record, setting min number of entries to 7 and max to 10. Then you take these employee-merged records and merge them together into larger FlowFiles using MergeRecord. The question is whether 100 records per FlowFile is a hard limit, which it is not: the MergeRecord processor's max number of records is a soft limit. Let's assume we set this to 100. Say one of your merged employee records comes to the MergeRecord with 7 records in it for that employee ID, yet the bin already has 98 records in it. Since the bin max has not been reached yet, this merged FlowFile still gets added and results in a merged FlowFile with 105 records. If you must keep it at or under 100 records per FlowFile, set the max records to 94. If after adding a set of merged employee records the bin is still below 94, another merged employee record would be added, and since you stated each set of merged employee records could be up to 7 records, this keeps you at or below 100 in that single merged FlowFile. If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
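The soft-limit behavior described above can be simulated to check the arithmetic (a simplified sketch of MergeRecord's bin-filling, not its actual implementation; it ignores timing and correlation):

```python
def merge_soft_limit(groups, max_records):
    """Simulate a soft record limit: a bin keeps accepting whole incoming
    record groups until its count reaches max_records, so a bin can
    overshoot the limit by up to (incoming group size - 1) records."""
    bins, current = [], 0
    for g in groups:
        if current >= max_records:  # bin is full; start a new one
            bins.append(current)
            current = 0
        current += g                # whole group is added, never split
    if current:
        bins.append(current)
    return bins

# Max 100: a bin holding 98 still accepts a 7-record group -> 105 records.
print(merge_soft_limit([98, 7], 100))  # [105]
# Max 94 with groups of at most 7: worst case 93 + 7 stays at 100.
print(merge_soft_limit([93, 7], 94))  # [100]
```

Setting the soft limit to (hard cap - worst-case group size + 1) is what keeps the merged output at or under the cap.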
12-06-2022
12:48 PM
@Onkar_Gagre 1. What is the CPU and memory usage of your NiFi instances when the QueryRecord processor is stopped? 2. How is your QueryRecord processor configured, including scheduling and concurrent task settings? What other processors were introduced as part of this new dataflow? 3. What does disk I/O look like while this processor is running? The NiFi documentation does not mention any CPU- or memory-specific resource considerations for this processor. Thanks, Matt
12-05-2022
11:33 AM
1 Kudo
@Ghilani NiFi stores templates in the flow.xml.gz file. The flow.xml.gz is just a compressed copy of the dataflow(s) which reside inside NiFi's heap memory while NiFi is running, so it is not recommended to keep templates in your NiFi. NiFi templates are also deprecated and will go away in the next major release. It is recommended to use NiFi Registry to store version-controlled flows. If not using NiFi Registry, flow definitions should be downloaded instead of creating templates, and stored safely somewhere outside of NiFi itself. A flow definition can be downloaded by right-clicking on a process group in NiFi and selecting "Download flow definition". A JSON file of that flow will be generated and downloaded. Flow definitions can be uploaded to NiFi by dragging the create Process Group icon onto the canvas and selecting the option to upload a flow definition. If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
12-05-2022
11:19 AM
1 Kudo
@dreaminz You can create variables on a process group; those variables are then only available to the process group (scope) on which they were created. NiFi documentation on variables: https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Variables Variables have been deprecated in favor of Parameter Contexts: https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#parameter-contexts You can create a single parameter context, add parameters to it, and then associate that parameter context with multiple process groups. This will allow you to update a parameter in one parameter context and effectively update your flows in multiple process groups. If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
12-05-2022
11:09 AM
@grb Your QueryDatabaseTable processor is failing because the dependent controller service is not yet enabled. It appears that the controller service is stuck trying to enable (enabling) because the SQLServerDriver you have configured in that controller service is not compatible with the Java JDK version you are using to run NiFi. What version of NiFi are you using? What version of Java is your NiFi using? I recommend updating your Java version to the most recent version of Java JDK 8 or Java JDK 11 (version 11 is only supported in NiFi versions 1.10+). Otherwise, you'll need to find an older version of your SQL driver. If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
12-05-2022
10:56 AM
@ajignacio That was a big jump from version 1.9.x to 1.16.x of NiFi. NiFi's data provenance stores, for a configurable amount of time, information about NiFi FlowFiles as they traverse the various processors in your dataflow(s). Over the releases of NiFi, both improvements and new implementations of provenance have been introduced. The original version of provenance was org.apache.nifi.provenance.PersistentProvenanceRepository, which has since been deprecated in favor of a better performing provider class, org.apache.nifi.provenance.WriteAheadProvenanceRepository, which is the new default. The following properties from the nifi.properties file are used to configure the provenance repository:

nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
nifi.provenance.repository.directory.default=./provenance_repository
nifi.provenance.repository.max.storage.time=30 days
nifi.provenance.repository.max.storage.size=10 GB (used to be 1 GB)
nifi.provenance.repository.rollover.size=100 MB
nifi.provenance.repository.query.threads=2
nifi.provenance.repository.index.threads=2
nifi.provenance.repository.compress.on.rollover=true
nifi.provenance.repository.always.sync=false
nifi.provenance.repository.indexed.fields=EventType, FlowFileUUID, Filename, ProcessorID
nifi.provenance.repository.indexed.attributes=
nifi.provenance.repository.index.shard.size=100 MB
nifi.provenance.repository.max.attribute.length=65536
nifi.provenance.repository.concurrent.merge.threads=2
nifi.provenance.repository.warm.cache.frequency=

For details on these properties, here is the Apache NiFi documentation section: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#provenance-repository The good news is that data provenance retention has no direct relationship to the active FlowFiles currently traversing your dataflow(s). This means that you can shut down your NiFi, purge the contents of the current <path to>/provenance_repository directory, adjust the configuration properties as you want, and then restart your NiFi. NiFi will build a new provenance repository on startup. Considering that NiFi only provides limited configurable space (1 GB original default, 10 GB current default) and age (30 days) as the defaults, you would not be losing much if you were to reset. I am also concerned that the path in the error suggests you created your original provenance_repository within a subdirectory of the flowfile_repository, which I would not recommend. I would strongly suggest not writing the contents of any one of the four NiFi repositories within another. Considering the flowfile_repository and content_repository are the two most important repositories for tracking the FlowFiles actively being processed in your dataflow(s), I suggest these each be on their own path and reside on dedicated disks backed by RAID to avoid data loss in the event of a disk failure. If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt