Member since: 07-30-2019
Posts: 3427
Kudos Received: 1632
Solutions: 1011
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 94 | 01-27-2026 12:46 PM |
| | 505 | 01-13-2026 11:14 AM |
| | 1094 | 01-09-2026 06:58 AM |
| | 939 | 12-17-2025 05:55 AM |
| | 1000 | 12-15-2025 01:29 PM |
03-08-2023
06:33 AM
Perfect, it was exactly what I missed. BR, Bruno
03-08-2023
05:28 AM
@New_User There is the following resolved jira for the creation of a new PutIceberg NiFi processor: https://issues.apache.org/jira/browse/NIFI-10442

Apache has size limits on distributions, and it does not appear that this processor nar was included in the Apache release. You would need to add the nifi-iceberg-processors-nar manually to your NiFi installation for it to become available for use.

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click "Accept as Solution" below each response that helped.

Thank you, Matt
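As a rough illustration of what that manual step can look like (the Maven coordinates, version, and whether additional Iceberg services nars are also required are assumptions to verify against your NiFi release):

```bash
# Assumed version and Maven coordinates - verify against your NiFi release
NIFI_VERSION=1.20.0
NIFI_HOME=/opt/nifi

# Download the Iceberg processors nar from Maven Central
wget "https://repo1.maven.org/maven2/org/apache/nifi/nifi-iceberg-processors-nar/${NIFI_VERSION}/nifi-iceberg-processors-nar-${NIFI_VERSION}.nar"

# Drop it into NiFi's extensions (hot-load) directory, or into lib/ followed by a restart
cp "nifi-iceberg-processors-nar-${NIFI_VERSION}.nar" "${NIFI_HOME}/extensions/"
```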
03-01-2023
10:24 AM
@tkchea NiFi Remote Process Groups (RPG) transfer FlowFiles and not just the FlowFile content, so depending on the amount of metadata/attributes on the FlowFile, the amount transferred will be larger.

The RPG fetches Site-to-Site (S2S) details via a background thread that runs every 30 seconds, regardless of whether any FlowFiles exist. These S2S details include information about the target NiFi (number of nodes in the target cluster, load on each node, RAW ports if configured, whether HTTP is enabled, etc.). These details are then used to facilitate the transfer of FlowFiles from the client (RPG) to the target NiFi (with Remote Input or Output Ports). The actual transfer of FlowFiles will happen either over the HTTPS port (used by a lot of other transactions) or via a RAW socket port, depending on configuration.

Since a FlowFile consists of two parts (FlowFile metadata and FlowFile content), there is going to be disk and CPU I/O involved in writing to the flowfile_repository and content_repository, so you may want to monitor those on both source and destination.

When it comes to the mutual TLS handshake, NiFi is not doing anything special here. The client certificate presented is used to identify the client and verify authorization to send to or pull from a remote port. You can also enable SSL handshake debug logging in the NiFi bootstrap.conf file: java.arg.ssldebug=-Djavax.net.debug=ssl,handshake Of course you will see all SSL handshakes in the nifi-bootstrap.log file, including those made when someone accesses the NiFi UI, but this would let you see whether TLS handshakes are systematically slow or only slow between these two networks.

You could also set up an RPG that sends to a remote input port on the same NiFi server. The same TLS handshake will happen there as well. Is it much faster? That would rule out an RPG issue. If it ends up being the network between the NiFi servers, you'll need to investigate there; something like Wireshark may help. Another test might involve using a PostHTTP or InvokeHTTP processor to send to a ListenHTTP or HandleHttpRequest processor on the target server (which can be set up to be secure or insecure using the same keystore and truststore your NiFi instances use).

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click "Accept as Solution" below each response that helped.

Thank you, Matt
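If you want a quick look at raw TCP connect and TLS handshake latency between the two hosts outside of NiFi, a simple sketch (hostname and port are placeholders, and this does not exercise NiFi's client-certificate authorization):

```bash
# Placeholder host/port - point this at the target NiFi's HTTPS (or RAW S2S) port
TARGET=nifi-target.example.com:8443

# Report TCP connect time vs. time to complete the TLS handshake; run it a few
# times from the source NiFi host and compare against a host on the same network.
curl -sk -o /dev/null -w 'tcp_connect=%{time_connect}s tls_handshake=%{time_appconnect}s\n' "https://${TARGET}/nifi"
```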
03-01-2023
09:13 AM
@Girish007 Did you make sure that the NiFi directories and repository directories are excluded from any virus software scanning? That is typically the external force most likely to be making changes to these files. Do you have any other external software or processes scanning or accessing these directories?

Thanks, Matt
02-28-2023
11:28 AM
1 Kudo
@TRSS_Cloudera The issue you have described links to this known issue reported in Apache NiFi: https://issues.apache.org/jira/browse/NIFI-10792

The discussion in the comments of that jira points to a couple of workarounds, along with the negatives of each. From that discussion, it appears the best approach is the development of a new "Excel Record Reader" controller service that could be used by the existing ConvertRecord processor and CSVRecordSetWriter. This is outlined in the following jira: https://issues.apache.org/jira/browse/NIFI-11167

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click "Accept as Solution" below each response that helped.

Thank you, Matt
02-27-2023
06:08 AM
@memad ListSFTP does not actually fetch the content of any files from the target SFTP server. You would need a FetchSFTP processor after ListSFTP to do that. The ListSFTP processor results in FlowFile(s) with metadata about each listed file on the SFTP server. This metadata is then used by the downstream FetchSFTP processor to retrieve the actual content for each FlowFile.

The documentation for the ListSFTP processor covers the attributes that are written to the FlowFile(s): https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.19.1/org.apache.nifi.processors.standard.ListSFTP/index.html

This metadata is present on the FlowFile as FlowFile attributes, and you can manipulate it and do anything you like with it.

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click "Accept as Solution" below each response that helped.

Thank you, Matt
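For reference, a FetchSFTP that follows a ListSFTP typically just reads those attributes back through NiFi Expression Language. A sketch of the usual property values (verify the attribute names against the ListSFTP documentation for your version):

```
# Typical FetchSFTP property values, resolved per FlowFile from the
# attributes that ListSFTP wrote onto it
Hostname    : ${sftp.remote.host}
Port        : ${sftp.remote.port}
Remote File : ${path}/${filename}
```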
02-21-2023
09:20 AM
@PurpleK It is not clear what you mean when you say "Files that are in the 500GB+ range are taking several hours to move onto the unpack stage." No FlowFile(s) are released to a downstream connection until processing of the source file is complete, and the source file will still be represented in the queued count of the connection feeding a processor even while that processor is executing on that FlowFile. When you say "move onto the unpack stage", are you referring to some upstream processor feeding the connection to the UnpackContent processor taking a while to queue the FlowFile on that downstream connection, or to the case where, once the file is queued, it takes a while for UnpackContent to complete execution on it, creating the unpacked FlowFiles and then removing the original zip from the upstream connection queue? Step 1 is to identify the exact place(s) it is slow.

Adding additional concurrent tasks to a processor has no impact on speeding up the execution on a specific source FlowFile. One thread gets assigned to each execution of the processor, and in the case of UnpackContent, each thread executes against one FlowFile from the upstream connection. Adding multiple concurrent tasks will allow multiple upstream FlowFiles to be processed concurrently. IMPORTANT: Increment concurrent tasks slowly while monitoring CPU load averages. Adding too many concurrent tasks on any one processor can impact other processors in your dataflow.

The Event Driven processor scheduling strategy is deprecated, will eventually go away (most likely in the next major release), and should not be used. So increasing the Max Event Driven Thread count under controller settings will have no impact unless you are using that strategy in your flow. It does create event threads, but they will not consume CPU if you are not using event driven scheduling anywhere in your dataflow(s).

NiFi is a data agnostic service, meaning it can handle any data type in its raw binary format. NiFi can do this because it wraps that binary content in a NiFi FlowFile. A NiFi FlowFile is what you see moving from processor to processor in your dataflows, and it becomes the responsibility of the processor to understand the FlowFile's content should it need to read it. I bring this up because a FlowFile adds a small bit of overhead, as NiFi has to generate FlowFile metadata for every FlowFile created. When it comes to your 500GB+ zip files:

1. Do they consist of many small and/or large files? NiFi must create a FlowFile for each file that results from unpacking the original zip.

2. Do you see a lot of Java Garbage Collection (GC) pauses happening? All GC is stop-the-world. GC is normal operation for any JVM, but if GC is happening very often it can impact flow performance with constant pauses due to the stop-the-world nature of GC. The larger the JVM memory, the longer the stop-the-world event will be. (See the GC logging sketch below.)

3. Any exceptions in your nifi-app.log?

You may also find this article helpful; it is old, but the majority of the guidance is still very valid. The latest NiFi versions support Java 8 and Java 11, so you can ignore the G1GC recommendations if you are using Java 11: https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-high/ta-p/244999

Hopefully adding concurrent tasks to the processor(s) executing against the content of large FlowFiles will help you better utilize your hardware and achieve overall better throughput. Keep in mind that concurrent tasks only allow concurrent execution on multiple source FlowFiles, so they will not improve the speed at which a single FlowFile is processed by a given processor.

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click "Accept as Solution" below each response that helped.

Thank you, Matt
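On point 2 above (GC pauses), one way to get visibility is to turn on GC logging through an extra JVM argument in NiFi's conf/bootstrap.conf. A minimal sketch, assuming Java 11 unified logging; the argument number and log path are placeholders:

```
# Extra java.arg entry in conf/bootstrap.conf (the numeric suffix just needs to be
# unique); writes rotating GC logs you can scan for long pause times.
java.arg.20=-Xlog:gc*:file=/opt/nifi/logs/gc.log:time,uptime:filecount=5,filesize=10m
```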
02-13-2023
09:51 AM
It just seems odd that this policy isn't created by default, as this is part of the REST API, or at least better documented in the REST API docs...
02-13-2023
09:23 AM
Not ruling out something environmental here, but what is being observed is that validation works while processor execution does not, even though both of those processes should be using the same basic code. The 3 loggers suggested in my previous post that would produce DEBUG logging output may shed more light on the difference in the logging output when validation is done versus when the processor is running (started). So that is probably the best place to start.
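For reference, DEBUG output for a specific logger is enabled in NiFi's conf/logback.xml; a minimal sketch (the logger name below is only a placeholder, substitute the three loggers from the earlier reply):

```xml
<!-- Added inside conf/logback.xml; logback is typically configured to rescan
     this file periodically, so no restart should be needed -->
<logger name="org.apache.nifi.some.package.SomeClass" level="DEBUG"/>
```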
02-13-2023
07:06 AM
@Aprendizado The GetMongo processor (I assume this is what you are using) utilizes a Mongo client library and is not something custom written in NiFi, so limiting returns needs to be something that client library supports. The good news is that the Mongo "limit" exposed by the processor should work for your use case (never tried this myself). Example of limiting a Mongo query based on date/time: https://stackoverflow.com/questions/8835757/return-query-based-on-date

Now, the GetMongo processor does support an inbound connection, which means a source FlowFile could be used to trigger each execution of the GetMongo processor. The "limit" property in the GetMongo processor also supports NiFi Expression Language (NEL), which means that the limit could be set dynamically on the source trigger FlowFile and passed to GetMongo on each execution. This means that after a successful run, you would need to extract from your Mongo results the date from which the next execution needs to start. You could write that date, for example, to a distributed map cache using the PutDistributedMapCache processor. Then, at the beginning of your dataflow, use GenerateFlowFile --> FetchDistributedMapCache --> UpdateAttribute --> GetMongo to retrieve the latest date that needs to be put in the limit configuration you pass to the GetMongo processor for the next execution. The GenerateFlowFile scheduling controls the execution of this flow, so configure cron to control how often it creates the FlowFile that triggers your dataflow.

Hopefully this gives you an idea of how you can accomplish your use case.

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click "Accept as Solution" below each response that helped.

Thank you, Matt
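As a rough sketch of how the dynamically passed values could look (the attribute names last.query.date and mongo.result.limit and the field name createdAt are hypothetical, set by your own FetchDistributedMapCache/UpdateAttribute steps, and this assumes the Query property in your NiFi version also evaluates Expression Language):

```
# Hypothetical GetMongo property values, evaluated per trigger FlowFile via
# NiFi Expression Language
Query : { "createdAt": { "$gte": { "$date": "${last.query.date}" } } }
Limit : ${mongo.result.limit}
```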