Member since: 07-30-2019
Posts: 3406
Kudos Received: 1623
Solutions: 1008

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 336 | 12-17-2025 05:55 AM |
| | 397 | 12-15-2025 01:29 PM |
| | 393 | 12-15-2025 06:50 AM |
| | 361 | 12-05-2025 08:25 AM |
| | 603 | 12-03-2025 10:21 AM |
03-01-2023
09:44 AM
1 Kudo
@bmoisson @Sumit6620 When you authenticate via NiFi, both a client JWT token and a server-side key are generated on the node on which the authentication was performed. That client JWT token can then be used to make rest-api calls, for which that client is authorized, against that node only. When you obtain your JWT token from an external authentication endpoint, NiFi won't have the server-side key needed to validate that token and thus rejects it. You can find the various methods of authentication that can be configured in Apache NiFi here: https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#user_authentication

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
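As an illustration of that node affinity, here is a minimal Python sketch (not from the post above) that obtains a JWT from one node's /access/token endpoint and reuses it against that same node; the host, credentials, and truststore path are placeholders:

```python
import requests

NODE = "https://nifi-node1.example.com:8443/nifi-api"  # hypothetical node
CA = "/path/to/ca.pem"                                 # placeholder truststore

# Authenticate against ONE node; the server-side key that can validate
# the returned JWT lives only on this node.
resp = requests.post(
    f"{NODE}/access/token",
    data={"username": "monitor", "password": "secret"},  # placeholders
    verify=CA,
)
resp.raise_for_status()
token = resp.text  # the JWT comes back as plain text

# Reuse the SAME node's base URL for subsequent rest-api calls; a different
# node in the cluster would reject this token.
status = requests.get(
    f"{NODE}/flow/status",
    headers={"Authorization": f"Bearer {token}"},
    verify=CA,
)
print(status.json())
```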
03-01-2023
09:13 AM
@Girish007 Did you make sure that the NiFi directories and repository directories are excluded from any virus-scanning software? That is typically the external force most likely to be making changes to these files. Do you have any other external software or processes scanning or accessing these directories? Thanks, Matt
02-28-2023
11:28 AM
1 Kudo
@TRSS_Cloudera The issue you have described matches this known issue reported in Apache NiFi: https://issues.apache.org/jira/browse/NIFI-10792 The discussion found in the comments of this jira points to a couple of workarounds, including the drawbacks of each. From that discussion, it appears the best approach is the development of a new "Excel Record Reader" controller service that could be used by the existing ConvertRecord processor and CSVRecordSetWriter. This is outlined in the following jira: https://issues.apache.org/jira/browse/NIFI-11167

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
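For comparison while that jira is pending, here is a hedged Python sketch (not the NiFi fix) of the same Excel-to-CSV conversion done outside NiFi with pandas; the file names are placeholders, and the openpyxl engine is assumed to be installed:

```python
import pandas as pd

# Read the first sheet of a hypothetical workbook and write it out as CSV,
# roughly what an Excel Record Reader + CSVRecordSetWriter pair would do.
df = pd.read_excel("input.xlsx", sheet_name=0, engine="openpyxl")
df.to_csv("output.csv", index=False)
```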
02-27-2023
06:08 AM
@memad ListSFTP does not actually fetch the content of any files from the target SFTP server; you would need a FetchSFTP processor after ListSFTP to do that. The ListSFTP processor produces FlowFile(s) with metadata about each file listed from the SFTP server. This metadata is then used by the downstream FetchSFTP processor to retrieve the actual content for each FlowFile. The documentation for the ListSFTP processor covers the attributes that are written to the FlowFile(s): https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.19.1/org.apache.nifi.processors.standard.ListSFTP/index.html This metadata is present on the FlowFile as FlowFile attributes, and you can manipulate and do anything you like with it.

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
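As a loose analogy (not NiFi code), this Python/paramiko sketch shows the same list-then-fetch split: the listing call returns only metadata, and a separate call retrieves the bytes. Host, credentials, and paths are placeholders:

```python
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("sftp.example.com", username="user", password="secret")  # placeholders
sftp = ssh.open_sftp()

# "ListSFTP" step: metadata only (name, size, modified time) -- no content.
listing = sftp.listdir_attr("/incoming")
for entry in listing:
    print(entry.filename, entry.st_size, entry.st_mtime)

# "FetchSFTP" step: a separate call, driven by that metadata, actually
# retrieves each file's content.
for entry in listing:
    sftp.get(f"/incoming/{entry.filename}", f"/tmp/{entry.filename}")

sftp.close()
ssh.close()
```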
02-21-2023
09:20 AM
@PurpleK It is not clear what you mean when you say "Files that are in the 500GB+ range are taking several hours to move onto the unpack stage.". No FlowFile(s) are released to a downstream connection until processing of the source FlowFile is complete, so the source file will still be represented in the queued count of the connection feeding a processor even while that processor is executing on that FlowFile. When you say moving on to the unpack stage, are you referring to some upstream processor feeding the connection to the UnpackContent processor taking a while to queue a FlowFile on that downstream connection, or are you referring to a queued file taking a while for UnpackContent to finish executing on it, creating the unpacked FlowFiles and then removing the original zip from the upstream connection queue? Step 1 is to identify the exact place(s) it is slow.

Adding additional concurrent tasks to a processor has no impact on speeding up the execution on a specific source FlowFile. One thread gets assigned to each execution of the processor, and in the case of UnpackContent, each thread executes against one FlowFile from the upstream connection. Adding multiple concurrent tasks will allow multiple upstream FlowFiles to be processed concurrently. IMPORTANT: Increment concurrent tasks slowly while monitoring CPU load averages. Adding too many concurrent tasks on any one processor can impact other processors in your dataflow.

The Event Driven scheduling strategy is deprecated, should not be used, and will eventually go away (most likely in the next major release). So increasing the Max Event Driven Thread Count under controller settings will have no impact unless you are using that strategy in your flow. It does create event threads, but they will not consume CPU if you are not using event-driven scheduling anywhere in your dataflow(s).

NiFi is a data-agnostic service, meaning it can handle any data type in its raw binary format. NiFi can do this because it wraps that binary content in a NiFi FlowFile. A NiFi FlowFile is what you see moving from processor to processor in your dataflows, and it becomes the responsibility of the processor to understand the FlowFile's content should it need to read it. I bring this up because each FlowFile adds a small bit of overhead, since FlowFile metadata must be generated for every FlowFile created. When it comes to your 500GB+ zip files:

1. Do they consist of many small and/or large files? NiFi must create a FlowFile for each file that results from unpacking the original zip.
2. Do you see a lot of Java Garbage Collection (GC) pauses happening? GC is normal operation for any JVM, but all GC is stop-the-world, so if it happens very often it can impact flow performance with constant pauses. The larger the JVM memory, the longer each stop-the-world event will be.
3. Any exceptions in your nifi-app.log?

You may also find this article helpful; it is old, but the majority of the guidance is still valid. The latest NiFi versions support Java 8 and Java 11, so you can ignore the G1GC recommendations if you are using Java 11. https://community.cloudera.com/t5/Community-Articles/HDF-CFM-NIFI-Best-practices-for-setting-up-a-high/ta-p/244999

Hopefully adding concurrent tasks to your processor(s) executing against the content of large FlowFiles will help you better utilize your hardware and achieve better overall throughput (see the sketch below).
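As a loose illustration (not NiFi code), here is a Python sketch of why concurrent tasks speed things up across files but not within one file; the zip file names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
import zipfile

def unpack(zip_path: str, dest: str) -> int:
    """One 'concurrent task': a single thread unpacks one whole archive."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return len(zf.namelist())

queued = ["a.zip", "b.zip", "c.zip"]  # hypothetical queued "FlowFiles"

# Three workers put three archives in flight at once, but each individual
# archive is still unpacked by exactly one thread, so a single 500GB+ zip
# gets no faster no matter how many workers are added.
with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(lambda p: unpack(p, "out"), queued))
print(counts)
```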
Keep in mind that concurrent tasks only allow concurrent execution on multiple source FlowFiles; they will not improve the speed at which a single FlowFile is processed by a given processor. If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
02-13-2023
09:23 AM
Not ruling out something environmental here, but what is being observed is validation working while processor execution does not, even though both should be exercising the same basic code. The three loggers suggested in my previous post that produce DEBUG output may shed more light on the difference in logging between when validation is performed and when the processor is running (started). So that is probably the best place to start.
02-13-2023
07:19 AM
1 Kudo
@lben If you saw a bulletin on the processor reporting a failure in execution, that should also be in the nifi-app.log. You can also modify the logback.xml to change the log level of NiFi, or even just of the ListSFTP processor class, to hopefully capture more detail on the failure. Does SFTP to the target server work from the command line as the NiFi service user? SFTP is file transfer over SSH, and yes, SFTP servers can be configured to only allow SFTP connections. To get more logging out of the ListSFTP processor class, you could add these loggers to the area where all the other loggers appear in the NiFi logback.xml:

<logger name="org.apache.nifi.processors.standard.ListSFTP" level="DEBUG"/>
<logger name="net.schmizz.sshj" level="DEBUG"/>
<logger name="com.hierynomus.sshj" level="DEBUG"/>

Thanks, Matt
02-13-2023
07:06 AM
@Aprendizado The GetMongo processor (I assume this is what you are using) utilizes a Mongo client library rather than something custom written in NiFi, so limiting returns needs to be something that client library supports. The good news is that the Mongo "limit" exposed by the processor should work for your use case (never tried this myself). Example of building a Mongo query based on time: https://stackoverflow.com/questions/8835757/return-query-based-on-date

The GetMongo processor also supports an inbound connection, which means a source FlowFile can be used to trigger each execution of the GetMongo processor. The "limit" property in the GetMongo processor also supports NiFi Expression Language (NEL), which means the limit could be set dynamically from the source trigger FlowFile and passed to GetMongo on each execution. So after a successful run, you would extract from your Mongo results the date from which the next execution needs to start. You could write that date, for example, to a distributed map cache using the PutDistributedMapCache processor. Then at the beginning of your dataflow, use GenerateFlowFile --> FetchDistributedMapCache --> UpdateAttribute --> GetMongo to retrieve the latest date that needs to go into the limits configuration you pass to the GetMongo processor for the next execution. The GenerateFlowFile scheduling controls the execution of this flow, so configure cron to control how often it creates the FlowFile that triggers your dataflow.

Hopefully this gives you an idea of how you can accomplish your use case. If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
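For illustration, here is a hedged pymongo sketch of the kind of date-bounded, limited query the underlying Mongo client library would run; the connection URI, database, collection, and field names are all hypothetical:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection URI
coll = client["mydb"]["events"]                    # hypothetical db/collection

# The date saved after the previous run (in the NiFi flow this would come
# back from FetchDistributedMapCache and be injected via NEL).
last_run = datetime(2023, 2, 1, tzinfo=timezone.utc)

# Only return documents newer than the last successful run, capped at 100.
for doc in coll.find({"createdAt": {"$gt": last_run}}).limit(100):
    print(doc["_id"])
```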
02-13-2023
06:45 AM
@jricogar Why not use the ListHDFS processor? It retains state so that the same HDFS files do not get listed multiple times. Just trying to understand your use case for using FetchHDFS without the ListHDFS processor. Thanks, Matt
02-13-2023
06:39 AM
@JohnF The NiFi Resource Identifier "/resources" exists so that third-party authorizers like Apache Ranger can retrieve a list of all current NiFi Resource Identifiers (that returned list will change anytime a new component is added in NiFi). In a NiFi setup using a local authorization provider (file-access-policy-provider), this NiFi Resource Identifier would not need to be used: NiFi's own UI for setting up policies is already aware of all of them, so there is no need for it to be exposed. When using some external authorizer, it is that authorizer that provides the authorizations NiFi needs, and within that external authorizer the "/resources" NiFi Resource Identifier could be authorized if it wanted that listing, which facilitates easier policy implementation by presenting the list of Identifiers to the end user.

If you found that the provided solution(s) assisted you with your query, please take a moment to login and click Accept as Solution below each response that helped. Thank you, Matt
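To make the listing concrete, here is a hedged Python sketch of a client authorized for "/resources" retrieving the identifiers over the rest-api; the host, token, truststore path, and exact response fields are assumptions:

```python
import requests

NIFI = "https://nifi.example.com:8443/nifi-api"  # hypothetical host
token = "<JWT obtained from /access/token>"      # placeholder token

resp = requests.get(
    f"{NIFI}/resources",
    headers={"Authorization": f"Bearer {token}"},
    verify="/path/to/ca.pem",  # placeholder truststore
)
resp.raise_for_status()

# Each entry pairs a Resource Identifier with a human-readable name, the
# list an external authorizer would present when building policies.
for resource in resp.json()["resources"]:
    print(resource["identifier"], "-", resource["name"])
```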