About MattWho

MattWho · 06-11-2018

@yazeed salem - Please correct me if my understanding of what you are asking is not correct: - There is no way to click on an existing connection with queued data and select/copy the queued FlowFiles from that connection to another connection. - If you want to join two different connections together, you can use a "funnel". With the processors on both ends of a connection stopped, you will be able to click on that connection and drag the small blue square at the destination end of the connection to another end-point, such as the funnel. This will cause any connections feeding into the funnel to be funneled into a common destination queue. - Thank you, Matt - When an "Answer" addresses/solves your question, please select "Accept" beneath that answer. This encourages user participation in this forum.

MattWho · 06-11-2018

@John T - Many factors go into determining the correct hardware configuration, most of which come from the size of your data, the amount of data, the specific processors used in your dataflow, and how they are configured. - What I can tell you is that running NiFi with a 300 GB configured heap is probably not the best idea. At a heap of that size, even partial garbage collection events could result in considerable stop-the-world times. - I would tend to lean more towards the smaller VMs to make better use of your hardware resources, but again that is based on very little knowledge of your specifics. It is best to stand up a small VM like you described and perform some load testing on your specific dataflows to determine how much data each VM could potentially process, then scale from there to determine how many nodes you will actually need. - Also keep in mind that the memory used by NiFi's dataflows can and often does extend beyond just heap. Make sure you do not allocate too much memory to each VM's heap, which may result in server issues due to insufficient memory for the OS and non-heap-related processing. (For example: think along the lines of OS-level socket queues, externally issued commands, and any scripting-based processors you may use.) - NiFi clusters in the range of 40 nodes are fairly uncommon; however, NiFi does not put a restriction on the number of nodes you can have in a cluster. Just make sure that as you increase the number of nodes you make adjustments to the cluster node properties to maintain cluster stability, most specifically nifi.cluster.node.connection.timeout (60 seconds or higher), nifi.cluster.node.read.timeout (60 seconds or higher), and nifi.cluster.protocol.heartbeat.interval (20 seconds or higher). - As far as GC goes, the default G1GC has proven to be a good performer.
I have heard of some rare corner cases that can cause stability issues, and users have resolved them by commenting out the G1GC line in NiFi's bootstrap.conf file and just going with the default GC in the latest versions of Java. - Hope this information was useful for you, Matt
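The three cluster properties named above are set in nifi.properties. As a rough illustration only, the Python sketch below reads properties text and reports any of those settings that fall below the suggested floors. The helper names (`parse_seconds`, `check_cluster_settings`), the sample values, and the short list of accepted duration units are assumptions made for this example, not part of NiFi.

```python
import re

# Suggested minimums from the post above, in seconds.
RECOMMENDED_MINIMUMS = {
    "nifi.cluster.node.connection.timeout": 60,
    "nifi.cluster.node.read.timeout": 60,
    "nifi.cluster.protocol.heartbeat.interval": 20,
}

# Assumption: only these duration units appear in the file being checked.
UNIT_SECONDS = {
    "ms": 0.001, "millis": 0.001,
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "mins": 60, "minute": 60, "minutes": 60,
}


def parse_seconds(value):
    """Convert a duration such as '60 secs' or '1 min' to seconds."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", value)
    if not match or match.group(2).lower() not in UNIT_SECONDS:
        raise ValueError(f"unrecognized duration: {value!r}")
    return float(match.group(1)) * UNIT_SECONDS[match.group(2).lower()]


def check_cluster_settings(properties_text):
    """Return {property: (configured_seconds, minimum)} for values below the minimum."""
    configured = {}
    for line in properties_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        configured[key.strip()] = value.strip()

    too_low = {}
    for key, minimum in RECOMMENDED_MINIMUMS.items():
        if configured.get(key):
            seconds = parse_seconds(configured[key])
            if seconds < minimum:
                too_low[key] = (seconds, minimum)
    return too_low


# Hypothetical excerpt of a nifi.properties file.
sample = """
nifi.cluster.node.connection.timeout=5 secs
nifi.cluster.node.read.timeout=60 secs
nifi.cluster.protocol.heartbeat.interval=5 sec
"""
for name, (actual, minimum) in check_cluster_settings(sample).items():
    print(f"{name}: {actual:g}s is below the suggested {minimum}s")
```

A check like this could be run against each node's nifi.properties before adding nodes to a large cluster; the thresholds are only the starting points suggested above and may need to be raised further.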

MattWho · 06-08-2018

https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html - NiFi even provides a toolkit you can use to create your own certificates/keystores for each of your NiFi nodes. - Matt

MattWho · 06-07-2018

@Bhushan Kandalkar I was afraid of that. Ranger does not allow wildcards in the user names. From a security standpoint it is generally a bad idea to create a server certificate that uses wildcards. In order to use Ranger as your authorizer, you are going to need to create new NiFi node certificates/keystores that do not use wildcards in the "Owner" DN. - This means you will have a unique keystore for each of your NiFi nodes (which is a security best practice). You will then need to authorize each of those nodes with /proxy. - Thanks, Matt
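To make the "one identity per node" point concrete, here is a small Python sketch that builds a non-wildcard DN for each node and lists the identities that would each need the /proxy authorization. The function name `node_identities`, the hostnames, and the "CN=<host>, OU=NIFI" layout are assumptions for illustration; the string authorized in Ranger must match the Owner DN in each node's actual certificate exactly.

```python
def node_identities(hostnames, ou="NIFI"):
    """Build one certificate DN per node, refusing wildcard hostnames."""
    identities = []
    for host in hostnames:
        if "*" in host:
            raise ValueError(f"wildcard hostname not allowed: {host}")
        identities.append(f"CN={host}, OU={ou}")
    return identities


# Hypothetical hostnames for a three-node cluster.
for dn in node_identities(["nifi1.test.com", "nifi2.test.com", "nifi3.test.com"]):
    print(f"authorize for /proxy: {dn}")
```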

MattWho · 06-07-2018

@Bhushan Kandalkar That is correct.

MattWho · 06-07-2018

@Bhushan Kandalkar - Once the Ranger plugin is enabled, the authorizations.xml file is no longer used to determine what authorizations users and NiFi nodes have. In a NiFi cluster each node must be authorized to act as a proxy so that requests made by users logged in to any one of the nodes' UIs can be replicated to the other nodes. This means that you will need to set an authorization policy in Ranger that authorizes "CN=*.test.com, OU=NIFI" against the "/proxy" policy. - Thank you, Matt

MattWho · 06-07-2018

@Henrik Olsen FetchSFTP will make a separate connection for each file being retrieved. Concurrent Tasks allows you to specify the number of concurrent connections, so that more than one file can be retrieved per processor execution schedule (still one file per connection). - Yes, HDF 3.1 will have all these goodies. I suggest skipping directly to HDF 3.1.2, which was just released, since it has a lot of fixes for some annoying bugs in HDF 3.1 and 3.1.1. - You will have the option to use either an external Redis configured how you like or an internal NiFi DistributedMapCacheServer with the Wait and Notify processors. - The DistributedMapCacheServer has its own configuration properties, but there is no TTL setting for the DistributedMapCacheServer option. - There also isn't a processor that will dump out the current contents of the DistributedMapCacheServer, but you should be able to write a script that can do that for you. Here is an example script that is used to remove a cached entry: https://gist.github.com/ijokarumawak/14d560fec5a052b3a157b38a11955772 - I do not know a lot about Redis, but as an externally managed cache service, it will probably give you a lot more options as well as a cluster capability, so you don't have a single point of failure like you would have with the DistributedMapCacheServer. - Thank you, Matt

MattWho · 06-05-2018

@Henrik Olsen The GetSFTP processor is deprecated in favor of the ListSFTP/FetchSFTP processors. The list/fetch model works better in a NiFi cluster type configuration. Both the GetSFTP and ListSFTP processors should only ever be run on "primary node" only when used in a NiFi cluster. FetchSFTP should be configured to run on all nodes. - That being said, GetSFTP will retrieve up to the configured "Max Selects" in a single connection. ListSFTP will return the filenames of all files in a single connection. (The 0-byte FlowFiles generated by ListSFTP should be routed to a Remote Process Group that will redistribute them to all nodes in the cluster, where FetchSFTP will retrieve the actual content.) - Regardless of how you retrieve the files, you are looking for a way to only process those files where you also retrieved the corresponding sha256 file. This can be accomplished using the Wait and Notify processors: In my example flow I have all the retrieved data (both <datafile> and <datafile>.sha256 files) coming in to a RouteOnAttribute processor. I route all the <datafile>.sha256 FlowFiles to a Notify processor. (In my test I had 20 <datafile> files and only 16 corresponding <datafile>.sha256 files.) The Notify processor is configured to write the ${filename} to a DistributedMapCache service that every node in my cluster can access. My Wait processor is then configured to check that same DistributedMapCache service looking for "${filename}.sha256". If a match is found, the Wait processor will release the <datafile> to the success relationship for further processing in your dataflow. The Wait processor is also configured to wait only so long looking for a match. In my example, after that length of time, the 4 FlowFiles that did not have a matching sha256 filename in the cache were routed to the "expired" relationship. Set the expiration high enough to allow for the time needed to retrieve both files.
- Thank you, Matt - If you found this Answer addressed your original question, please take a moment to login and click "Accept" below the answer.
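The Wait/Notify pairing described above can be modeled in a few lines of Python. This is only a simplified sketch of the matching rule, not how NiFi implements it: a plain set stands in for the DistributedMapCache, the function name `pair_with_checksums` and the sample filenames are hypothetical, and the passage of time is collapsed so that "expired" simply means no signal ever arrived.

```python
def pair_with_checksums(filenames):
    """Split data files into those with a matching '<name>.sha256' and those without."""
    # Notify step: every *.sha256 filename is written to the shared cache as a key.
    cache = {name for name in filenames if name.endswith(".sha256")}

    released, expired = [], []
    for name in filenames:
        if name.endswith(".sha256"):
            continue  # checksum files only act as signals
        # Wait step: look for "<filename>.sha256" in the cache.
        if f"{name}.sha256" in cache:
            released.append(name)   # would go to "success"
        else:
            expired.append(name)    # would go to "expired" after the timeout
    return released, expired


# Hypothetical listing: b.csv has no checksum file.
files = ["a.csv", "a.csv.sha256", "b.csv", "c.csv", "c.csv.sha256"]
released, expired = pair_with_checksums(files)
print("released:", released)
print("expired:", expired)
```

In a real flow the data file and its checksum file may arrive in either order, which is why the Wait processor's expiration needs to cover the time required to retrieve both.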

MattWho · 06-05-2018

@Artem Anokhin If you found this Answer addressed your original question, please take a moment to login and click "Accept" below the answer. *** Forum tip: Please try to avoid responding to an Answer by starting a new answer. Instead use "add comment" to respond to an existing answer. There is no guaranteed order to different answers, which can make following a response thread difficult, especially when multiple people are trying to assist you.

MattWho · 06-05-2018

@Shu @Raja M I just want to correct one thing. There is no default prioritizer when none are selected. While the "OldestFlowFileFirstPrioritizer" may appear in many cases to be the default behavior, that is purely coincidental. By default the order in which FlowFiles are processed from a queue is performance based. This means FlowFiles are processed in an order that best makes use of disk performance to minimize disk seeks. (So think of this as processed in the order written to disk.) In many cases this acts like OldestFlowFileFirst, but that can change if FlowFiles in a connection come from multiple source flows. - Enforcing the order of FlowFile processing in NiFi can be challenging. Some processors work on batches of FlowFiles while others work on one FlowFile at a time. FlowFiles routed down different paths are processed with no consideration of FlowFiles processed down a different path. Concurrent tasks on processors allow for concurrent execution of a processor (each task works on its own FlowFile), with some FlowFiles being processed faster than others, making them complete out of order. Some processors may fail to complete a task for one reason or another in normal operations. (For example: FetchFile is retrieving content and a network issue causes the connection to drop. The FlowFile is penalized and routed to the "failure" relationship. FetchFile moves on to the next FlowFile and retries the failed FlowFile only if failure is looped back and once the penalty expires. Now these FlowFiles are out of order.) - NiFi was designed for speed at its core, with the intent that each processor works on the FlowFiles it received without needing to care about other FlowFiles in any other queues. - There are a few processors that may be used in your dataflow design to help achieve this goal. Keep in mind that any enforcement of order is going to affect the throughput of your NiFi because of the overhead introduced in doing so.
You will want to take a look at the following processors: 1. EnforceOrder: This processor works well for FlowFiles that carry a numeric order, which timestamps are not going to provide. - 2. Wait and Notify: These allow you to enforce the processing of one FlowFile at a time, in order. After listing your files, you would feed the FlowFiles to a Wait processor. This processor would release one FlowFile into the rest of your dataflow (FetchFile, etc.) and finally to the Notify processor once processing of that FlowFile was successful. The Notify would then trigger the Wait processor to release the next FlowFile. (Set the OldestFlowFileFirst prioritizer on the connection between the ListFile and Wait processors.) - Thank you, Matt
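The idea behind ordering by a numeric sequence can be sketched as a small reorder buffer in Python. This illustrates the general technique only and is not the EnforceOrder processor's implementation; the function name `release_in_order`, the `first` parameter, and the sample data are hypothetical.

```python
def release_in_order(arrivals, first=1):
    """Yield items in strict numeric order, buffering any that arrive early."""
    pending = {}          # sequence number -> item held back
    expected = first      # next sequence number allowed through
    for seq, item in arrivals:
        pending[seq] = item
        # Release as many consecutive items as are now available.
        while expected in pending:
            yield expected, pending.pop(expected)
            expected += 1


# Items arrive out of order, as they might after concurrent tasks or a retry.
arrivals = [(2, "b"), (1, "a"), (4, "d"), (3, "c")]
for seq, item in release_in_order(arrivals):
    print(seq, item)
```

Note the cost this makes visible: item 2 cannot move until item 1 shows up, so a single slow or failed item holds back everything behind it. That is the throughput trade-off mentioned above.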

Last Visited	07-09-2026 07:26 AM

Member Since	07-30-2019 10:41 AM
Posts	3,472
Kudos received	1638

Cloudera Community

Re: ListenNetFlow processor does not decode Cisco ...

Re: Can we detect who did a particular operation i...

Re: How to invoke a url in nifi which is protected...

Re: Retry impacts scheduler

Re: 503 error while copying/versioning big process...

Re: Move FlowFiles between Queues/Relationships

Re: Best NiFi Heap usage performance for Large Ser...

Re: Nifi Integration with Ranger Not Working

Re: Nifi Integration with Ranger Not Working

Re: Nifi Integration with Ranger Not Working

Re: Nifi Integration with Ranger Not Working

Re: FetchSFTP and reuse of connection

Re: FetchSFTP and reuse of connection

Re: Is there a way to group nodes in a cluster and...

Re: ListFile to list all the files sorted by date...