About MattWho

MattWho · ‎06-04-2018

@Artem Anokhin No matter which host URL(s) you use in the configuration of a Remote Process Group (RPG), the RPG will end up retrieving site-to-site (S2S) details that include all the currently connected nodes in the target cluster.

Included in those S2S details are things like:
1. Hostname of each node - defined by "nifi.remote.input.host=" configured on each node.
2. If "Raw" transport protocol is supported - defined by the property "nifi.remote.input.socket.port=" being set.
3. If "HTTP" transport protocol is supported - defined by the property "nifi.remote.input.http.enabled=" being set to true or false.
4. If the S2S connection is secure - defined by "nifi.remote.input.secure=" being set to true or false.

There is no way to create "node groups" that would only be returned to a source NiFi during the retrieve S2S details phase of communications.

Thank you, Matt
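The four properties listed in the post above can be read out of a node's nifi.properties to see what that node would advertise. The following is only a minimal sketch of that mapping, not NiFi's own code: the helper names `parse_properties` and `s2s_details`, the hostname, and the port value are all hypothetical.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def s2s_details(props):
    """Summarize the S2S details described in the post for one node."""
    return {
        "hostname": props.get("nifi.remote.input.host", ""),
        # Raw transport counts as supported when the socket port is set.
        "raw_supported": props.get("nifi.remote.input.socket.port", "") != "",
        "http_supported": props.get("nifi.remote.input.http.enabled", "").lower() == "true",
        "secure": props.get("nifi.remote.input.secure", "").lower() == "true",
    }


# Hypothetical values for a single node (not taken from the original thread).
EXAMPLE = """
nifi.remote.input.host=nifi-node1.example.com
nifi.remote.input.secure=true
nifi.remote.input.socket.port=10443
nifi.remote.input.http.enabled=false
"""

details = s2s_details(parse_properties(EXAMPLE))
print(details)
```

Running the same check against each node's file shows what an RPG would learn about every node in the cluster, whichever host URL it was given.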

MattWho · ‎05-30-2018

@Siddharth Sa From your image it appears that you are auto-terminating the failure relationship on your PutSQL processor?

Assuming the misconfiguration in your UpdateAttribute processor resulted in the failure of every FlowFile passed to PutSQL, those FlowFiles would all have been routed to the failure relationship of the PutSQL processor. It is rare that a user would auto-terminate a "failure" relationship, as it means data is being deleted. A more typical design is to route "failure" relationships to a dumb processor that is not enabled (like an UpdateAttribute processor or even a funnel). This would have allowed you to redirect the connection containing that failure relationship back to your fixed UpdateAttribute processor, resulting in all the failed data being reprocessed.

NiFi does, if enabled, archive FlowFile content based on configured thresholds. It is possible to perform a provenance search on the FlowFiles with a "DROP" event recorded by the PutSQL processor. The drop event would occur for each FlowFile routed to failure and deleted by PutSQL. While not elegant, you may be able to select each failed FlowFile one by one, open the lineage, and replay the FlowFile at the "UpdateAttribute" point in the lineage history. You would control the sequence of processing by the order in which you replay each FlowFile. There is no bulk replay capability.

Thank you, Matt

MattWho · ‎05-22-2018

@Dilip Namdev The fact that every "archive" sub-directory is empty leads me to believe that archive is in fact working correctly. NiFi stores FlowFile content in claims within the content repository. One claim may contain one to many FlowFiles. All it takes is one FlowFile still being active in one of your dataflows (queued in some NiFi connection) to hold up an entire content claim. A content claim cannot be moved to archive unless all active FlowFiles referencing that claim are complete (meaning they have reached a point of termination in your dataflow).

The following article explains this in more detail:
https://community.hortonworks.com/articles/82308/understanding-how-nifis-content-repository-archivi.html

Aside from the above, NiFi opens a lot of file handles. Having insufficient file handles can cause issues with the creation of new files. This may affect proper cleanup of both the FlowFile and content repositories. I suggest making sure the user that owns your NiFi process has a high number of open file handles available to it.

Thanks, Matt

If you found this answer addressed your question, please take a moment to log in and click "accept" below the answer.
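The claim behavior described in the post above can be pictured with a small model. This is a conceptual sketch only, not NiFi's implementation: the `ContentClaim` class and its method names are hypothetical.

```python
class ContentClaim:
    """Toy model of one content claim shared by several FlowFiles."""

    def __init__(self, claim_id):
        self.claim_id = claim_id
        self.active_flowfiles = set()

    def add_flowfile(self, flowfile_id):
        # A FlowFile whose content lives in this claim becomes active.
        self.active_flowfiles.add(flowfile_id)

    def complete_flowfile(self, flowfile_id):
        # The FlowFile reached a point of termination in the dataflow.
        self.active_flowfiles.discard(flowfile_id)

    def can_archive(self):
        # The claim is eligible only when nothing references it any more.
        return not self.active_flowfiles


claim = ContentClaim("claim-1")
for ff in ("ff-1", "ff-2", "ff-3"):
    claim.add_flowfile(ff)

claim.complete_flowfile("ff-1")
claim.complete_flowfile("ff-2")
print(claim.can_archive())  # one FlowFile is still queued somewhere

claim.complete_flowfile("ff-3")
print(claim.can_archive())
```

A single FlowFile left sitting in a connection is enough to keep the whole claim, and the disk space behind it, out of the archive.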

MattWho · ‎05-18-2018

@Sharoon Babu NiFi processors like these execute against FlowFiles on inbound connections to the processor. The FlowFile is only removed from the inbound connection when that code execution results in the FlowFile being transitioned to an outbound connection.

There are two types of scenarios here:

1. NiFi is shut down or dies in the middle of a processor's execution. This means the FlowFile was never transferred to an outbound connection. When NiFi is restarted, NiFi will reload FlowFiles into the last connection they were recorded as belonging to. In this case that would be an inbound connection. The consuming processor of that connection will then be scheduled to run/execute again. Processors do not record any intermediate phase of processing and thus will begin executing against the entire FlowFile again.

2. Some network failure results in execution not being able to complete. NiFi processors should acknowledge failures in such cases, which would result in the FlowFile(s) being moved from the inbound connection to an outbound connection (like a "failure" relationship). It is the responsibility of the dataflow designer to account for such unexpected failures and route those outbound failure relationships accordingly. Oftentimes failure type relationships may just be looped back on the same processor for retry. Wherever this FlowFile is routed (even if in a loop), execution will again be against the entire FlowFile's content.

The target systems should handle such scenarios and not accept unconfirmed file transfers.

For example: PutFile will write the file using a "dot" rename strategy. The FlowFile's content is originally written as ".<filename>" and then, upon successful completion of writing the data, the file is renamed from ".<filename>" to just "<filename>". Since dot files are in most cases considered hidden files, an incomplete transfer would be ignored by the destination system. Upon recovery and re-attempt (depending on processor configuration) NiFi will repeat this process.

There are some unavoidable scenarios that at times can lead to some data duplication. Considering NiFi's design architecture, NiFi has always favored data duplication over data loss in such rare scenarios.

Thank you, Matt

If you found this answer has addressed your question, please take a moment to log in and click the "accept" link on the answer.
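The "dot" rename strategy mentioned for PutFile can be illustrated outside of NiFi. Below is a minimal sketch of the general technique, assuming a local filesystem; the function names `write_with_dot_rename` and `visible_files` are hypothetical and this is not PutFile's actual code.

```python
import os
import tempfile


def write_with_dot_rename(directory, filename, data):
    """Write to a hidden '.<filename>' first, then rename to '<filename>'."""
    hidden_path = os.path.join(directory, "." + filename)
    final_path = os.path.join(directory, filename)
    with open(hidden_path, "wb") as handle:
        handle.write(data)
        handle.flush()
        os.fsync(handle.fileno())  # make sure the bytes reached the disk
    # The final name only appears once the write has fully completed.
    os.replace(hidden_path, final_path)
    return final_path


def visible_files(directory):
    """List files the way a consumer that ignores dot files would see them."""
    return sorted(name for name in os.listdir(directory) if not name.startswith("."))


with tempfile.TemporaryDirectory() as target:
    write_with_dot_rename(target, "data.txt", b"hello")
    print(visible_files(target))
```

If the writer dies before the rename, only the hidden file is left behind, so a consumer that skips dot files never picks up a partial transfer, and a retry simply writes the whole file again.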

MattWho · ‎05-17-2018

@Takefumi OIDE Additional performance and best practice recommendations:
https://community.hortonworks.com/articles/184990/dissecting-the-nifi-connection-heap-usage-and-perf.html
https://community.hortonworks.com/articles/184786/hdfnifi-improving-the-performance-of-your-ui.html
https://community.hortonworks.com/content/kbentry/109629/how-to-achieve-better-load-balancing-using-nifis-s.html

And just for knowledge relevant to NiFi content handling:
https://community.hortonworks.com/articles/82308/understanding-how-nifis-content-repository-archivi.html

MattWho · ‎05-14-2018

@Tarek Elgamal I assume you are referring to the "Max Timer Driven Thread Count" setting? That setting controls the maximum number of threads that can execute at one time. It does not guarantee any order to the execution of threads. NiFi's controller in the background does not operate under this thread pool.

Both processors will be scheduled to run based on their configured run schedule. Those concurrent tasks then get stacked in a request queue, waiting on one of the threads from that pool to service them. This way, every processor is eventually going to get a chance to run its code.

Also keep in mind that some processors work on batches of FlowFiles while others process one FlowFile per task. It is also hard to say that each processed FlowFile will take the same amount of time to complete an operation. It really depends on the processor and what it is designed to do.

Thanks, Matt
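The queue-behind-a-fixed-pool behavior described in the post above is the same pattern as a bounded thread pool. The sketch below uses Python's `ThreadPoolExecutor` as a stand-in; it is an analogy, not NiFi's scheduler, and `run_tasks` with its parameters is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time


def run_tasks(max_threads, task_count, work_seconds=0.01):
    """Submit task_count tasks to a pool of max_threads workers.

    Returns (peak number of tasks running at once, tasks completed).
    """
    lock = threading.Lock()
    state = {"running": 0, "peak": 0, "completed": 0}

    def task():
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(work_seconds)  # stand-in for a processor doing its work
        with lock:
            state["running"] -= 1
            state["completed"] += 1

    # Tasks beyond max_threads wait in the pool's queue for a free thread.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        for _ in range(task_count):
            pool.submit(task)
    return state["peak"], state["completed"]


peak, completed = run_tasks(max_threads=2, task_count=10)
print(peak, completed)
```

The pool size caps how many tasks run at the same moment, yet every submitted task is eventually serviced, which matches the point that each processor gets its turn.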

MattWho · ‎05-07-2018

@John T The ListenHTTP processor works just like any one of our other Listen based processors. This processor should be configured to run on every node. That way every node can receive data. The Listen based processors are configured to listen on a specific port, so the endpoint for a ListenHTTP would be something like:

http(s)://<hostname>:<listenerport>/<base path>

You could have an external load-balancer that is configured to receive all your inbound traffic and load-balance it across all the nodes in the NiFi cluster.

You could also install NiFi or MiNiFi at each of your data sources and use NiFi's Site-To-Site (S2S) protocol to load-balance the delivery of FlowFiles to this target cluster.

Listen based processors are not ideal for the Listen (primary node) --> RPG (S2S) --> input port (all nodes) --> rest of dataflow model. That is because the Listen based processors receive the entire payload. This means your primary node has to handle a lot of writes to the content repo (all data) before then sending that data across the network to the other nodes (redistribution). This can be an expensive waste of resources. That is why load-balancing with this type of processor is better done out front of NiFi.

Thanks, Matt
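To make the endpoint pattern and the idea of spreading traffic over every node concrete, here is a small sketch that builds one ListenHTTP URL per node and cycles through them. The hostnames, port, and base path are hypothetical, and a real deployment would normally rely on an external load-balancer rather than client-side rotation.

```python
from itertools import cycle


def listen_http_url(hostname, port, base_path, secure=False):
    """Build http(s)://<hostname>:<listenerport>/<base path>."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{hostname}:{port}/{base_path.strip('/')}"


def round_robin_targets(hostnames, port, base_path, secure=False):
    """Return an endless iterator that rotates over every node's endpoint."""
    return cycle([listen_http_url(h, port, base_path, secure) for h in hostnames])


# Hypothetical three-node cluster with the listener on port 8081.
nodes = ["nifi-node1.example.com", "nifi-node2.example.com", "nifi-node3.example.com"]
targets = round_robin_targets(nodes, 8081, "contentListener")

for _ in range(4):
    print(next(targets))
```

Because the processor runs on every node, each URL is a valid target, and the fourth request wraps back around to the first node.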

MattWho · ‎05-07-2018

@John T NiFi is a very difficult thing to make a one-size-fits-all sizing recommendation for. NiFi does not typically scale linearly. This is why you see the hardware specs increase exponentially as throughput increases. This is based on the fact that, in most cases, typical NiFi workflows grow exponentially in size and complexity as the volume of throughput increases. More and more workflows are added.

Different NiFi processors in different workflows contribute to different server resource usage. That resource usage varies based on processor configuration and FlowFile volume. So even two workflows using the same processors may have different sizing needs.

How well a NiFi is going to perform has a lot to do with the workflow the user has built. After all, it is this user-designed workflow that is going to be using the majority of the resources on each node.

The best answer, to be honest, is to build your workflows and stress test them. This is a kind of modeling and simulation setup. Learn the boundaries your workflows put on your hardware. At what data volume does CPU utilization, network bandwidth, memory load, or disk IO become the bottleneck for your specific workflow(s)? Tweak your workflows and component configurations. Then scale out by adding more nodes, allowing some headroom, considering it is very unlikely every node will be processing the exact same number of NiFi FlowFiles all the time.

There are numerous ways to handle load-balancing. It really depends on your dataflow design choices and how you intend to get data into your NiFi. Keep in mind that each NiFi node in a cluster runs its own copy of the dataflows you build, has its own set of repositories, and thus works on its own set of FlowFiles.

While NiFi's listener type processors would benefit from an external load-balancer to direct incoming data across all nodes, processors like ConsumeKafka can run on all nodes consuming from the same topic (assuming a balanced number of Kafka partitions).

Other protocols like SFTP are not cluster friendly. So in dataflows like that you can have something like a ListSFTP processor running on only one node at any given time. To achieve load-balancing there, a flow typically looks like: ListSFTP (configured to run on primary node only) --> Remote Process Group (used to redistribute/load-balance 0 byte FlowFiles to the rest of the nodes) --> input port --> FetchSFTP (pulls content for each FlowFile).

One thing you do not want to do in most cases is load-balance the NiFi UI. You can do this, but you need to make sure you use sticky sessions in your load-balancer. The tokens issued for user authentication (LDAP or Kerberos) are only good for the node that issued them to the user, so subsequent requests must go to the same node.

Hope this gives you some direction.

Thanks, Matt
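The sticky-session requirement for the UI amounts to "the same user always lands on the same node". One way to picture it is a deterministic hash of a session key, as in the sketch below. This is only an illustration of the idea: real load-balancers usually implement stickiness with cookies or source-IP affinity, and `sticky_node` plus the node names are hypothetical.

```python
import hashlib


def sticky_node(session_key, nodes):
    """Map a session key to one node, always the same one for the same key."""
    digest = hashlib.sha256(session_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Hypothetical cluster nodes and user sessions.
nodes = ["nifi-node1", "nifi-node2", "nifi-node3"]

first = sticky_node("user-alice-session", nodes)
second = sticky_node("user-alice-session", nodes)
print(first == second)  # repeated requests for one session reach one node
```

Since a token is only valid on the node that issued it, any routing that could send the second request to a different node would force the user to authenticate again.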

MattWho · ‎05-04-2018

@Benjamin Bouret It is common for the Cluster Coordinator and Primary Node to change from time to time in a NiFi cluster, so you need to be careful, when designing flows that utilize processors running "Primary node" only, to make sure processing can still continue when a switch occurs.

I am going to assume, since you got duplicates here, that the local directory your GetFile processor is pointing at is mounted across all your NiFi nodes. In order to avoid duplicates you will need to use processors that support state. The GetFile processor is one of our original processors that was developed before state management was put in place. It has been deprecated in favor of the newer ListFile and FetchFile processors. The ListFile processor has the ability to store state either local to each node (not shared, for cases where each node is pulling from its own non-shared directory) or as cluster state (state is stored in ZooKeeper, where the same processor on every node has access to it). Cluster state here would allow you to run this processor against a mount shared by all your nodes in a "Primary node" only setup. If the primary node changes, the new primary node will start this processor and pull the last known recorded cluster state before performing a new listing. This should greatly reduce the likelihood of seeing duplicates.

NiFi will favor duplicate data over lost data, so there will still exist a small window of opportunity where duplication could occur. For example, the original primary node ingested data, but some network issue prevented the last state from being written to ZooKeeper. The new node would then not get the most current state, which may result in duplication.

The list/fetch processor model also allows you to spread the workload across your cluster more easily. A flow would consist of: ListFile (scheduled primary node only) --> Remote Process Group (configured to point back at the cluster to redistribute the listed files) --> FetchFile (running on all nodes to retrieve the content of listed files) --> rest of flow...

Thanks, Matt

If you found this answer addressed your question, please take a moment to log in to the forum and click "accept" on the answer.
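The role that stored state plays in avoiding duplicates can be shown with a simplified listing routine. This is a sketch of the general idea only: it tracks a single "last seen" timestamp, whereas the real ListFile processor keeps more detailed state, and `list_new_files` and the sample data are hypothetical.

```python
def list_new_files(files, state):
    """Return names of files newer than the stored state, then update it.

    files: dict mapping file name -> last modified timestamp
    state: dict standing in for local or cluster state
    """
    last_ts = state.get("last_ts", float("-inf"))
    new_names = sorted(name for name, ts in files.items() if ts > last_ts)
    if new_names:
        state["last_ts"] = max(files[name] for name in new_names)
    return new_names


shared_dir = {"a.csv": 100, "b.csv": 200}
cluster_state = {}

print(list_new_files(shared_dir, cluster_state))  # first listing sees both files

shared_dir["c.csv"] = 300
# A new primary node that reads the saved cluster state only lists c.csv.
print(list_new_files(shared_dir, cluster_state))

# If the state was never saved, the new node starts from nothing and
# lists everything again, which is where duplicates come from.
print(list_new_files(shared_dir, {}))
```

The last call reproduces the small duplication window described above: without the most recent state, files that were already ingested get listed a second time.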

MattWho · ‎04-26-2018

@Rahul Soni @Gillu Varghese The GenerateFlowFile processor will create 1 GB of content for each FlowFile it creates. The FlowFile content does not live in heap memory space.

Each generated FlowFile will have a core set of FlowFile attributes created. For example:

--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Thu Apr 26 14:52:12 UTC 2018'
Key: 'lineageStartDate' Value: 'Thu Apr 26 14:52:12 UTC 2018'
Key: 'fileSize' Value: '67'
FlowFile Attribute Map Content
Key: 'filename' Value: '7409235136254821'
Key: 'path' Value: './'
Key: 'uuid' Value: '119b16a1-7cb2-40ff-b92e-77bc733389e6'
--------------------------------------------------

You can, however, define additional attributes on each generated FlowFile by adding them via custom properties in the GenerateFlowFile processor. File size does not matter when it comes to the heap usage of queued FlowFiles. In this way you can create FlowFiles with as many attributes as you want:

--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Thu Apr 26 14:55:54 UTC 2018'
Key: 'lineageStartDate' Value: 'Thu Apr 26 14:55:54 UTC 2018'
Key: 'fileSize' Value: '0'
FlowFile Attribute Map Content
Key: 'attr1' Value: 'This is a test'
Key: 'attr2' Value: 'This is a test'
Key: 'attr3' Value: 'This is a test'
Key: 'attr4' Value: 'This is a test'
Key: 'filename' Value: '7409457340083769'
Key: 'path' Value: './'
Key: 'uuid' Value: 'f6254149-be47-46f3-a659-c5126ae80481'
--------------------------------------------------

You can adjust the run schedule and batch setting to control the number of new FlowFiles generated over a specific time period.

For example: with Run Schedule set to 5 sec and Batch Size set to 1000, this processor will produce 1000 new FlowFiles every 5 seconds.

Thanks, Matt
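The run schedule and batch size example works out to a simple rate. The helper below is a hypothetical back-of-the-envelope calculation, assuming every scheduled run gets a thread on time and produces a full batch.

```python
def flowfiles_per_second(run_schedule_seconds, batch_size):
    """Average FlowFiles generated per second for a timer-driven schedule."""
    return batch_size / run_schedule_seconds


def flowfiles_per_minute(run_schedule_seconds, batch_size):
    """Same rate, expressed per minute."""
    return flowfiles_per_second(run_schedule_seconds, batch_size) * 60


# Run Schedule = 5 sec, Batch Size = 1000, as in the example above.
print(flowfiles_per_second(5, 1000))
print(flowfiles_per_minute(5, 1000))
```

Shortening the run schedule or raising the batch size scales this rate proportionally, which is how the number of FlowFiles queued for a heap usage test can be controlled.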

Last Visited	‎07-09-2026 06:21 AM

Member Since	‎07-30-2019 10:41 AM
Posts	3,472
Kudos received	1638

Cloudera Community

Re: ListenNetFlow processor does not decode Cisco ...

Re: Can we detect who did a particular operation i...

Re: How to invoke a url in nifi which is protected...

Re: Retry impacts scheduler

Re: 503 error while copying/versioning big process...

Re: Is there a way to group nodes in a cluster and...

Re: How to process failed records in CDC?

Re: NiFI Content Repository archival is not workin...

Re: What if nifi fails to write data?

Re: HDF/NIFI Best practices for setting up a high ...

Re: How to improve nifi concurrency

Re: 40 Gbps NiFi Cluster

Re: 40 Gbps NiFi Cluster

Re: Duplicate of flowfile after NiFi primary node ...

Re: flowfile attributes in Nifi