GetSFTP with NiFi cluster

I'm using NiFi 1.6.0 in a 3-node cluster.

When I use GetSFTP (set to run on all nodes) in a clustered NiFi, the cluster seems to distribute the acquired data evenly among the nodes.

Does this mean that all 3 servers pull the data with GetSFTP evenly?

I also tried using FetchSFTP to get the listings and then did a Site-to-Site back to my own cluster, but it did NOT distribute the 0-byte listing files evenly among the nodes, so the FetchSFTP load was not evenly distributed.

What is the best practice for load balancing GetSFTP in a NiFi cluster?

John

1 ACCEPTED SOLUTION

Hi @John T

When you run GetSFTP on every node of a cluster, you are duplicating your data: each node ingests the same files.

You need to use the List/Fetch pattern. A great description of this pattern is available here: https://pierrevillard.com/2017/02/23/listfetch-pattern-and-remote-process-group-in-apache-nifi/
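To make the difference concrete, here is a minimal Python sketch (not NiFi code) that models the two approaches. The file names, node names, and the ideal round-robin spread of the listing are assumptions made purely for illustration; in a real flow the spread depends on how Site-to-Site hands out the listing flow files.

```python
from itertools import cycle

# Hypothetical remote directory content and cluster nodes (illustration only).
REMOTE_FILES = ["a.csv", "b.csv", "c.csv", "d.csv", "e.csv", "f.csv"]
NODES = ["node1", "node2", "node3"]


def get_on_all_nodes(files, nodes):
    """Model of a Get processor running on every node:
    each node independently pulls the whole directory."""
    return {node: list(files) for node in nodes}


def list_then_fetch(files, nodes):
    """Model of List/Fetch: the listing is produced once, its entries
    are spread over the nodes (idealized round-robin here), and each
    node fetches only the entries it received."""
    assignment = {node: [] for node in nodes}
    for name, node in zip(files, cycle(nodes)):
        assignment[node].append(name)
    return assignment


if __name__ == "__main__":
    print(get_on_all_nodes(REMOTE_FILES, NODES))
    print(list_then_fetch(REMOTE_FILES, NODES))
```

In the first model every node ends up holding all six files; in the second, each file is fetched by exactly one node.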

Now, if you used the List/Fetch pattern correctly and still don't get an even data distribution, you need to understand that the Site-to-Site protocol batches flow files for better network performance. This means that if you have 3 flow files of a few KB or MB to send, NiFi sends them all to one node rather than using 3 connections. The decision is made based on data size, number of flow files, and transmission duration. Because tests are usually run with only a few small files, you don't see the data being distributed while testing.
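The effect of batching on a small test can be sketched with a simplified Python model. This is not NiFi's implementation: the `FlowFile` class, the `send_in_batches` function, and the threshold values are hypothetical, and the duration threshold is left out to keep the model short. The sender fills one batch per peer until a count or size threshold is reached, then moves on to the next peer.

```python
from dataclasses import dataclass
from itertools import cycle

NODES = ["node1", "node2", "node3"]  # hypothetical cluster nodes


@dataclass
class FlowFile:
    name: str
    size: int  # content size in bytes


def send_in_batches(queue, nodes, max_count, max_size):
    """Simplified batching model: one batch per peer, filled until
    either the count or the size threshold is reached."""
    if max_count < 1 or max_size < 1:
        raise ValueError("thresholds must be at least 1")
    received = {node: [] for node in nodes}
    peers = cycle(nodes)
    pending = list(queue)
    while pending:
        node = next(peers)
        batch, batch_size = [], 0
        while pending and len(batch) < max_count and batch_size < max_size:
            flow_file = pending.pop(0)
            batch.append(flow_file)
            batch_size += flow_file.size
        received[node].extend(f.name for f in batch)
    return received


if __name__ == "__main__":
    # Three 0-byte listing flow files, as in a small test.
    listing = [FlowFile(f"file{i}.csv", 0) for i in range(3)]
    print(send_in_batches(listing, NODES, max_count=100, max_size=1_000_000))
    print(send_in_batches(listing, NODES, max_count=1, max_size=1_000_000))
```

With the assumed thresholds of 100 flow files per batch, all three listing entries land on the first node; with a count threshold of 1, each node receives one entry.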

The batching thresholds have default values, but you can change them for each input port. Go to the Remote Process Group (RPG), open Input ports, then click the edit pen for your input port to get these settings:

[Screenshots: input port settings in the RPG (77659-screen-shot-2018-06-13-at-95116-am.png, 77660-screen-shot-2018-06-13-at-95136-am.png)]

I hope this helps you understand the behavior.

Thanks

4 REPLIES

Do you mean you tried ListSFTP to get the listings?
