Can you please reduce the Remote Poll Batch Size to a smaller value, or leave it at the default (which is 5000)?
From the documentation:
The value specifies how many file paths to find in a given directory on the remote system when doing a file listing. This value in general should not need to be modified, but when polling against a remote system with a tremendous number of files this value can be critical. Setting this value too high can result in very poor performance, and setting it too low can cause the flow to be slower than normal.
I strongly suspect that the SFTP session is timing out on the source end, causing the data transfer to stall.
In addition, could you please set the parameter Send Keep Alive On Timeout to true?
We have tried the scenario you described.
We configured the settings as suggested: increased the Connection Timeout and Data Timeout, and set Remote Poll Batch Size = 5000.
But we are still facing the same problem.
When we set the Remote Poll Batch Size to 1000, the amount of data pushed to HDFS is 1.2 GB; when set to 5000, it is 6.4 GB; when set to 20000, it is 25 GB. Our data set is 225 GB in total (50 sub-folders, roughly 175,000 files).
So what is a possible way to transfer the full data set?
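The figures above scale almost linearly with the batch size, which suggests that only a single listing batch is ever transferred before the flow stalls. A rough back-of-the-envelope check (the 225 GB total and ~175,000 file count come from the message above; the average file size is derived from them, not measured):

```python
# Estimate the average file size and how much data one batch would move,
# assuming exactly one batch of files is listed and transferred per run.

TOTAL_GB = 225          # total data set size reported above
TOTAL_FILES = 175_000   # approximate total file count reported above

avg_file_mb = TOTAL_GB * 1024 / TOTAL_FILES   # ~1.3 MB per file on average

def gb_transferred(batch_size, avg_mb=avg_file_mb):
    """Data moved if only one batch of `batch_size` files is transferred."""
    return batch_size * avg_mb / 1024

# Compare against the observed numbers:
#   batch 1000  -> ~1.3 GB  (observed: 1.2 GB)
#   batch 5000  -> ~6.4 GB  (observed: 6.4 GB)
#   batch 20000 -> ~25.7 GB (observed: 25 GB)
```

The close match implies the batch size is not the root cause; the flow simply never proceeds past the first batch, so raising the batch size alone will not move all 225 GB.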
It looks like the polling interval is causing the problem: the processor waits 60 s before the next fetch while the current batch is still being processed, and the transfer stalls. Could you please increase it to a longer value? (Since this is a one-off migration, you can keep a large value for the test.)
On another note, you can use ListSFTP followed by FetchSFTP, which will accomplish the same thing.
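A minimal sketch of that two-processor flow, assuming NiFi's standard ListSFTP/FetchSFTP processors (the hostname and remote path below are placeholders, not values from this thread; property names may differ slightly between NiFi versions):

```
ListSFTP
  Hostname:                   <remote host>     # fill in your source host
  Remote Path:                <source dir>      # top-level directory to list
  Search Recursively:         true              # covers the 50 sub-folders
  Remote Poll Batch Size:     5000              # listing only, no payload moved here

    └─ success ──► FetchSFTP
         Hostname:                   <remote host>
         Remote File:                ${path}/${filename}   # attributes set by ListSFTP
         Send Keep Alive On Timeout: true                  # guard against idle-session drops

           └─ success ──► PutHDFS
```

Because ListSFTP keeps state about what it has already listed, each run picks up where the last one left off, so the full 225 GB can drain over multiple polling cycles instead of depending on one batch completing within a single session.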