Created 01-09-2019 05:05 AM
I'm trying to load a large dataset of 225 GB (~175,000 files) from an SFTP server and copy it to HDFS.
To implement this, we've used 2 processors.
1. GetSFTP (to get the files from the SFTP server)
Configured Processor -> Search Recursively = true ; Use Natural Ordering = true ; Remote Poll Batch Size = 20000 ; Concurrent Tasks = 3
2. PutHDFS (to push the data to HDFS)
Configured Processor -> Concurrent Tasks = 3 ; Conflict Resolution Strategy = replace ; Hadoop Configuration Resources ; Directory
After some time the copying stops and the data size in HDFS no longer increases. I can't figure out what I'm doing wrong.
Created 01-09-2019 05:25 AM
Hi @Rahul Bhargava,
Can you please reduce the Remote Poll Batch Size, or leave it at the default value (which is 5000)?
From the documentation:
The value specifies how many file paths to find in a given directory on the remote system when doing a file listing. This value in general should not need to be modified but when polling against a remote system with a tremendous number of files this value can be critical. Setting this value too high can result very poor performance and setting it too low can cause the flow to be slower than normal.
I strongly suspect that the SFTP server is timing out the open session at the source end, which causes the data transfer to stall.
In addition, could you please set the Send Keep Alive On Timeout property to true and increase all the other timeout settings?
Hope this helps!
Created 01-09-2019 08:40 AM
Hi,
Thanks for your support.
We applied the changes you suggested.
We configured the following settings: increased Connection Timeout and Data Timeout, Remote Poll Batch Size = 5000.
But we are still facing the same problem.
With Remote Poll Batch Size set to 1000, 1.2 GB of data is pushed to HDFS; with 5000, 6.4 GB is pushed; with 20000, 25 GB is pushed. Our data totals 225 GB (50 subfolders, ~175,000 files in total).
So how can we transfer the full data set?
We have attached the screenshots for the GetSFTP processor: screenshot-from-2019-01-09-12-59-04.png, screenshot-from-2019-01-09-12-59-49.png
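The figures reported above can be checked with a little arithmetic. The sketch below assumes that all files are roughly the same size (the thread does not say this) and estimates how many files each observed volume corresponds to. The names `observed_gb`, `avg_file_gb` and `files_transferred` are only illustrative.

```python
# Figures reported in this thread.
TOTAL_GB = 225
TOTAL_FILES = 175_000

# Remote Poll Batch Size -> GB seen in HDFS before the transfer stopped.
observed_gb = {1000: 1.2, 5000: 6.4, 20000: 25.0}

# Assumption: files are of roughly uniform size (about 1.3 MB each).
avg_file_gb = TOTAL_GB / TOTAL_FILES


def files_transferred(gb_pushed: float) -> float:
    """Estimate how many files make up a given volume, given the average size."""
    return gb_pushed / avg_file_gb


for batch_size, gb in observed_gb.items():
    estimate = files_transferred(gb)
    batches = estimate / batch_size
    print(f"batch size {batch_size:>6}: ~{estimate:,.0f} files, ~{batches:.2f} batches")
```

Under that assumption, each run appears to move close to one batch of files and then stop, which would point at the listing not advancing past the first batch rather than at a size limit in HDFS.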
Created 01-15-2019 10:56 PM
Hi @Rahul Bhargava,
It looks like the Polling Interval is causing the problem: the processor waits 60 s before the next fetch, but the current batch is still being processed at that point, so the transfer stalls. Could you please increase it to a longer interval? (As this is a one-off migration, you can use a larger value for testing.)
On another note, you can use ListSFTP followed by FetchSFTP, which does the same job.
Hope this helps!