We are using the NFS client (and gateway) to ingest files with 28 MB of data each, which we collect over SFTP (Python, paramiko). We receive a lot of these files: 100 files every 5 minutes. The files are in gzip format and we need to check each one after we receive it. We use the command gzip -t <file> for the check. So we do the SFTP GET, store the file in HDFS through the NFS client (and NFS gateway), and finally check it.
The problem is that the time spent checking these files varies a lot. We monitored these jobs for a day and found that 60% of the executions take less than 15 seconds to check the files, but the other 40% take between 15 seconds and 2 minutes. When we check the same files on a local file system, it never takes more than a second per file.
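To make the comparison between the local disk and the NFS mount repeatable, the check can be timed from Python. This is only a minimal sketch, not the script we actually run: check_gzip is a hypothetical helper name, and it uses Python's gzip module to decompress the whole file and discard the output, which is roughly what gzip -t does.

```python
import gzip
import time
import zlib


def check_gzip(path, chunk_size=1024 * 1024):
    """Decompress the whole file, discarding the output, and time it.

    Returns (ok, seconds). ok is False if the file is not valid gzip,
    is truncated, or fails its CRC check.
    """
    start = time.monotonic()
    ok = True
    try:
        with gzip.open(path, "rb") as fh:
            # Read in chunks so a 28 MB file is never held fully in memory.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile is a subclass of OSError; a truncated file raises EOFError.
        ok = False
    return ok, time.monotonic() - start
```

Running it on the same file once from a local directory and once from the NFS mount point shows how much of the time is spent reading through the gateway rather than decompressing.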
Has anyone seen the same behavior, or does anyone know something about this slow performance of the NFS service? Do you know how to troubleshoot the NFS gateway to identify the bottleneck?
Thank you in advance.
Sagar, thank you. Our NFS is already mounted with that sync option. We will check the other parameters mentioned.
The original Hortonworks blog post that announced the NFS gateway feature is at http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/. The text itself is a bit dated and lists information straight out of the documentation, but the discussion after it is very informative. If you haven't read the discussion, I suggest you do.
The NFS Gateway is just a skin over a DFS client. But this skin can grow thick with overhead when doing too many writes.
It is also not HA, and it has poor support for a Kerberized cluster.
The comments on that post describe a way to scale the gateway for throughput: run several instances of the NFS gateway, mount each one at its own mount point, and write to the mount points in round-robin fashion. But this just adds complexity to something that would be adopted only to make ingestion simpler.
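For illustration, the round-robin idea could look roughly like the sketch below. It is not taken from the blog comments: RoundRobinWriter and the mount paths in the usage note are hypothetical, and it assumes every mount point exposes the same HDFS namespace through a different gateway instance.

```python
import itertools
import os
import shutil


class RoundRobinWriter:
    """Copy each file through the next NFS mount point in turn."""

    def __init__(self, mount_points):
        if not mount_points:
            raise ValueError("at least one mount point is required")
        self._cycle = itertools.cycle(list(mount_points))

    def copy(self, src, relative_dest):
        # relative_dest must be a relative path; an absolute path would
        # make os.path.join ignore the mount point.
        mount = next(self._cycle)
        dest = os.path.join(mount, relative_dest)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copyfile(src, dest)
        return dest
```

With RoundRobinWriter(["/mnt/hdfs1", "/mnt/hdfs2"]), consecutive calls to copy() alternate between the two gateways, so no single gateway carries all 100 files of a batch.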
There is no free lunch, and in my experience the added skin/complexity of the NFS gateway is not worth the trouble. It is better to stick with the actual DFS client and work on optimizing the file writes and processing on the HDP cluster. Maybe aggregate the files through a Flume pipeline, or with a custom script that aggregates them and then copies them to HDFS using the HDFS client.
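As a rough sketch of the custom-script route, files that were downloaded and checked on local disk can be uploaded in one batch with the hdfs dfs -put command. The helper names and the paths in the usage note are hypothetical, and the sketch assumes the hdfs command-line client is installed and configured on the ingest host.

```python
import subprocess


def build_put_command(local_paths, hdfs_dir):
    """Build one 'hdfs dfs -put' command for a whole batch of local files."""
    if not local_paths:
        raise ValueError("no files to upload")
    # -f overwrites a destination file that already exists.
    return ["hdfs", "dfs", "-put", "-f", *local_paths, hdfs_dir]


def put_files(local_paths, hdfs_dir):
    """Run the upload; raises CalledProcessError if the client exits non-zero."""
    subprocess.run(build_put_command(local_paths, hdfs_dir), check=True)
```

Calling put_files(["/data/in/a.gz", "/data/in/b.gz"], "/landing/sftp") uploads both files with a single client invocation. Batching matters here because every hdfs dfs call starts its own JVM, so one call per 5-minute batch is much cheaper than one call per file.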