Created 11-11-2015 09:23 AM
Hi,
I am looking for a list of tools that can be used to transfer "large" amounts of data (1 TB+; file sizes usually around 10-20 GB, mostly CSV) from different machines in the company's network into HDFS. Sometimes the data storage is far away, so let's say we need to transfer data from Europe to the US: how do these tools handle network failures and other errors?
What are the options and what are their drawbacks (e.g. bottlenecks with copyFromLocal, etc.)? DistCp? NiFi? SFTP/copyFromLocal? Flume?
Direct vs. indirect ingestion? (storage -> edge node -> HDFS vs. storage -> HDFS) I'd push local data (meaning within one datacenter) directly to HDFS, but in other cases provide an edge node for data ingestion.
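By "direct push" I mean running the HDFS client on the source machine and writing straight into the cluster, roughly like this minimal Java sketch (paths are placeholders; core-site.xml/hdfs-site.xml are assumed to be on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectHdfsPush {
    public static void main(String[] args) throws Exception {
        // Cluster config (core-site.xml / hdfs-site.xml) assumed on the classpath
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path localCsv = new Path("file:///data/export/sales.csv"); // placeholder source file
            Path hdfsDir  = new Path("/landing/csv");                  // placeholder target dir
            // copyFromLocalFile(delSrc, overwrite, src, dst)
            fs.copyFromLocalFile(false, true, localCsv, hdfsDir);
        }
    }
}
```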
How does NiFi handle network failures? Is there something like FTP's resume feature?
What are your experiences?
Thanks!
Jonas
Created 11-11-2015 12:53 PM
Jonas, this is a great fit for NiFi for the following reasons:
Created 11-11-2015 01:40 PM
For a transparent resume feature follow these:
https://issues.apache.org/jira/browse/NIFI-1149
Created 11-11-2015 03:15 PM
Thanks a lot for the input. I agree NiFi is a great fit for this use case and brings a lot of features out of the box.
Thanks for filing the Jiras regarding NiFi resume and node affinity 🙂
Created 11-12-2015 07:04 AM
The team decided on SFTP for now. We'll look into NiFi for the production system, so I am definitely looking forward to file chunking, node affinity, etc. in NiFi.
Created 11-12-2015 01:04 PM
How do you connect via SFTP? It's not supported by DistCp, and I didn't want to always load the data into a temp folder on the edge node. So in the end I used sshfs at customer_xyz. It worked pretty well.
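Roughly, the copy from such a mount into HDFS can look like this sketch with the Hadoop FileSystem API (mount point and HDFS paths are made up; the sshfs mount is created beforehand, and copying into a staging dir before renaming is just one way to keep half-written files out of the target):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;

public class MountToHdfs {
    public static void main(String[] args) throws Exception {
        // SFTP source mounted beforehand via sshfs, e.g. at /mnt/customer_xyz (placeholder)
        File mount = new File("/mnt/customer_xyz");
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path staging = new Path("/landing/_staging"); // placeholder staging dir
            Path target  = new Path("/landing/csv");      // placeholder target dir
            fs.mkdirs(staging);
            fs.mkdirs(target);
            File[] files = mount.listFiles((dir, name) -> name.endsWith(".csv"));
            if (files == null) return;
            for (File f : files) {
                Path tmp = new Path(staging, f.getName());
                // Copy into the staging dir first, so a broken transfer never
                // leaves a half-written file under the target path.
                fs.copyFromLocalFile(false, true, new Path(f.toURI()), tmp);
                // A rename within one HDFS namespace is cheap and atomic.
                fs.rename(tmp, new Path(target, f.getName()));
            }
        }
    }
}
```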
Created 11-12-2015 01:36 PM
Honestly, I was thinking of a data ingestion node and a temp folder 🙂 Maybe the NFS Gateway would be an option; however, it's not really made for lots of large files, and I still have to consider network failures.
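For the network-failure part, one simple approach with the temp-folder setup would be an idempotent copy loop that skips files already fully present in HDFS, so a rerun after a failure only picks up what is still missing. Something like this sketch (paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.File;

public class ResumableBatchCopy {
    public static void main(String[] args) throws Exception {
        File inbox  = new File("/data/inbox");   // placeholder temp folder on the ingestion node
        Path target = new Path("/landing/csv");  // placeholder HDFS target dir
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.mkdirs(target);
            File[] files = inbox.listFiles();
            if (files == null) return;
            for (File f : files) {
                Path dst = new Path(target, f.getName());
                // Skip files that already arrived completely (same length), so a
                // rerun after a network failure only copies what is still missing.
                if (fs.exists(dst) && fs.getFileStatus(dst).getLen() == f.length()) {
                    continue;
                }
                fs.copyFromLocalFile(false, true, new Path(f.toURI()), dst);
            }
        }
    }
}
```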