What are good tools / methods to get data into HDFS? How do these tools handle network failures and other errors?


Hi,

I am looking for a list of tools that can be used to transfer "large" amounts of data (1 TB and up; file sizes usually around 10-20 GB, mostly CSV) from different machines on the company's network into HDFS. Sometimes the data storage is far away; let's say we need to transfer data from Europe to the US. How do these tools handle network failures and other errors?

What are the options, and what are their drawbacks (e.g., bottlenecks such as copyFromLocal)? DistCp? NiFi? SFTP/copyFromLocal? Flume?

Direct vs. indirect ingestion (storage -> edge -> HDFS vs. storage -> HDFS)? I'd push local data (meaning within one datacenter) directly to HDFS, but in other cases provide an edge node for data ingestion.

How does NiFi handle network failures? Is there something like FTP's resume method?

What are your experiences?

Thanks!

Jonas

1 ACCEPTED SOLUTION


Jonas, this is a great fit for NiFi for the following reasons:

  1. As you correctly assumed, network issues can be common. Building new systems around the ingest layer to handle retries, load balancing, security, compression, and encryption (because you want to transfer this data over an encrypted channel, don't you?) is far more than most teams want to take on. NiFi has all of that covered out of the box; you link the nodes/clusters with the site-to-site protocol. See, for example, the discussions at http://community.hortonworks.com/topics/site2site.html and the reference docs. A rough sketch of the retry logic you would otherwise have to hand-roll follows this list.
  2. Moving this data around usually comes with a requirement to maintain a chain of custody and to track the file at every step of the flow (which, again, is an inherent feature of NiFi).
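
To make point 1 concrete, here is a minimal sketch of the kind of chunked, retrying transfer loop you would otherwise have to build and maintain yourself. All names here (transfer_with_resume, send_chunk, the chunk size) are made up for illustration; NiFi's site-to-site protocol does this bookkeeping for you:

```python
# Minimal sketch of hand-rolled retry/resume logic. Everything here
# (transfer_with_resume, send_chunk, CHUNK_SIZE) is hypothetical; it only
# illustrates what an ingest layer has to handle when the network flakes out.
import time

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per attempt, so a failure costs little
MAX_RETRIES = 5

def transfer_with_resume(src_path, send_chunk):
    """Send src_path in chunks, retrying each chunk with exponential backoff.

    send_chunk(data, offset) is a hypothetical callable that pushes one chunk
    to the remote side (e.g., an edge node or an HDFS client wrapper).
    """
    offset = 0
    with open(src_path, "rb") as f:
        while True:
            f.seek(offset)
            data = f.read(CHUNK_SIZE)
            if not data:
                return  # whole file sent
            for attempt in range(MAX_RETRIES):
                try:
                    send_chunk(data, offset)
                    break
                except OSError:
                    # Network hiccup: back off and retry this chunk only,
                    # instead of restarting the whole multi-GB file.
                    time.sleep(2 ** attempt)
            else:
                raise RuntimeError(f"giving up at offset {offset}")
            offset += len(data)
```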


6 REPLIES



Thanks a lot for the input. I agree NiFi is a great fit for this use case and brings a lot of features out of the box.

Thanks for filing the JIRAs regarding NiFi resume and node affinity 🙂


The team decided on SFTP for now. We'll look into NiFi for the production system, so I am definitely looking forward to file chunking, node affinity, etc. in NiFi.
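
If the SFTP-to-edge-node path is the interim solution, a minimal sketch of the indirect route (storage -> edge -> HDFS) might look like the following. The hostnames, paths, and key file are placeholder assumptions; the SFTP leg uses paramiko and the HDFS leg is a plain hdfs dfs -put run on the edge node:

```python
# Sketch of indirect ingestion: SFTP a file to an edge node, then load it
# into HDFS from there. Hostnames, paths, and credentials are hypothetical.
import paramiko

EDGE_HOST = "edge.example.com"          # assumed edge-node hostname
LOCAL_FILE = "/data/export/part-0001.csv"
STAGING = "/staging/part-0001.csv"      # temp folder on the edge node
HDFS_TARGET = "/ingest/customer/"

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(EDGE_HOST, username="ingest",
            key_filename="/home/ingest/.ssh/id_rsa")

# Leg 1: SFTP the file into the edge node's staging area.
sftp = ssh.open_sftp()
sftp.put(LOCAL_FILE, STAGING)
sftp.close()

# Leg 2: from the edge node, push the staged file into HDFS.
_, stdout, stderr = ssh.exec_command(f"hdfs dfs -put -f {STAGING} {HDFS_TARGET}")
if stdout.channel.recv_exit_status() != 0:
    raise RuntimeError(stderr.read().decode())
ssh.close()
```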


How do you connect via SFTP? It's not supported by DistCp, and I didn't want to always load data into a temp folder on the edge node, so in the end I used sshfs at customer_xyz. It worked pretty well.
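
For anyone reading along, a rough sketch of that sshfs workaround, with assumed hostnames and paths: mount the remote directory on the node you copy from, then use copyFromLocal (DistCp would need the path visible on every worker node, which a single sshfs mount is not):

```python
# Sketch of the sshfs workaround: mount the remote SFTP directory locally,
# then copy into HDFS. Hosts and paths are hypothetical; sshfs must be
# installed, and the mount exists only on this one machine.
import subprocess

subprocess.run(
    ["sshfs", "user@customer-host:/export/data", "/mnt/customer"],
    check=True,
)
# copyFromLocal reads from the local mount, so it runs on this node only
# (unlike DistCp, which fans out across the cluster).
subprocess.run(
    ["hdfs", "dfs", "-copyFromLocal", "/mnt/customer", "/ingest/customer"],
    check=True,
)
subprocess.run(["fusermount", "-u", "/mnt/customer"], check=True)
```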


Honestly, I was thinking of a data ingestion node and a temp folder 🙂 Maybe the NFS Gateway would be an option; however, it's not really made for lots of large files, and I still have to consider network failures.
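
For completeness, a rough sketch of what the NFS Gateway path could look like, with an assumed gateway host and mount point. Note that the retry loop restarts the whole file on failure, which is exactly the large-file concern above:

```python
# Sketch of the NFS Gateway option: mount HDFS as NFS, then copy with a
# simple retry loop. The gateway host and paths are hypothetical; the mount
# options (NFSv3, TCP, nolock) follow the Hadoop NFS Gateway documentation.
import shutil
import subprocess
import time

subprocess.run(
    ["mount", "-t", "nfs", "-o", "vers=3,proto=tcp,nolock",
     "nfs-gateway.example.com:/", "/hdfs_nfs"],
    check=True,
)

for attempt in range(5):
    try:
        shutil.copy("/data/export/part-0001.csv", "/hdfs_nfs/ingest/")
        break
    except OSError:
        # A network failure here means re-copying the entire file,
        # which is why this path is awkward for many large files.
        time.sleep(2 ** attempt)
else:
    raise RuntimeError("copy failed after retries")
```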