What are good tools / methods to get data into HDFS? How do these tools handle network failures and other errors?

Hi,

I am looking for a list of tools that can be used to transfer "large" amounts of data (1 TB and up; file sizes usually around 10-20 GB, mostly CSV) from different machines on the company's network into HDFS. Sometimes the data storage is far away; let's say we need to transfer data from Europe to the US. How do these tools handle network failures and other errors?

What are the options and what are their drawbacks (e.g. bottlenecks with copyFromLocal, etc.)? DistCp? NiFi? SFTP/copyFromLocal? Flume?

Direct vs. indirect ingestion? (storage -> edge -> HDFS vs. storage -> HDFS) I'd push local data (meaning within one data center) directly to HDFS, but in other cases provide an edge node for data ingestion.

How does NiFi handle network failures? Is there something like FTP's resume mechanism?

What are your experiences?

Thanks!

Jonas

1 ACCEPTED SOLUTION


Re: What are good tools / methods to get data into HDFS? How do these tools handle network failures and other errors?

Jonas, this is a great fit for NiFi for the following reasons:

  1. As you correctly assumed, network issues can be common. Building a new system around the ingest layer to handle retries, load balancing, security, compression and encryption (because you do want to transfer this data over an encrypted channel, don't you?) is far more work than most people want to take on. NiFi has those covered out of the box; link the nodes/clusters with the site-to-site protocol. See, for example, some discussions at http://community.hortonworks.com/topics/site2site.html and the reference docs.
  2. Moving this data around usually comes with a requirement to maintain the chain of custody and keep track of the file at every step of the flow (which is, again, an inherent feature of NiFi).
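To make point 1 concrete: if you build the ingest layer yourself, you end up hand-rolling the retry/resume logic NiFi gives you for free. A minimal sketch of that logic (all names hypothetical; a real tool would also need checksums over the wire, encryption, etc.) — restart each attempt at the current size of the partially written destination, like FTP's REST command:

```python
import hashlib
import os
import tempfile

CHUNK = 64 * 1024  # transfer unit; real tools tune this


class FlakySource:
    """Simulates a remote file whose connection drops on certain reads."""

    def __init__(self, data, fail_at_reads):
        self.data = data
        self.reads = 0
        self.fail_at_reads = set(fail_at_reads)

    def read_at(self, offset, size):
        self.reads += 1
        if self.reads in self.fail_at_reads:
            raise ConnectionError("simulated network failure")
        return self.data[offset:offset + size]


def transfer(source, dest_path, max_retries=5):
    """Resume-on-failure copy: each retry continues from the bytes
    already landed on disk instead of starting over."""
    attempts = 0
    while True:
        try:
            offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
            with open(dest_path, "ab") as dst:
                while True:
                    chunk = source.read_at(offset, CHUNK)
                    if not chunk:  # end of source reached
                        return
                    dst.write(chunk)
                    offset += len(chunk)
        except ConnectionError:
            attempts += 1
            if attempts > max_retries:
                raise


data = os.urandom(300 * 1024)
with tempfile.TemporaryDirectory() as tmp:
    dest = os.path.join(tmp, "part-0000.csv")
    # Two simulated network drops mid-transfer; the copy still completes.
    transfer(FlakySource(data, fail_at_reads={2, 4}), dest)
    with open(dest, "rb") as f:
        copied = f.read()

print(hashlib.sha256(copied).hexdigest() == hashlib.sha256(data).hexdigest())  # True
```

And this sketch covers only resume-on-failure; NiFi's site-to-site layer additionally handles back pressure, load balancing and secure channels, which is why building it yourself rarely pays off.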
6 REPLIES


Re: What are good tools / methods to get data into HDFS? How do these tools handle network failures and other errors?

Thanks a lot for the input. I agree NiFi is a great fit for this use case and brings a lot of features out of the box.

Thanks for filing the Jiras regarding NiFi resume and node affinity :)

Re: What are good tools / methods to get data into HDFS? How do these tools handle network failures and other errors?

The team decided on SFTP for now. We'll look into NiFi for the production system, so I am definitely looking forward to file chunking, node affinity, etc. in NiFi.

Re: What are good tools / methods to get data into HDFS? How do these tools handle network failures and other errors?

How do you connect via SFTP? It's not supported by DistCp, and I didn't want to always load data into a temp folder on the edge node. So in the end I used sshfs at customer_xyz. It worked pretty well.
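For anyone following along, the sshfs approach boils down to mounting the remote directory over SSH and copying into HDFS from the mount. A sketch with hypothetical host names and paths (the `reconnect` option helps ride out short network drops):

```shell
# Mount the remote storage over SSH; reconnect on dropped connections.
# Host, mount point and HDFS target are hypothetical.
sshfs -o reconnect,ServerAliveInterval=15,ServerAliveCountMax=3 \
    user@storage.customer-xyz.example:/exports/csv /mnt/customer_xyz

# Copy into HDFS from the mounted path, then unmount.
hdfs dfs -put /mnt/customer_xyz/*.csv /data/raw/customer_xyz/
fusermount -u /mnt/customer_xyz
```

Note that `hdfs dfs -put` itself won't resume a partial file; if the put fails mid-copy you'd delete the partial target and retry.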

Re: What are good tools / methods to get data into HDFS? How do these tools handle network failures and other errors?

Honestly, I was thinking of a data ingestion node and a temp folder :) Maybe the NFS Gateway would be an option; however, it's not really made for lots of large files, and I still have to consider network failures.