
Load terabytes of data from Local system to HDFS

Rising Star

I'm trying to load terabytes of data from a local system into HDFS. What should the strategy be for loading the files, in terms of performance?

1 ACCEPTED SOLUTION

Bulk upload? In that case, use an edge node with the Hadoop client installed and run hadoop fs -put commands. You can expect roughly 300 GB/h for each hadoop put into HDFS. However, you can parallelize the commands: if you have multiple files, you can run multiple puts in parallel (essentially until you saturate the internal network of the cluster or the read throughput of the local/network storage).

A little bash or python script will normally do the trick. NiFi will obviously work too, and might provide some retry/error handling that you would otherwise have to code yourself, so it depends a bit on your requirements.
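As a rough sketch of the kind of bash script described above: the snippet below walks a local directory and runs several hadoop fs -put commands in parallel via xargs. The paths /data/incoming and /landing/raw and the parallelism of 4 are made-up example values; raise the -P count until you saturate the cluster network or the read throughput of the local storage.

#!/usr/bin/env bash
# Sketch only: source dir, HDFS target dir and parallelism are example values.
SRC_DIR=/data/incoming     # local directory containing the files to load
DEST_DIR=/landing/raw      # target directory in HDFS

# Make sure the target directory exists in HDFS.
hadoop fs -mkdir -p "$DEST_DIR"

# List every file and feed it to xargs, which runs up to 4 puts at a time.
# -print0/-0 keep filenames with spaces intact; -I {} substitutes one file per command.
find "$SRC_DIR" -type f -print0 \
  | xargs -0 -P 4 -I {} hadoop fs -put {} "$DEST_DIR"/

Note that a script like this does no retry or error handling on its own, which is exactly the gap NiFi would cover.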


3 REPLIES

I suggest you try the NiFi PutHDFS processor; you can find more on this here:

https://community.hortonworks.com/articles/7999/apache-nifi-part-1-introduction.html


Rising Star

Thanks Benjamin. Yes, it is a bulk upload.
