
Load terabytes of data from Local system to HDFS

Expert Contributor

I'm trying to load terabytes of data from a local system into HDFS. What should the loading strategy be in terms of performance?

1 ACCEPTED SOLUTION

Master Guru

Bulk upload? In that case, use an edge node with the Hadoop client installed and run hadoop fs -put commands. You can expect roughly 300 GB/h for each put into HDFS. However, you can parallelize the commands: if you have multiple files, you can run multiple puts in parallel (essentially until you saturate the cluster's internal network or the read throughput of the local/network storage).

A little bash or Python script will normally do the trick. NiFi will obviously work too, and it might provide some retry/error handling that you would otherwise have to code yourself, so it depends a bit on your requirements.
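As a minimal sketch of the parallel-put idea above: the Python script below shells out to the stock hadoop fs -put client from an edge node, running a few uploads at a time. The source directory, HDFS target path, and worker count are placeholder assumptions you would adjust for your own environment.

```python
#!/usr/bin/env python3
"""Sketch: run several `hadoop fs -put` uploads in parallel from an edge node.
Assumes the Hadoop client is on the PATH; all paths below are placeholders."""
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SOURCE_DIR = Path("/data/local/bulk")   # assumed local staging directory
HDFS_TARGET = "/landing/bulk"           # assumed HDFS destination directory
MAX_PARALLEL_PUTS = 4                   # start small; raise until throughput stops scaling

def put_file(local_file: Path) -> int:
    """Upload one file with the HDFS client and return the exit code."""
    cmd = ["hadoop", "fs", "-put", str(local_file), HDFS_TARGET]
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print(f"FAILED: {local_file}", file=sys.stderr)
    return result.returncode

def main() -> None:
    files = [p for p in SOURCE_DIR.iterdir() if p.is_file()]
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL_PUTS) as pool:
        results = list(pool.map(put_file, files))
    failed = sum(1 for rc in results if rc != 0)
    print(f"{len(files) - failed}/{len(files)} files uploaded")

if __name__ == "__main__":
    main()
```

Threads are sufficient here because each worker mostly waits on an external hadoop process; increase MAX_PARALLEL_PUTS only while total throughput keeps growing, since the cluster network or the local disk reads will become the bottleneck at some point.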


3 REPLIES

Super Guru

I would suggest you try NiFi's PutHDFS processor; you can find more on this here:

https://community.hortonworks.com/articles/7999/apache-nifi-part-1-introduction.html


Expert Contributor

Thanks Benjamin. Yes, it is a bulk upload.