Created on 06-06-2016 06:21 AM - edited 09-16-2022 03:23 AM
I'm trying to load terabytes of data from a local system into HDFS. What should the strategy be for loading the files in terms of performance?
Created 06-06-2016 07:24 AM
I suggest you try NiFi's PutHDFS processor; you can find more on this here:
https://community.hortonworks.com/articles/7999/apache-nifi-part-1-introduction.html
Created 06-06-2016 08:19 AM
Bulk upload? In that case, use an edge node with the hadoop client installed and run hadoop fs -put commands. You can expect roughly 300 GB/h for each put into HDFS. However, you can parallelize the commands: if you have multiple files, you can run multiple puts in parallel (essentially until you saturate the cluster's internal network or the read throughput of the local/network storage).
A little bash or Python script will normally do the trick. NiFi will obviously work too and may provide retry/error handling that you would otherwise have to code yourself, so it depends a bit on your requirements.
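As a rough illustration of the script approach, here is a minimal Python sketch that runs several hadoop fs -put commands in parallel. It assumes the hadoop client is on the PATH of the edge node; the source and target paths, and the degree of parallelism, are hypothetical placeholders you would adjust to your environment.

# Minimal sketch: parallel "hadoop fs -put" uploads from an edge node.
# Assumptions: "hadoop" is on PATH; /data/incoming and /landing/raw are
# hypothetical local source and HDFS target paths.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

SOURCE_GLOB = "/data/incoming/*"   # local files to upload (hypothetical path)
HDFS_TARGET = "/landing/raw"       # HDFS destination directory (hypothetical path)
PARALLELISM = 4                    # concurrent puts; raise until network or disk saturates

def put_file(local_path):
    # Each worker uploads one file with a separate "hadoop fs -put" process.
    result = subprocess.run(["hadoop", "fs", "-put", local_path, HDFS_TARGET])
    return local_path, result.returncode

with ThreadPoolExecutor(max_workers=PARALLELISM) as pool:
    for path, rc in pool.map(put_file, glob.glob(SOURCE_GLOB)):
        if rc != 0:
            print("FAILED:", path)  # retry/error handling is left to you (or to NiFi)

Tuning PARALLELISM is the main knob: more workers help until the cluster network or the local storage read speed becomes the bottleneck, as noted above.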
Created 06-06-2016 08:45 AM
Thanks Benjamin. Yes, it is a bulk upload.