Created on 06-06-2016 06:21 AM - edited 09-16-2022 03:23 AM
I'm trying to load terabytes of data from a local system into HDFS. What should the strategy be for loading the files in terms of performance?
Created 06-06-2016 07:24 AM
I suggest you try NiFi's PutHDFS processor; you can find more on this here:
https://community.hortonworks.com/articles/7999/apache-nifi-part-1-introduction.html
Created 06-06-2016 08:19 AM
Bulk upload? In that case, use an edge node with the hadoop client installed and run hadoop fs -put commands. You can expect roughly 300 GB/h for each put into HDFS. However, you can parallelize the commands: if you have multiple files, you can run multiple puts in parallel (essentially until you saturate the cluster's internal network or the read throughput of the local/network storage).
A little bash or Python script will normally do the trick. NiFi will obviously work too and may provide retry/error handling that you would otherwise have to code yourself, so it depends a bit on your requirements.
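As a rough illustration of the script approach, here is a minimal Python sketch that runs several hadoop fs -put commands in parallel. It assumes the hadoop client is on the PATH of the edge node; the source and target paths, and the degree of parallelism, are hypothetical placeholders you would adjust to your environment.

# Minimal sketch: parallel "hadoop fs -put" uploads from an edge node.
# Assumptions: "hadoop" is on PATH; /data/incoming and /landing/raw are
# hypothetical local source and HDFS target paths.
import glob
import subprocess
from concurrent.futures import ThreadPoolExecutor

SOURCE_GLOB = "/data/incoming/*"   # local files to upload (hypothetical path)
HDFS_TARGET = "/landing/raw"       # HDFS destination directory (hypothetical path)
PARALLELISM = 4                    # concurrent puts; raise until network or disk saturates

def put_file(local_path):
    # Each worker uploads one file with a separate "hadoop fs -put" process.
    result = subprocess.run(["hadoop", "fs", "-put", local_path, HDFS_TARGET])
    return local_path, result.returncode

with ThreadPoolExecutor(max_workers=PARALLELISM) as pool:
    for path, rc in pool.map(put_file, glob.glob(SOURCE_GLOB)):
        if rc != 0:
            print("FAILED:", path)  # retry/error handling is left to you (or to NiFi)

Tuning PARALLELISM is the main knob: more workers help until the cluster network or the local storage read speed becomes the bottleneck, as noted above.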
Created 06-06-2016 08:45 AM
Thanks Benjamin. Yes, it is a bulk upload.