I'm doing a small exercise: loading a directory of 43K small files (about 254 MB in total, spread across 75 subdirectories) into a 3-node HDP cluster of VMs (1 NameNode with 4 GB RAM, 2 DataNodes with 3 GB RAM) running on my MacBook Pro (16 GB RAM).
The load takes a long time (33 minutes). I have not tuned any parameters beyond what the standard installation guide for HDP 2.5 specifies. I used the command "hdfs dfs -put /source /hdfs-path" to do the load. Any suggestions for how to reduce the loading time?
Loading a large number of files will always take quite some time because of the per-file overhead: every file put to HDFS needs its own NameNode metadata operations and its own write to the DataNodes, no matter how small the file is. One way to make this much more efficient is to use Apache NiFi (included in Hortonworks DataFlow). With NiFi, you can use the MergeContent processor to coalesce the small files into larger files before writing them to HDFS.
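To illustrate the idea of coalescing before loading, here is a minimal Python sketch that packs the small files into a handful of larger tar archives on the local machine. It is not NiFi and not how MergeContent is implemented; the function name `pack_small_files`, the `batch-NNNNN.tar` naming, and the 128 MB target size are all assumptions made up for this example.

```python
import os
import tarfile


def pack_small_files(src_dir, out_dir, target_bytes=128 * 1024 * 1024):
    """Pack every file under src_dir into tar archives of roughly target_bytes.

    Hypothetical helper for illustration only. out_dir should live outside
    src_dir so the archives are not picked up while walking the tree.
    Returns the list of archive paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    archives = []
    tar = None
    current_size = 0

    for root, dirs, files in os.walk(src_dir):
        dirs.sort()  # deterministic traversal order
        for name in sorted(files):
            path = os.path.join(root, name)
            size = os.path.getsize(path)

            # Start a new archive when none is open yet, or when adding
            # this file would push the current one past the target size.
            too_big = current_size > 0 and current_size + size > target_bytes
            if tar is None or too_big:
                if tar is not None:
                    tar.close()
                archive_path = os.path.join(
                    out_dir, "batch-%05d.tar" % len(archives)
                )
                tar = tarfile.open(archive_path, "w")
                archives.append(archive_path)
                current_size = 0

            # Keep the path relative to src_dir so the sub-directory
            # layout can be recovered from the archive later.
            tar.add(path, arcname=os.path.relpath(path, src_dir))
            current_size += size

    if tar is not None:
        tar.close()
    return archives
```

With roughly 254 MB of input and the assumed 128 MB target, this should leave only a few archives, so a single "hdfs dfs -put" of the output directory performs a few file creations instead of 43K. Tar is used here only because it keeps file boundaries intact; whether it suits the downstream tools is a separate question.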
Thanks, Emaxwell, for sharing your experience! If I merge the small files together before loading them into HDFS, how can I then process them on the HDFS side (Pig, Hive, etc.)? Should I un-merge them first? I'd appreciate it if you could share some details on this point. -Mahmoud