
Loading Huge Number of Small Files into HDP Small Cluster

Hi Everyone,

I'm doing a small exercise: loading a directory containing 43K small files (about 254 MB in total, spread across 75 sub-directories) into a 3-node VM HDP cluster (1 NameNode with 4 GB RAM, 2 DataNodes with 3 GB RAM each) on my MacBook Pro (16 GB RAM).

The loading time is significant: 33 minutes. I did not tune any parameters beyond what the standard HDP 2.5 installation guide recommends, and I used "hdfs dfs -put /source /hdfs-path" to do the load. Any suggestions for how to optimize the loading time?
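For example, would running several puts in parallel, one per sub-directory, help? A rough sketch of what I mean (paths are placeholders; this assumes the 75 sub-directories sit directly under /source):

# one hdfs put per sub-directory, four at a time
ls -d /source/*/ | xargs -P 4 -I {} hdfs dfs -put {} /hdfs-path/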

Mahmoud

2 Replies

Re: Loading Huge Number of Small Files into HDP Small Cluster

@msabri

Loading a large number of small files will always take quite some time because of the per-file overhead of writing to HDFS. One way to make this run much more efficiently is to use Apache NiFi (included in Hortonworks Data Flow): with NiFi, you can use a MergeContent processor to coalesce the small files into larger files before writing them to HDFS.
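If standing up NiFi is not an option, the same merge idea can be approximated from the command line before the upload. This is only a rough sketch: it assumes the files are plain text that can simply be concatenated, merges one file per sub-directory, and all paths are placeholders:

# merge the small files locally, then upload the much smaller number of large files
mkdir -p /tmp/merged
for d in /source/*/; do
  cat "$d"* > "/tmp/merged/$(basename "$d").txt"
done
hdfs dfs -put /tmp/merged /hdfs-path/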

Re: Loading Huge Number of Small Files into HDP Small Cluster

Thanks, @emaxwell, for sharing your experience. If I merge the small files together before loading them into HDFS, how can I then process them from the HDFS side (Pig, Hive, etc.)? Should I un-merge them first? For instance, today I would point a Hive external table at the loaded directory, roughly as in the sketch below; would the same approach work against the merged files? I'd appreciate any details you can share on this point.
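A sketch of what I do today on the raw files (the table name is made up and the path is a placeholder):

hive -e "CREATE EXTERNAL TABLE small_files_raw (line STRING) LOCATION '/hdfs-path/';"

-Mahmoud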
