Support Questions
Find answers, ask questions, and share your expertise

Best approach in loading too many small file in to HDFS/Hive

Explorer

We have a scenario where a third party loads XML files into a NAS directory. There will be around 100K files per day (each under 1 MB), and they arrive at random times. Currently we load the files into HDFS in batches (about 5,000 per run, every hour) using the hadoop fs -put command. After loading them to HDFS we transform the data and load it into Hive. This is now taking a lot of time.

 

What would be the best alternative approaches to speed things up here?

1 REPLY 1

Re: Best approach in loading too many small file in to HDFS/Hive

Champion

Create a spooling directory and configure Flume with a custom deserializer plugin for XML.
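A minimal Flume agent sketch for this setup, assuming a spooling-directory source watching the NAS path and an HDFS sink that rolls files by size rather than per event (the paths, agent name, and the custom XML deserializer class are placeholders you would replace with your own):

```properties
# Hypothetical agent "a1": spooldir source -> file channel -> HDFS sink
a1.sources = xmlsrc
a1.channels = ch1
a1.sinks = hdfssink

# Watch the NAS landing directory; the deserializer class is an assumption
# standing in for your custom XML plugin.
a1.sources.xmlsrc.type = spooldir
a1.sources.xmlsrc.spoolDir = /nas/xml_landing
a1.sources.xmlsrc.deserializer = com.example.flume.XmlDeserializer$Builder
a1.sources.xmlsrc.channels = ch1

a1.channels.ch1.type = file

# Roll output by size (~128 MB) instead of per file, so HDFS receives a
# small number of large files rather than 100K tiny ones.
a1.sinks.hdfssink.type = hdfs
a1.sinks.hdfssink.channel = ch1
a1.sinks.hdfssink.hdfs.path = /data/xml/%Y-%m-%d
a1.sinks.hdfssink.hdfs.fileType = DataStream
a1.sinks.hdfssink.hdfs.rollSize = 134217728
a1.sinks.hdfssink.hdfs.rollCount = 0
a1.sinks.hdfssink.hdfs.rollInterval = 0
a1.sinks.hdfssink.hdfs.useLocalTimeStamp = true
```

Setting hdfs.rollCount and hdfs.rollInterval to 0 disables event-count and time-based rolling, so only the size threshold triggers a new HDFS file.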

When you ingest too many small files you will run into performance problems, since reads involve more seeks and DataNode switches, and the NameNode must track metadata for every file. The best approach is to convert all those small files into one big sequential file and stream it along. I would also encourage you to look into Hadoop Archives (HAR files), which were introduced to HDFS in 0.18.0 and are designed for small-file scenarios.
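The consolidation idea above can be sketched locally before involving Hadoop at all: merge each hour's batch of small XML files into one large file on the NAS side, then put that single file into HDFS. This is a minimal Python sketch under the assumption that the files can simply be concatenated one per line (in production you would more likely write a Hadoop SequenceFile keyed by filename); the directory names are hypothetical:

```python
import glob
import os

def merge_small_files(src_dir, out_path, max_bytes=128 * 1024 * 1024):
    """Concatenate small *.xml files from src_dir into one large file.

    Stops once roughly max_bytes have been written (targeting an HDFS
    block-sized output). Returns the number of bytes written.
    """
    written = 0
    with open(out_path, "wb") as out:
        for path in sorted(glob.glob(os.path.join(src_dir, "*.xml"))):
            if written >= max_bytes:
                break
            with open(path, "rb") as f:
                data = f.read()
            out.write(data)
            # One document per line keeps the merged file splittable
            # by line-oriented readers.
            if not data.endswith(b"\n"):
                out.write(b"\n")
                written += 1
            written += len(data)
    return written
```

After merging, a single `hadoop fs -put` of the large file replaces thousands of per-file puts, which removes most of the NameNode and connection-setup overhead.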