We have a scenario where a third party loads XML files into a NAS directory. There will be around 100K files/day (each file is less than 1 MB), and they arrive at random times. Currently we load the files into HDFS in batches (5,000 per run, every hour) using the `hadoop fs -put` command. After loading them into HDFS we run a transformation and load the results into Hive. This is now taking a lot of time.
What would be the best alternative approaches to speed things up here?
Create a spooling directory and configure Flume with a custom deserializer plugin for XML.
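A minimal sketch of such a Flume agent, assuming the NAS is mounted at `/mnt/nas/xml-incoming` and that `com.example.XmlDeserializer` is your hypothetical custom deserializer class (the agent, source, channel, and sink names are placeholders):

```
# Agent components (names are illustrative)
agent1.sources  = spoolSrc
agent1.channels = memCh
agent1.sinks    = hdfsSink

# Spooling-directory source watching the NAS mount
agent1.sources.spoolSrc.type = spooldir
agent1.sources.spoolSrc.spoolDir = /mnt/nas/xml-incoming
# Your custom XML deserializer (assumed class name)
agent1.sources.spoolSrc.deserializer = com.example.XmlDeserializer$Builder

# HDFS sink that rolls events into large files instead of many small ones
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = /data/xml/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.hdfs.rollSize = 134217728   # roll at ~128 MB
agent1.sinks.hdfsSink.hdfs.rollCount = 0          # disable count-based rolling
agent1.sinks.hdfsSink.hdfs.rollInterval = 300     # or every 5 minutes

# In-memory channel wiring source to sink
agent1.channels.memCh.type = memory
agent1.channels.memCh.capacity = 10000
agent1.sources.spoolSrc.channels = memCh
agent1.sinks.hdfsSink.channel = memCh
```

Because the sink rolls by size/time rather than per input file, many small XML files end up in a few large HDFS files, which also helps the small-files problem described below.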
When you put too many small files into HDFS, you will run into performance problems because there will be more seeks and more DataNode switching. The best approach would be to merge those small files into one big sequential file (e.g. a SequenceFile) and stream that along instead. I would also encourage you to look into Hadoop Archives (HAR files), which were introduced to HDFS in 0.18.0 and are designed for small-file scenarios.
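For example, packing an existing directory of small files into a HAR can be done with the `hadoop archive` tool (the paths and archive name here are illustrative):

```
# Pack everything under /data/xml/incoming into one archive
hadoop archive -archiveName xml-files.har -p /data/xml/incoming /data/xml/archived

# The archive is then addressable through the har:// scheme
hadoop fs -ls har:///data/xml/archived/xml-files.har
```

Note that a HAR reduces NameNode metadata pressure but does not speed up reads by itself; if the downstream Hive transformation is the bottleneck, merging into SequenceFiles (or letting Flume roll large files, as above) is usually the better fit.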