
How to combine Hive table files for input to mapreduce?

Rising Star

I have an HDP 2.0 cluster where I'm executing a mapreduce program that takes a Hive (0.14) table as input. The Hive table consists of a large number of small files, so a large number of mapper containers are being requested. Is there a way to combine the small files before they are fed to the mapreduce job?

1 ACCEPTED SOLUTION

Super Guru

@Phoncy Joseph

You can set the minimum input split size in Hive to a higher value to reduce the number of mappers, but you may also need to increase the mapper heap size.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set mapred.min.split.size=100000000;
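As a fuller sketch: on Hadoop 2 the mapred.* names above are deprecated aliases, and the current property names are the ones below (the 256 MB values here are illustrative, not a recommendation):

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapreduce.input.fileinputformat.split.minsize=268435456;
set mapreduce.input.fileinputformat.split.maxsize=268435456;

With CombineHiveInputFormat, small files are grouped into one split until the split reaches roughly the configured size, so fewer mapper containers are requested.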

Or

Try using a Hadoop archive (HAR) to combine the small files into a single archive file.

https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html#Looking+Up+Files
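If you try the archive route, the command looks roughly like this (the table and destination paths below are hypothetical; substitute your own):

hadoop archive -archiveName mytable.har -p /apps/hive/warehouse/mydb.db mytable /user/joseph/archives

The archive is then addressed with a har:// URI, e.g. har:///user/joseph/archives/mytable.har. Note that a HAR reduces NameNode pressure but does not by itself merge splits, so pair it with CombineHiveInputFormat if mapper count is the concern.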


3 REPLIES

Super Guru

Are you using HiveInputFormat? It's better to use CombineHiveInputFormat, which combines all the small files to generate a single split.


New Contributor

Are there any counters that can verify this? I am trying the above properties but fail to see a reduction in the number of mappers.