
How to combine Hive table files for input to mapreduce?

Rising Star

I have an HDP 2.0 cluster where I'm running a MapReduce program that takes a Hive (0.14) table as input. The Hive table consists of a large number of small files, so a large number of mapper containers are being requested. Is there a way to combine the small files before they are fed to the MapReduce job?

1 ACCEPTED SOLUTION

Super Guru

@Phoncy Joseph

You can raise the minimum input split size in Hive so that small files are combined into fewer, larger splits, which reduces the number of mappers; you might also need to increase the mapper heap size accordingly.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set mapred.min.split.size=100000000;

Or

Try using a Hadoop archive (HAR) to pack the small files into a single archive file.

https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html#Looking+Up+Files
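For reference, creating an archive looks roughly like the sketch below. This is cluster-dependent and the paths are illustrative assumptions, not from the original post; adjust them to your warehouse layout.

```shell
# Archive all files under the table directory into one HAR
# (syntax: hadoop archive -archiveName <name>.har -p <parent> <src> <dest>)
hadoop archive -archiveName mytable.har \
    -p /apps/hive/warehouse/mydb.db mytable \
    /user/joe/archives

# The archived contents remain readable through the har:// filesystem
hadoop fs -ls har:///user/joe/archives/mytable.har
```

Note that a HAR reduces NameNode pressure but does not by itself merge records, and reading through har:// still produces one split per original file unless combined with a combine input format.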


3 REPLIES

Super Guru

Are you using HiveInputFormat? It's better to use CombineHiveInputFormat, which combines small files into a single split.
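For a custom MapReduce job (as in the question), the analogous approach is CombineTextInputFormat from the Hadoop 2.x mapreduce API, which packs many small files into each split. A minimal, untested driver sketch, assuming the table files are plain text and with class and path names purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineDriver.class);

        // Pack small files together into splits of up to ~128 MB each,
        // instead of one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... set mapper/reducer classes here ...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The split size cap controls the trade-off: larger splits mean fewer mappers but more data (and heap) per mapper.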


New Contributor

Are there any counters that can verify this? I am setting the properties above but failing to see a reduction in the number of mappers.