
How to combine Hive table files for input to mapreduce?


I have an HDP 2.0 cluster where I'm executing a MapReduce program that takes a Hive (0.14) table as input. The Hive table consists of a large number of small files, so a large number of mapper containers are being requested. Is there a way to combine the small files before they are fed to the MapReduce job?

1 ACCEPTED SOLUTION


@Phoncy Joseph

You can raise the minimum input split size in Hive to reduce the number of mappers, but you may also need to increase the mapper heap size, since each mapper will then process more data.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set mapred.min.split.size=100000000;
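On more recent Hadoop releases the `mapred.*` property names are deprecated in favour of `mapreduce.*`. A hedged equivalent (exact property support depends on your Hive/Hadoop version; the max-size value here is illustrative):

```sql
-- CombineHiveInputFormat packs many small files into a single split
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- YARN-era equivalents of mapred.min.split.size (values in bytes, ~100 MB / ~256 MB)
set mapreduce.input.fileinputformat.split.minsize=100000000;
set mapreduce.input.fileinputformat.split.maxsize=256000000;
```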

Or

Try using a Hadoop archive (HAR) to combine the small files into a single archive file.

https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html#Looking+Up+Files
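For the HAR route, a minimal sketch of the archiving commands (the database, table, and output paths below are illustrative, not from the original post):

```shell
# Pack the table's small files into one archive; -p sets the parent path
hadoop archive -archiveName mytable.har \
    -p /apps/hive/warehouse/mydb.db mytable /user/me/archives

# Archived files remain readable through the har:// filesystem scheme
hadoop fs -ls har:///user/me/archives/mytable.har
```

Note that a HAR reduces NameNode pressure but does not by itself merge records; readers still see the original files inside the archive.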


3 REPLIES


Are you using HiveInputFormat? It is better to use CombineFileInputFormat, which combines many small files into a single split.
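If the MapReduce job is written against the new (`org.apache.hadoop.mapreduce`) API, switching to a combining input format can be done in the driver. A sketch under that assumption (the class name and ~128 MB limit are illustrative; `CombineTextInputFormat` is the text-file variant shipped with Hadoop):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineSmallFilesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineSmallFilesDriver.class);

        // Pack small files into splits of up to ~128 MB each,
        // instead of launching one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... set mapper/reducer classes as in the existing job ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This requires the Hadoop client libraries on the classpath and a running cluster, so it is a sketch rather than a drop-in replacement for your existing driver.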



Are there any counters that can confirm this? I am setting the above properties but am not seeing a reduction in the number of mappers.
