
How to combine Hive table files for input to mapreduce?

Rising Star

I have an HDP 2.0 cluster where I'm running a MapReduce program that takes a Hive (0.14) table as input. The Hive table consists of a large number of small files, so a large number of mapper containers are being requested. Is there a way to combine the small files before they are fed to the MapReduce job?

1 ACCEPTED SOLUTION

Super Guru

@Phoncy Joseph

You can raise the minimum input split size in Hive so that small files are combined into fewer, larger splits, which reduces the number of mappers; you might also need to increase the mapper heap size accordingly.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set mapred.min.split.size=100000000;

Or

Try using a Hadoop archive (HAR) to pack the small files into a single archive file.

https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html#Looking+Up+Files
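For reference, creating an archive looks roughly like the sketch below. This is cluster-dependent and the paths are illustrative assumptions, not from the original post; adjust them to your warehouse layout.

```shell
# Archive all files under the table directory into one HAR
# (syntax: hadoop archive -archiveName <name>.har -p <parent> <src> <dest>)
hadoop archive -archiveName mytable.har \
    -p /apps/hive/warehouse/mydb.db mytable \
    /user/joe/archives

# The archived contents remain readable through the har:// filesystem
hadoop fs -ls har:///user/joe/archives/mytable.har
```

Note that a HAR reduces NameNode pressure but does not by itself merge records, and reading through har:// still produces one split per original file unless combined with a combine input format.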


3 REPLIES

Super Guru

Are you using HiveInputFormat? It's better to use CombineHiveInputFormat, which combines small files into a single split.
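For a custom MapReduce job (as in the question), the analogous approach is CombineTextInputFormat from the Hadoop 2.x mapreduce API, which packs many small files into each split. A minimal, untested driver sketch, assuming the table files are plain text and with class and path names purely illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombineDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-small-files");
        job.setJarByClass(CombineDriver.class);

        // Pack small files together into splits of up to ~128 MB each,
        // instead of one mapper per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... set mapper/reducer classes here ...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The split size cap controls the trade-off: larger splits mean fewer mappers but more data (and heap) per mapper.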


New Contributor

Are there any counters that can verify this? I am setting the properties above but failing to see a reduction in the number of mappers.