How to combine Hive table files for input to mapreduce?

phoncy_joseph — Wed, 15 Jun 2016 14:08:43 GMT

I have an HDP 2.0 cluster where I'm running a MapReduce program that takes a Hive (0.14) table as input. The Hive table is stored as a large number of small files, so a large number of mapper containers are requested. Is there a way to combine the small files before they are fed to the MapReduce job?

Re: How to combine Hive table files for input to mapreduce?

rajkumar_singh — Wed, 15 Jun 2016 14:11:21 GMT

Are you using HiveInputFormat? It's better to use a combine input format (for example CombineHiveInputFormat), which packs multiple small files into a single split.
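
To illustrate why a combining input format reduces the mapper count, here is a minimal Python sketch of packing small files into splits. It is a simplified model only: the function name, the file sizes, and the 128 MB limit are assumptions for illustration, and Hadoop's real implementation also takes node and rack locality into account.

```python
from typing import List


def combine_into_splits(file_sizes: List[int], max_split_size: int) -> List[List[int]]:
    """Greedily pack files (sizes in bytes) into splits of at most max_split_size.

    Simplified model: a file larger than max_split_size simply gets its own split.
    """
    splits: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in file_sizes:
        # Close the current split if adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits


MB = 1024 * 1024
small_files = [8 * MB] * 100  # hypothetical: 100 files of 8 MB each
splits = combine_into_splits(small_files, max_split_size=128 * MB)
print(len(small_files), "files ->", len(splits), "splits")  # 100 files -> 7 splits
```

Without combining, each of the 100 files would need at least one mapper; in this model they fit into 7 splits.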

Re: How to combine Hive table files for input to mapreduce?

jyadav — Wed, 15 Jun 2016 16:01:13 GMT

@Phoncy Joseph

You can set the minimum input split size in Hive to a higher value to reduce the number of mappers, but you might also need to increase the mapper heap size.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

set mapred.min.split.size=100000000;

Alternatively, try using a Hadoop archive (HAR file) to pack the small files into a single file.

https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html#Looking+Up+Files
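
As a rough check on the split size setting above, the following Python sketch applies the split size rule used by Hadoop's FileInputFormat, max(minSize, min(maxSize, blockSize)), to a set of small files. The 128 MB block size, the file sizes, and the helper names are assumptions for illustration, and the sketch ignores the small slack factor Hadoop applies to the last split of a file.

```python
def compute_split_size(block_size: int, min_size: int, max_size: int) -> int:
    """Split size rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def splits_for_file(file_size: int, split_size: int) -> int:
    """Number of splits for one file; a non-combining format never merges files."""
    return max(1, -(-file_size // split_size))  # ceiling division, at least 1


MB = 1024 * 1024
block_size = 128 * MB      # assumed HDFS block size
min_size = 100_000_000     # value of mapred.min.split.size from the post above
max_size = 2**63 - 1       # assumed: no upper limit configured

split_size = compute_split_size(block_size, min_size, max_size)
small_files = [8 * MB] * 100  # hypothetical: 100 files of 8 MB each
total_splits = sum(splits_for_file(size, split_size) for size in small_files)
print(split_size, total_splits)
```

In this model the 100 small files still yield 100 splits, because a larger minimum split size only affects how a single large file is cut up. Merging several files into one split depends on the combining input format being in effect.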

Re: How to combine Hive table files for input to mapreduce?

amruthkesav_s — Tue, 06 Feb 2018 17:16:11 GMT

Are there any counters that can assess this? I am trying the above properties but am not seeing any reduction in the number of mappers.