Created 06-15-2016 07:08 AM
I have an HDP 2.0 cluster where I'm running a MapReduce program that takes a Hive (0.14) table as input. The table is made up of a large number of small files, so a correspondingly large number of mapper containers is being requested. Is there a way to combine the small files before they are fed to the MapReduce job?
Created 06-15-2016 09:01 AM
You can set the minimum input split size in Hive to a higher value to reduce the number of mappers, but you might also need to increase the mapper heap size.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.min.split.size=100000000;
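For reference, when CombineHiveInputFormat is in effect, mapred.max.split.size caps how much data gets packed into each combined split, and the per-node / per-rack minimums influence how files are grouped. A sketch of the related session properties; the byte values are illustrative, not tuned recommendations:

set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;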
Or
Try using a Hadoop archive (HAR) to pack the small files into a single archive file.
https://hadoop.apache.org/docs/r1.2.1/hadoop_archives.html#Looking+Up+Files
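A sketch of creating an archive; the table path and destination below are hypothetical examples:

hadoop archive -archiveName mytable.har -p /apps/hive/warehouse mytable /user/hive/archives

The archived files can then be referenced through the har:// filesystem, as described at the link above. One caveat: a HAR mainly relieves NameNode metadata pressure; as far as I know, MapReduce still creates one split per archived file, so on its own it won't reduce the mapper count unless paired with a combining input format.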
Created 06-15-2016 07:11 AM
Are you using HiveInputFormat? It's better to use a combining input format such as CombineFileInputFormat, which merges many small files into a single split.
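For the MapReduce side, here is a minimal driver sketch using CombineTextInputFormat (a concrete text-oriented subclass of CombineFileInputFormat). The class name MyDriver, the input/output paths, and the 256 MB split cap are placeholders for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "combine-small-files");
    job.setJarByClass(MyDriver.class);
    // Pack many small files into each split instead of one split per file
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at ~256 MB (value is illustrative)
    CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
    CombineTextInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Set your actual mapper/reducer classes here
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note this only applies cleanly to text-backed tables; for other storage formats you would need the matching CombineFileInputFormat subclass, or the Hive-side settings shown earlier.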
Created 02-06-2018 09:16 AM
Are there any counters that can verify this? I am trying the above properties but failing to see a reduction in the number of mappers.