
If you run a Hive INSERT query with dynamic partitioning on a column that has many partitions (more than 10), you may notice that the query generates too many small files per partition.

INSERT OVERWRITE TABLE dB.Test partition(column5)
select
   column1
  ,column2
  ,column3
  ,column4
  ,column5
from
  Test2;
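As a side note (not part of the original query), an insert where every partition value is dynamic, as above, typically requires nonstrict dynamic partition mode. A minimal session setup might look like:

```sql
-- Session settings commonly required before a fully dynamic partition insert.
-- These are standard Hive properties; exact defaults vary by Hive version.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
```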

For example, if your table has 2000 partitions and your query runs with 1009 reducers (the hive.exec.reducers.max default), you could end up with roughly 2 million small files (2000 × 1009 ≈ 2 million).
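The worst case arises when every reducer receives rows for every partition value, so each reducer opens one record writer, and produces one file, per partition. A quick way to gauge the fan-out for the example tables above (a sketch; the count query is illustrative, not from the original article):

```sql
-- Worst case: files ≈ partitions × reducers = 2000 × 1009 ≈ 2 million

-- Check the current reducer cap:
set hive.exec.reducers.max;

-- Estimate how many dynamic partitions the insert will create:
SELECT count(DISTINCT column5) FROM Test2;
```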

To understand how Tez determines the number of reducers, refer to: https://community.hortonworks.com/articles/22419/hive-on-tez-performance-tuning-determining-reducer....

This can result in the following issues:

1. HDFS NameNode performance degradation:

Refer: https://community.hortonworks.com/articles/15104/small-files-in-hadoop.html

2. The File Merge operation failing with java.lang.OutOfMemoryError: GC overhead limit exceeded:

File Merge: java.lang.OutOfMemoryError: GC overhead limit exceeded 
  at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149) 
  at java.lang.StringCoding.decode(StringCoding.java:193) 
  at java.lang.String.<init>(String.java:414) 
  at com.google.protobuf.LiteralByteString.toString(LiteralByteString.java:148) 
  at com.google.protobuf.ByteString.toStringUtf8(ByteString.java:572) 
  at org.apache.hadoop.security.proto.SecurityProtos$TokenProto.getService(SecurityProtos.java:274) 
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:848) 
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:833) 
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1285) 
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1435) 
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1546) 
  at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1555) 
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:621) 
  at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source) 
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
  at java.lang.reflect.Method.invoke(Method.java:497) 
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278) 
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194) 
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176) 
  at com.sun.proxy.$Proxy13.getListing(Unknown Source) 
  at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2136) 
  at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.hasNextNoFilter(DistributedFileSystem.java:1100) 
  at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.hasNext(DistributedFileSystem.java:1075) 
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:304) 
  at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265) 
  at org.apache.hadoop.hive.shims.Hadoop23Shims$1.listStatus(Hadoop23Shims.java:148) 
  at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217) 
  at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:75) 
  at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:309) 
  at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.processPaths(CombineHiveInputFormat.java:596) 
  at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:473) 
  at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:571) 

DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 

To avoid this issue, set the following property:

set hive.optimize.sort.dynamic.partition=true;

When this property is enabled, the dynamic partition column is globally sorted. Each reducer then needs to keep only one record writer open at a time per partition value, which reduces the memory pressure on reducers.
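Putting it together, the insert from the example above would be run with the property set in the same session (same table and column names as the original example):

```sql
-- Sort rows by the dynamic partition column before they reach the writers:
set hive.optimize.sort.dynamic.partition=true;

INSERT OVERWRITE TABLE dB.Test partition(column5)
select
   column1
  ,column2
  ,column3
  ,column4
  ,column5
from
  Test2;
```

With this enabled, rows for each partition value arrive as a contiguous sorted stream, so each partition typically ends up with one file per reducer that handles it, rather than one file from every reducer.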

Version history
Revision #: 1 of 1
Last update: 03-18-2017 11:33 PM