Member since
09-25-2015
109
Posts
36
Kudos Received
8
Solutions
03-18-2017
11:33 PM
2 Kudos
For users having hive insert query with dynamic partition, and partitions (> 10) on a column, you may notice that your query is generating too many small files per partition. INSERT OVERWRITE TABLE dB.Test partition(column5)
select
column1
,column2
,column3
,column4
,column5
from
Test2;
For Example, if your table has 2000 partitions, and your query is generating 1009 reducers (hive.exec.reducers.max), then you might end up with 2 million small files. To Understand "How Does Tez determine the number of reducers" refer: https://community.hortonworks.com/articles/22419/hive-on-tez-performance-tuning-determining-reducer.html) This could result into issues with: 1. HDFS Namenode performance: Refer: https://community.hortonworks.com/articles/15104/small-files-in-hadoop.html 2. File Merge Operation failing due java.lang.OutOfMemoryError: GC overhead limit exceeded . » File Merge
, java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
at java.lang.StringCoding.decode(StringCoding.java:193)
at java.lang.String.(String.java:414)
at com.google.protobuf.LiteralByteString.toString(LiteralByteString.java:148)
at com.google.protobuf.ByteString.toStringUtf8(ByteString.java:572)
at org.apache.hadoop.security.proto.SecurityProtos$TokenProto.getService(SecurityProtos.java:274)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:848)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:833)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1285)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1435)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1546)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.convert(PBHelper.java:1555)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getListing(ClientNamenodeProtocolTranslatorPB.java:621)
at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:278)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:194)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:176)
at com.sun.proxy.$Proxy13.getListing(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:2136)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.hasNextNoFilter(DistributedFileSystem.java:1100)
at org.apache.hadoop.hdfs.DistributedFileSystem$DirListingIterator.hasNext(DistributedFileSystem.java:1075)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:304)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
at org.apache.hadoop.hive.shims.Hadoop23Shims$1.listStatus(Hadoop23Shims.java:148)
at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:217)
at org.apache.hadoop.mapred.lib.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:75)
at org.apache.hadoop.hive.shims.HadoopShimsSecure$CombineFileInputFormatShim.getSplits(HadoopShimsSecure.java:309)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.processPaths(CombineHiveInputFormat.java:596)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getCombineSplits(CombineHiveInputFormat.java:473)
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getSplits(CombineHiveInputFormat.java:571)
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
To avoid this issue, set the following property: set hive.optimize.sort.dynamic.partition=true; When enabled, dynamic partitioning column will be globally sorted. This way we can keep only one record writer open for each partition value in the reducer thereby reducing the memory pressure on reducers.
... View more
Labels:
01-03-2017
07:34 PM
Great tip. For people new to the Tez lexicon:
AM = application master
DAG = directed acyclic graph
... View more
12-22-2016
08:53 PM
1 Kudo
Steps: Oozie server timezone
For Ambari users, login and navigate to Custom Oozie-site. Ambari > Oozie > configs > custom oozie-site
Add Property "oozie.processing.timezone=GMT-0500" oozie.processing.timezone Default value=UTC Oozie server timezone. Valid values are UTC and GMT(+/-)####, for example 'GMT+0530' would be India timezone. All dates parsed and generated dates by Oozie Coordinator/Bundle will be done in the specified timezone. The default value of 'UTC' should not be changed under normal circumstances. If for any reason is changed, note that GMT(+/-)#### timezones do not observe DST changes.
Save and Restart Oozie Service Oozie Web Console To View the Job in EST Timezone on Oozie Web Console, follow below steps.
Open Oozie Web Console on http://<oozieUrl>:11000/oozie/ Navigate to "Settings" Tab Select from dropdown menu Timezone: EST
Navigate to "* Jobs" Tab and refresh to see the jobs in EST timezone Oozie Coordinator properties, use the time in EST and append "-500" to it. start="2016-12-22T15:46-0500" end="2016-12-22T18:00-0500"
... View more
Labels:
12-11-2017
03:50 PM
what should be the url/command when we need to access hadoop jobs for a specified time duration ?
... View more