Created on 12-14-2017 10:21 AM - edited 08-17-2019 06:11 PM
I have a rather large partitioned table with the following description.
PARTITIONED BY (`dt_month` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION ''
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'last_modified_by'='user1',
  'last_modified_time'='1483944081',
  'orc.bloom.filter.columns'='id2,id1,pjkd,gen_ds,key,rf',
  'orc.bloom.filter.fpp'='0.05',
  'orc.create.index'='true',
  'spark.sql.partitionProvider'='catalog',
  'spark.sql.statistics.colStats.id1.avgLen'='10',
  'spark.sql.statistics.colStats.id1.distinctCount'='36919609',
  'spark.sql.statistics.colStats.id1.maxLen'='10',
  'spark.sql.statistics.colStats.id1.nullCount'='0',
  'spark.sql.statistics.colStats.id1.version'='1',
  'spark.sql.statistics.colStats.id2.avgLen'='10',
  'spark.sql.statistics.colStats.id2.distinctCount'='9936718',
  'spark.sql.statistics.colStats.id2.maxLen'='10',
  'spark.sql.statistics.colStats.id2.nullCount'='0',
  'spark.sql.statistics.colStats.id2.version'='1',
  'spark.sql.statistics.numRows'='20623558494',
  'spark.sql.statistics.totalSize'='1355372444614',
  'transient_lastDdlTime'='1504869132')
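For reference, the partition layout and per-partition details can be inspected with something along these lines (the table name mytable and the dt_month value are placeholders):

SHOW PARTITIONS mytable;
DESCRIBE FORMATTED mytable PARTITION (dt_month='2017-01');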
I tried to execute a simple max() on one of the columns with the query below, and it shows no progress whatsoever. I have checked that there are no locks on the table and that there are enough free resources available on the cluster. An application ID is created for the Hive query and it shows up in YARN, but it makes no progress.
Any tips on how I can start debugging would be very welcome!
set hive.exec.orc.memory.pool=1.0;
set hive.exec.parallel=true;
set tez.runtime.unordered.output.buffer.size-mb=819;
set tez.runtime.io.sort.mb=3276;
set hive.tez.container.size=9216;
set hive.tez.java.opts=-server -Djava.net.preferIPv4Stack=true -XX:NewRatio=8 -XX:+UseNUMA -XX:+UseParallelGC -XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps;
set mapreduce.map.memory.mb=5120;
set mapreduce.map.java.opts=-Xmx4096m;
set mapreduce.reduce.memory.mb=12288;
set mapreduce.reduce.java.opts=-Xmx9830m;
set tez.am.container.reuse.enabled=true;
set hive.mapjoin.optimized.hashtable=true;
set hive.prewarm.enabled=true;
set hive.tez.dynamic.partition.pruning=false;

select 'g' + max(gen_ds) as gen_ds_old from table;
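For completeness, the lock check was done roughly along these lines (the table name mytable and the dt_month value are placeholders):

SHOW LOCKS mytable EXTENDED;
SHOW LOCKS mytable PARTITION (dt_month='2017-01') EXTENDED;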
Created 12-14-2017 02:31 PM
The query succeeds when I run it on MR!
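For reference, switching the execution engine for the test is just:

set hive.execution.engine=mr;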
Created 12-14-2017 08:26 PM
Hi Subramaniam:
It is likely that the job is exceeding the available YARN resources on your cluster or as defined by your default YARN queue. Take a look at the YARN Resource Manager UI for the job. You will likely see that the required number of vcores or RAM exceeds the resources made available to YARN on your cluster.
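If that turns out to be the case, one rough workaround (the values below are only illustrative; match them to the limits shown in the ResourceManager UI) is to point the query at a queue with free capacity or shrink the Tez containers from within the Hive session:

set tez.queue.name=default;          -- run on a queue with available headroom
set hive.tez.container.size=4096;    -- smaller containers fit more tasks within the queue limits
set hive.tez.java.opts=-Xmx3276m;    -- keep the JVM heap at roughly 80% of the container size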
Thank you,
Jeff Groves
Created 12-20-2017 12:49 PM
Hi @Jeff Groves,
thanks for looking into this, but that doesn't seem very probable. Right now there is about 4 TB of memory available on the cluster, and there are 400 vcores available for the default queue that I am running on. The container size on YARN can be anything between 4 and 156 GB and between 1 and 16 vcores, based on the YARN configuration.
Given that I am requesting roughly 9 GB for each Tez container and the default number of vcores (1, I assume), I shouldn't be exceeding these limits in any way!
The only problem I can see is that, since this is a large table, I expect some 4000 containers to be requested in total. But shouldn't YARN just handle them on a max-allocation basis (say, 300 at a time, with the rest scheduled as containers complete and become available)? If not, how exactly can I control this behavior?
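For what it's worth, the only knobs I have found that look related are the Tez split-grouping settings, something like the following (the sizes are just guesses on my part), though I don't know whether that is the right lever here:

set tez.grouping.min-size=268435456;   -- 256 MB; larger grouping sizes mean fewer, bigger splits
set tez.grouping.max-size=1073741824;  -- 1 GB; fewer splits means fewer containers requested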