Support Questions
Find answers, ask questions, and share your expertise

Query hanging when running large query

New Contributor

Hello,

 

I've run into a problemw where a large query is hanging at 75% reduce with no error, it just never progresses beyond there.

 

I've got two files with 17 columns and 100,000,000 rows each.  The query runs on smaller versions of the file, where it does not need to do any reduce, but as soon as I need to do a reduce it hangs at 75%. 

 

We are running a 5.3.x cluster with 8 nodes (4 data nodes).  The cluster is not maxxing RAM, CPU, IO or anything else I can see, so some help on where to look would be great!

 

Here's the query:

 

SELECT
aa.*, bb.*
FROM
a54 aa,
b42 bb
WHERE
(aa.col_0 + aa.col_0 + bb.col_0 = 73)
and
(aa.col_1 + aa.col_1 + bb.col_1 = 73)
and
(aa.col_2 + aa.col_2 + bb.col_2 = 73)
and
(aa.col_3 + aa.col_3 + bb.col_3 = 73)
and
(aa.col_4 + aa.col_4 + bb.col_4 = 73)
and
(aa.col_5 + aa.col_5 + bb.col_5 = 73)
and
(aa.col_6 + aa.col_6 + bb.col_6 = 73)
and
(aa.col_7 + aa.col_7 + bb.col_7 = 73)
and
(aa.col_8 + aa.col_8 + bb.col_8 = 73)
and
(aa.col_9 + aa.col_9 + bb.col_9 = 73)
and
(aa.col_10 + aa.col_10 + bb.col_10 = 73)
and
(aa.col_11 + aa.col_11 + bb.col_11 = 73)
and
(aa.col_12 + aa.col_12 + bb.col_12 = 73)
and
(aa.col_13 + aa.col_13 + bb.col_13 = 73)
and
(aa.col_14 + aa.col_14 + bb.col_14 = 73)
and
(aa.col_15 + aa.col_15 + bb.col_15 = 73)
and
(aa.col_16 + aa.col_16 + bb.col_16 = 73);

4 REPLIES 4

Contributor

It'll take some digging through MR/YARN logs to find the issue..  

 

I'm curious why this needs a reduce phase though, can you paste the explain plan?

 

Thanks

New Contributor

Thanks!  Yes, on smaller tests it does not require the reduce, but when I run the entire files it does.

 

Here is the explain plan:

 

Explain
STAGE DEPENDENCIES:
" Stage-5 is a root stage , consists of Stage-1"
Stage-1
Stage-0 is a root stage

STAGE PLANS:
Stage: Stage-5
Conditional Operator

Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: aa
Statistics: Num rows: 30952380 Data size: 5200000000 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 30952380 Data size: 5200000000 Basic stats: COMPLETE Column stats: NONE
" value expressions: col_0 (type: tinyint), col_1 (type: tinyint), col_2 (type: tinyint), col_3 (type: tinyint), col_4 (type: tinyint), col_5 (type: tinyint), col_6 (type: tinyint), col_7 (type: tinyint), col_8 (type: tinyint), col_9 (type: tinyint), col_10 (type: tinyint), col_11 (type: tinyint), col_12 (type: tinyint), col_13 (type: tinyint), col_14 (type: tinyint), col_15 (type: tinyint), col_16 (type: tinyint), col_17 (type: string)"
TableScan
alias: bb
Statistics: Num rows: 16255253 Data size: 2730882560 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
sort order:
Statistics: Num rows: 16255253 Data size: 2730882560 Basic stats: COMPLETE Column stats: NONE
" value expressions: col_0 (type: tinyint), col_1 (type: tinyint), col_2 (type: tinyint), col_3 (type: tinyint), col_4 (type: tinyint), col_5 (type: tinyint), col_6 (type: tinyint), col_7 (type: tinyint), col_8 (type: tinyint), col_9 (type: tinyint), col_10 (type: tinyint), col_11 (type: tinyint), col_12 (type: tinyint), col_13 (type: tinyint), col_14 (type: tinyint), col_15 (type: tinyint), col_16 (type: tinyint), col_17 (type: string)"
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11} {VALUE._col12} {VALUE._col13} {VALUE._col14} {VALUE._col15} {VALUE._col16} {VALUE._col17}
1 {VALUE._col0} {VALUE._col1} {VALUE._col2} {VALUE._col3} {VALUE._col4} {VALUE._col5} {VALUE._col6} {VALUE._col7} {VALUE._col8} {VALUE._col9} {VALUE._col10} {VALUE._col11} {VALUE._col12} {VALUE._col13} {VALUE._col14} {VALUE._col15} {VALUE._col16} {VALUE._col17}
" outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35, _col36, _col37"
Statistics: Num rows: 34047620 Data size: 5720000000 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (((((((((((((((((((_col0 + _col0) + _col20) = 73) and (((_col1 + _col1) + _col21) = 73)) and (((_col2 + _col2) + _col22) = 73)) and (((_col3 + _col3) + _col23) = 73)) and (((_col4 + _col4) + _col24) = 73)) and (((_col5 + _col5) + _col25) = 73)) and (((_col6 + _col6) + _col26) = 73)) and (((_col7 + _col7) + _col27) = 73)) and (((_col8 + _col8) + _col28) = 73)) and (((_col9 + _col9) + _col29) = 73)) and (((_col10 + _col10) + _col30) = 73)) and (((_col11 + _col11) + _col31) = 73)) and (((_col12 + _col12) + _col32) = 73)) and (((_col13 + _col13) + _col33) = 73)) and (((_col14 + _col14) + _col34) = 73)) and (((_col15 + _col15) + _col35) = 73)) and (((_col16 + _col16) + _col36) = 73)) (type: boolean)
Statistics: Num rows: 259 Data size: 43511 Basic stats: COMPLETE Column stats: NONE
Select Operator
" expressions: _col0 (type: tinyint), _col1 (type: tinyint), _col2 (type: tinyint), _col3 (type: tinyint), _col4 (type: tinyint), _col5 (type: tinyint), _col6 (type: tinyint), _col7 (type: tinyint), _col8 (type: tinyint), _col9 (type: tinyint), _col10 (type: tinyint), _col11 (type: tinyint), _col12 (type: tinyint), _col13 (type: tinyint), _col14 (type: tinyint), _col15 (type: tinyint), _col16 (type: tinyint), _col17 (type: string), _col20 (type: tinyint), _col21 (type: tinyint), _col22 (type: tinyint), _col23 (type: tinyint), _col24 (type: tinyint), _col25 (type: tinyint), _col26 (type: tinyint), _col27 (type: tinyint), _col28 (type: tinyint), _col29 (type: tinyint), _col30 (type: tinyint), _col31 (type: tinyint), _col32 (type: tinyint), _col33 (type: tinyint), _col34 (type: tinyint), _col35 (type: tinyint), _col36 (type: tinyint), _col37 (type: string)"
" outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12, _col13, _col14, _col15, _col16, _col17, _col18, _col19, _col20, _col21, _col22, _col23, _col24, _col25, _col26, _col27, _col28, _col29, _col30, _col31, _col32, _col33, _col34, _col35"
Statistics: Num rows: 259 Data size: 43511 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 259 Data size: 43511 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

Stage: Stage-0
Fetch Operator
limit: -1

Contributor

Yea it makes sense that it is doing reduce-phase, I missed the part where there is a join.

 

I dont have much clue, you might have to go to the MR logs (from YARN UI), can you look at the stuck reducers and if the logs say anything?

New Contributor

Thanks again for the response!  

 

I've gone into all of the logs I can find, they don't seem to indicate any issue, the job just starts spinning.  

 

015-10-16 15:58:32,793 INFO [IPC Server handler 1 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: MapCompletionEvents request from attempt_1445011592055_0013_r_000000_0. startIndex 0 maxEvents 10000
2015-10-16 15:58:33,299 INFO [IPC Server handler 9 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.0
2015-10-16 15:58:35,152 INFO [IPC Server handler 9 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.0
2015-10-16 15:58:38,666 INFO [IPC Server handler 0 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.67554396
2015-10-16 15:58:41,717 INFO [IPC Server handler 5 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.70821464
2015-10-16 15:58:44,761 INFO [IPC Server handler 1 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.7413749
2015-10-16 15:58:47,804 INFO [IPC Server handler 2 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.75381577
2015-10-16 15:58:50,851 INFO [IPC Server handler 7 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.75381577
2015-10-16 15:58:53,892 INFO [IPC Server handler 9 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.75381577
2015-10-16 15:58:56,932 INFO [IPC Server handler 6 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.75381577
2015-10-16 15:58:59,973 INFO [IPC Server handler 15 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.75381577
2015-10-16 15:59:03,011 INFO [IPC Server handler 13 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.75381577
2015-10-16 15:59:06,049 INFO [IPC Server handler 22 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.75381577
2015-10-16 15:59:09,087 INFO [IPC Server handler 8 on 56149] org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1445011592055_0013_r_000000_0 is : 0.75381577