Created 01-21-2016 11:56 AM
Goodmorning everyone.
A customer has a problem with a query on hive (with Tez).
The query is as follows:
select a.hashedaddress from x a join x b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;
The process is stopped at the first step of mapping:
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED Map 1 FAILED 62 0 0 62 43 7 Map 3 KILLED 62 0 0 62 0 0 Reducer 2 KILLED 1009 0 0 1009 0 0 VERTICES: 00/03 [>>--------------------------] 0% ELAPSED TIME: 26.69 s
Log error is this: vertexerror.txt
The table schema is as follows:
CREATE TABLE `x`( `guid` string, `brandname` string, `hashedaddress` string, [and more]) PARTITIONED BY ( `calendardate` date) CLUSTERED BY ( brandname) INTO 5 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION '*****' TBLPROPERTIES ( 'transient_lastDdlTime'='1445504632')
Table contains some 20 million records, to do the tests we tried to apply a limit to the tables:
select a.hashedaddress from (select * from x limit 10000) a join (select * from x limit 10000) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;
The query runs without problems
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED Map 1 ..........SUCCEEDED 62 62 0 0 0 0 Map 3 ..........SUCCEEDED 62 62 0 0 0 0 Reducer 2 ......SUCCEEDED 1 1 0 0 0 0 Reducer 4 ......SUCCEEDED 1 1 0 0 0 0 VERTICES: 04/04 [==========================>>] 100% ELAPSED TIME: 55.13 sThen we have increased the limits up to the cap (100000. 1000000, 10000000) and it works correctly, After this changing the query removing the limitations and it works:
select a.hashedaddress from (select * from x) a join (select * from x) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED Map 1 ..........SUCCEEDED 62 62 0 0 0 0 Map 3 ..........SUCCEEDED 62 62 0 0 0 0 Reducer 2 ......SUCCEEDED 253 253 0 0 0 0 VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME:
Can you help me understand why not work the first query? The join does it with the same data. Thanks
Created 02-03-2016 08:24 AM
Finally the customer has applied the change (hive.vectorized.execution.enabled=false). Everything works correctly now.
Thanks so much
Created 02-02-2017 01:52 AM
Can you also post the time taken to execute the 20 mi records? thnx !