Created 01-21-2016 11:56 AM
Goodmorning everyone.
A customer has a problem with a query on hive (with Tez).
The query is as follows:
select a.hashedaddress from x a join x b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;
The process is stopped at the first step of mapping:
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED Map 1 FAILED 62 0 0 62 43 7 Map 3 KILLED 62 0 0 62 0 0 Reducer 2 KILLED 1009 0 0 1009 0 0 VERTICES: 00/03 [>>--------------------------] 0% ELAPSED TIME: 26.69 s
Log error is this: vertexerror.txt
The table schema is as follows:
CREATE TABLE `x`( `guid` string, `brandname` string, `hashedaddress` string, [and more]) PARTITIONED BY ( `calendardate` date) CLUSTERED BY ( brandname) INTO 5 BUCKETS ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION '*****' TBLPROPERTIES ( 'transient_lastDdlTime'='1445504632')
Table contains some 20 million records, to do the tests we tried to apply a limit to the tables:
select a.hashedaddress from (select * from x limit 10000) a join (select * from x limit 10000) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;
The query runs without problems
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED Map 1 ..........SUCCEEDED 62 62 0 0 0 0 Map 3 ..........SUCCEEDED 62 62 0 0 0 0 Reducer 2 ......SUCCEEDED 1 1 0 0 0 0 Reducer 4 ......SUCCEEDED 1 1 0 0 0 0 VERTICES: 04/04 [==========================>>] 100% ELAPSED TIME: 55.13 sThen we have increased the limits up to the cap (100000. 1000000, 10000000) and it works correctly, After this changing the query removing the limitations and it works:
select a.hashedaddress from (select * from x) a join (select * from x) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED Map 1 ..........SUCCEEDED 62 62 0 0 0 0 Map 3 ..........SUCCEEDED 62 62 0 0 0 0 Reducer 2 ......SUCCEEDED 253 253 0 0 0 0 VERTICES: 03/03 [==========================>>] 100% ELAPSED TIME:
Can you help me understand why not work the first query? The join does it with the same data. Thanks
Created 02-03-2016 08:24 AM
Finally the customer has applied the change (hive.vectorized.execution.enabled=false). Everything works correctly now.
Thanks so much
Created 01-21-2016 04:31 PM
What is the HDP or Hive version?
Created 01-21-2016 04:46 PM
hive 0.14.0.2.2 on HDP-2.2.7.1-10 ps: is hdinsight distribution.
Created 01-21-2016 05:04 PM
Please provide the explain plan of the failing query. Also can you try the failing query with hive.vectorized.execution.enabled=false?
Created 01-22-2016 02:24 PM
Hi Deepesh.
Tonight the customer tries to disable the vectorization, when give me a feedback I update you.
For the moment, thank you!
Created 01-23-2016 06:33 PM
Just to test ...can you run the query withouth vectorization?
set hive.vectorized.execution.enabled=false
Created 01-25-2016 09:45 AM
@Deepesh suggested this solution.
Friday I informed the customer, I still waiting for a response. Just answer me I'll know you.
Thank you!
Created 01-24-2016 09:09 AM
Hive doesn't support non-equijoins yet. This is coming soon.
Created 01-25-2016 09:40 AM
Thanks for answer!
Created 02-03-2016 08:24 AM
Finally the customer has applied the change (hive.vectorized.execution.enabled=false). Everything works correctly now.
Thanks so much