Support Questions
Find answers, ask questions, and share your expertise

Hive query issue

Rising Star

Goodmorning everyone.

A customer has a problem with a query on hive (with Tez).

The query is as follows:

select a.hashedaddress from x a join x b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;

The process is stopped at the first step of mapping:

VERTICES	STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1           FAILED     62          0        0       62      43       7 

Map 3           KILLED     62          0        0       62       0       0 

Reducer 2       KILLED   1009          0        0     1009       0       0 

VERTICES: 00/03  [>>--------------------------] 0%    ELAPSED TIME: 26.69 s

Log error is this: vertexerror.txt

The table schema is as follows:

CREATE TABLE `x`(
  `guid` string,
  `brandname` string, `hashedaddress` string, [and more]) PARTITIONED BY (
  `calendardate` date) CLUSTERED BY (
  brandname) 

INTO 5 BUCKETS ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 

STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 

OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 

LOCATION
  '*****' 

TBLPROPERTIES (
  'transient_lastDdlTime'='1445504632')

Table contains some 20 million records, to do the tests we tried to apply a limit to the tables:

select a.hashedaddress from (select * from x limit 10000) a join (select * from x limit 10000) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;

The query runs without problems

VERTICES      	STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1 ..........SUCCEEDED     62         62        0        0       0       0 

Map 3 ..........SUCCEEDED     62         62        0        0       0       0 

Reducer 2 ......SUCCEEDED      1          1        0        0       0       0 

Reducer 4 ......SUCCEEDED      1          1        0        0       0       0 

VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 55.13 s
Then we have increased the limits up to the cap (100000. 1000000, 10000000) and it works correctly, After this changing the query removing the limitations and it works:
select a.hashedaddress from (select * from x) a join (select * from x) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;
VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1 ..........SUCCEEDED     62         62        0        0       0       0 

Map 3 ..........SUCCEEDED     62         62        0        0       0       0 

Reducer 2 ......SUCCEEDED     253        253       0        0       0       0 

VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 

Can you help me understand why not work the first query? The join does it with the same data. Thanks

1 ACCEPTED SOLUTION

Accepted Solutions

Rising Star

Finally the customer has applied the change (hive.vectorized.execution.enabled=false). Everything works correctly now.

Thanks so much

View solution in original post

10 REPLIES 10

Master Collaborator

What is the HDP or Hive version?

Rising Star

hive 0.14.0.2.2 on HDP-2.2.7.1-10 ps: is hdinsight distribution.

Master Collaborator

Please provide the explain plan of the failing query. Also can you try the failing query with hive.vectorized.execution.enabled=false?

Rising Star

Hi Deepesh.

Tonight the customer tries to disable the vectorization, when give me a feedback I update you.

For the moment, thank you!

@Alessio Ubaldi

Just to test ...can you run the query withouth vectorization?

set hive.vectorized.execution.enabled=false

Rising Star

@Deepesh suggested this solution.

Friday I informed the customer, I still waiting for a response. Just answer me I'll know you.

Thank you!

Guru

Hive doesn't support non-equijoins yet. This is coming soon.

Rising Star

Thanks for answer!

Rising Star

Finally the customer has applied the change (hive.vectorized.execution.enabled=false). Everything works correctly now.

Thanks so much

View solution in original post