Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Hive query issue

avatar
Expert Contributor

Goodmorning everyone.

A customer has a problem with a query on hive (with Tez).

The query is as follows:

select a.hashedaddress from x a join x b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;

The process is stopped at the first step of mapping:

VERTICES	STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1           FAILED     62          0        0       62      43       7 

Map 3           KILLED     62          0        0       62       0       0 

Reducer 2       KILLED   1009          0        0     1009       0       0 

VERTICES: 00/03  [>>--------------------------] 0%    ELAPSED TIME: 26.69 s

Log error is this: vertexerror.txt

The table schema is as follows:

CREATE TABLE `x`(
  `guid` string,
  `brandname` string, `hashedaddress` string, [and more]) PARTITIONED BY (
  `calendardate` date) CLUSTERED BY (
  brandname) 

INTO 5 BUCKETS ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 

STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 

OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 

LOCATION
  '*****' 

TBLPROPERTIES (
  'transient_lastDdlTime'='1445504632')

Table contains some 20 million records, to do the tests we tried to apply a limit to the tables:

select a.hashedaddress from (select * from x limit 10000) a join (select * from x limit 10000) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;

The query runs without problems

VERTICES      	STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1 ..........SUCCEEDED     62         62        0        0       0       0 

Map 3 ..........SUCCEEDED     62         62        0        0       0       0 

Reducer 2 ......SUCCEEDED      1          1        0        0       0       0 

Reducer 4 ......SUCCEEDED      1          1        0        0       0       0 

VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 55.13 s
Then we have increased the limits up to the cap (100000. 1000000, 10000000) and it works correctly, After this changing the query removing the limitations and it works:
select a.hashedaddress from (select * from x) a join (select * from x) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;
VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1 ..........SUCCEEDED     62         62        0        0       0       0 

Map 3 ..........SUCCEEDED     62         62        0        0       0       0 

Reducer 2 ......SUCCEEDED     253        253       0        0       0       0 

VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 

Can you help me understand why not work the first query? The join does it with the same data. Thanks

1 ACCEPTED SOLUTION

avatar
Expert Contributor

Finally the customer has applied the change (hive.vectorized.execution.enabled=false). Everything works correctly now.

Thanks so much

View solution in original post

10 REPLIES 10

avatar

What is the HDP or Hive version?

avatar
Expert Contributor

hive 0.14.0.2.2 on HDP-2.2.7.1-10 ps: is hdinsight distribution.

avatar

Please provide the explain plan of the failing query. Also can you try the failing query with hive.vectorized.execution.enabled=false?

avatar
Expert Contributor

Hi Deepesh.

Tonight the customer tries to disable the vectorization, when give me a feedback I update you.

For the moment, thank you!

avatar
Master Mentor
@Alessio Ubaldi

Just to test ...can you run the query withouth vectorization?

set hive.vectorized.execution.enabled=false

avatar
Expert Contributor

@Deepesh suggested this solution.

Friday I informed the customer, I still waiting for a response. Just answer me I'll know you.

Thank you!

avatar
Guru

Hive doesn't support non-equijoins yet. This is coming soon.

avatar
Expert Contributor

Thanks for answer!

avatar
Expert Contributor

Finally the customer has applied the change (hive.vectorized.execution.enabled=false). Everything works correctly now.

Thanks so much