Support Questions

aubaldi · ‎01-21-2016

Goodmorning everyone.

A customer has a problem with a query on hive (with Tez).

The query is as follows:

select a.hashedaddress from x a join x b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;

The process is stopped at the first step of mapping:

VERTICES	STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1           FAILED     62          0        0       62      43       7 

Map 3           KILLED     62          0        0       62       0       0 

Reducer 2       KILLED   1009          0        0     1009       0       0 

VERTICES: 00/03  [>>--------------------------] 0%    ELAPSED TIME: 26.69 s

Log error is this: vertexerror.txt

The table schema is as follows:

CREATE TABLE `x`(
  `guid` string,
  `brandname` string, `hashedaddress` string, [and more]) PARTITIONED BY (
  `calendardate` date) CLUSTERED BY (
  brandname) 

INTO 5 BUCKETS ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 

STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 

OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 

LOCATION
  '*****' 

TBLPROPERTIES (
  'transient_lastDdlTime'='1445504632')

Table contains some 20 million records, to do the tests we tried to apply a limit to the tables:

select a.hashedaddress from (select * from x limit 10000) a join (select * from x limit 10000) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;

The query runs without problems

VERTICES      	STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1 ..........SUCCEEDED     62         62        0        0       0       0 

Map 3 ..........SUCCEEDED     62         62        0        0       0       0 

Reducer 2 ......SUCCEEDED      1          1        0        0       0       0 

Reducer 4 ......SUCCEEDED      1          1        0        0       0       0 

VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 55.13 s

Then we have increased the limits up to the cap (100000. 1000000, 10000000) and it works correctly, After this changing the query removing the limitations and it works:

select a.hashedaddress from (select * from x) a join (select * from x) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;

VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1 ..........SUCCEEDED     62         62        0        0       0       0 

Map 3 ..........SUCCEEDED     62         62        0        0       0       0 

Reducer 2 ......SUCCEEDED     253        253       0        0       0       0 

VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME:

Can you help me understand why not work the first query? The join does it with the same data. Thanks

aubaldi · ‎02-03-2016

Finally the customer has applied the change (hive.vectorized.execution.enabled=false). Everything works correctly now.

Thanks so much

View solution in original post

deepesh1 · ‎01-21-2016

What is the HDP or Hive version?

aubaldi · ‎01-21-2016

hive 0.14.0.2.2 on HDP-2.2.7.1-10 ps: is hdinsight distribution.

deepesh1 · ‎01-21-2016

Please provide the explain plan of the failing query. Also can you try the failing query with hive.vectorized.execution.enabled=false?

aubaldi · ‎01-22-2016

Hi Deepesh.

Tonight the customer tries to disable the vectorization, when give me a feedback I update you.

For the moment, thank you!

nsabharwal · ‎01-23-2016

@Alessio Ubaldi

Just to test ...can you run the query withouth vectorization?

set hive.vectorized.execution.enabled=false

aubaldi · ‎01-25-2016

@Deepesh suggested this solution.

Friday I informed the customer, I still waiting for a response. Just answer me I'll know you.

Thank you!

sball · ‎01-24-2016

Hive doesn't support non-equijoins yet. This is coming soon.

aubaldi · ‎01-25-2016

Thanks for answer!

aubaldi · ‎02-03-2016

Finally the customer has applied the change (hive.vectorized.execution.enabled=false). Everything works correctly now.

Thanks so much

Cloudera Community

Support Questions

Hive query issue

Hive Query Recovery Tactics: Handling Failures thr...

Optimizing Hive queries for ORC formatted tables

Spark to support REGEX column specification for Hi...

Hive Metastore validate issue

Tuning Large Hive Queries - Part 1

Datanode Service Error Related to NFS Mount Issue

Hive view issue can not run any query

hive Insert to Dynamic Partition query Generating ...

jdbc hive interpreter fails while running hive que...

Unable to run queries from Hive LLAP queue