Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive query issue

avatar
Expert Contributor

Goodmorning everyone.

A customer has a problem with a query on hive (with Tez).

The query is as follows:

select a.hashedaddress from x a join x b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;

The process is stopped at the first step of mapping:

VERTICES	STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1           FAILED     62          0        0       62      43       7 

Map 3           KILLED     62          0        0       62       0       0 

Reducer 2       KILLED   1009          0        0     1009       0       0 

VERTICES: 00/03  [>>--------------------------] 0%    ELAPSED TIME: 26.69 s

Log error is this: vertexerror.txt

The table schema is as follows:

CREATE TABLE `x`(
  `guid` string,
  `brandname` string, `hashedaddress` string, [and more]) PARTITIONED BY (
  `calendardate` date) CLUSTERED BY (
  brandname) 

INTO 5 BUCKETS ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 

STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 

OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' 

LOCATION
  '*****' 

TBLPROPERTIES (
  'transient_lastDdlTime'='1445504632')

Table contains some 20 million records, to do the tests we tried to apply a limit to the tables:

select a.hashedaddress from (select * from x limit 10000) a join (select * from x limit 10000) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;

The query runs without problems

VERTICES      	STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

Map 1 ..........SUCCEEDED     62         62        0        0       0       0 

Map 3 ..........SUCCEEDED     62         62        0        0       0       0 

Reducer 2 ......SUCCEEDED      1          1        0        0       0       0 

Reducer 4 ......SUCCEEDED      1          1        0        0       0       0 

VERTICES: 04/04  [==========================>>] 100%  ELAPSED TIME: 55.13 s
Then we have increased the limits up to the cap (100000. 1000000, 10000000) and it works correctly, After this changing the query removing the limitations and it works:
select a.hashedaddress from (select * from x) a join (select * from x) b on (a.hashedaddress = b.hashedaddress) where a.guid != b.guid;
VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
Map 1 ..........SUCCEEDED     62         62        0        0       0       0 

Map 3 ..........SUCCEEDED     62         62        0        0       0       0 

Reducer 2 ......SUCCEEDED     253        253       0        0       0       0 

VERTICES: 03/03  [==========================>>] 100%  ELAPSED TIME: 

Can you help me understand why not work the first query? The join does it with the same data. Thanks

1 ACCEPTED SOLUTION

avatar
Expert Contributor
hide-solution

This problem has been solved!

Want to get a detailed solution you have to login/registered on the community

Register/Login
10 REPLIES 10

avatar
New Contributor

Can you also post the time taken to execute the 20 mi records? thnx !