Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Hive + Tez :: A join query stuck at last 2 mappers for a long time

avatar
Contributor

I have a views table joining with a temp table with the below parameters intentionally enabled.

hive.auto.convert.join=true;    
hive.execution.engine=tez;

The Code Snippet is,

CREATE TABLE STG_CONVERSION AS    
SELECT CONV.CONVERSION_ID,
       CONV.USER_ID,
       TP.TIME,
       CONV.TIME AS ACTIVITY_TIME,
       TP.MULTI_DIM_ID,
       CONV.CONV_TYPE_ID,
       TP.SV1
FROM VIEWS TP
JOIN  SCU_TMP CONV ON TP.USER_ID = CONV.USER_ID
WHERE TP.TIME <= CONV.TIME;

In the normal scenario, both the tables can have any number of records.
However,in the SCU_TMP table, only 10-50 records are expected with the same User Id.

But in some cases, couple of User IDs come with around 10k-20k records in SCU Temp table, which creates a cross product effect.
In such cases, it'll run for ever with just 1 mapper to complete.

Is there any way to optimise this and run this gracefully?

2 REPLIES 2

avatar
Expert Contributor

Can you try to tune it by changing the following ( try 2G then 4G or 6G ..) :
set hive.tez.container.size=2048;
set hive.tez.java.opts=-Xmx2048m;

avatar
Expert Contributor
@Sooraj Antony

As you have scenarios for skew data in the joining column, enable skew join optimization.

set hive.optimize.skewjoin=true

set hive.skewjoin.key=5000

you can tune it further with number of mapper tasks and split size by hive.skewjoin.mapjoin.map.tasks and hive.skewjoin.mapjoin.min.split properties.