Created 03-12-2018 06:44 PM
I have a views table joining with a temp table with the below parameters intentionally enabled.
hive.auto.convert.join=true;
hive.execution.engine=tez;
The Code Snippet is,
CREATE TABLE STG_CONVERSION AS
SELECT CONV.CONVERSION_ID,
CONV.USER_ID,
TP.TIME,
CONV.TIME AS ACTIVITY_TIME,
TP.MULTI_DIM_ID,
CONV.CONV_TYPE_ID,
TP.SV1
FROM VIEWS TP
JOIN SCU_TMP CONV ON TP.USER_ID = CONV.USER_ID
WHERE TP.TIME <= CONV.TIME;
In the normal scenario, both the tables can have any number of records.
However,in the SCU_TMP table, only 10-50 records are expected with the same User Id.
But in some cases, couple of User IDs come with around 10k-20k records in SCU Temp table, which creates a cross product effect.
In such cases, it'll run for ever with just 1 mapper to complete.
Is there any way to optimise this and run this gracefully?
Created 03-16-2018 10:13 PM
Can you try to tune it by changing the following ( try 2G then 4G or 6G ..) :
set hive.tez.container.size=2048;
set hive.tez.java.opts=-Xmx2048m;
Created 03-17-2018 03:31 PM
As you have scenarios for skew data in the joining column, enable skew join optimization.
set hive.optimize.skewjoin=true
set hive.skewjoin.key=5000
you can tune it further with number of mapper tasks and split size by hive.skewjoin.mapjoin.map.tasks and hive.skewjoin.mapjoin.min.split properties.