I have a views table joining with a temp table with the below parameters intentionally enabled.
The Code Snippet is,
CREATE TABLE STG_CONVERSION AS SELECT CONV.CONVERSION_ID, CONV.USER_ID, TP.TIME, CONV.TIME AS ACTIVITY_TIME, TP.MULTI_DIM_ID, CONV.CONV_TYPE_ID, TP.SV1 FROM VIEWS TP JOIN SCU_TMP CONV ON TP.USER_ID = CONV.USER_ID WHERE TP.TIME <= CONV.TIME;
In the normal scenario, both the tables can have any number of records.
However,in the SCU_TMP table, only 10-50 records are expected with the same User Id.
But in some cases, couple of User IDs come with around 10k-20k records in SCU Temp table, which creates a cross product effect.
In such cases, it'll run for ever with just 1 mapper to complete.
Is there any way to optimise this and run this gracefully?
Can you try to tune it by changing the following ( try 2G then 4G or 6G ..) :
As you have scenarios for skew data in the joining column, enable skew join optimization.
you can tune it further with number of mapper tasks and split size by hive.skewjoin.mapjoin.map.tasks and hive.skewjoin.mapjoin.min.split properties.