Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

how to optimize this query where the join happens between 2 table

how to optimize this query where the join happens between 2 table

New Contributor

I am trying to join 2 tables and the query that I am using is similar to the below

select * from table t1 join table t2 on

t1.cid=t2.cid and

t1.name = t2.name where t1.starttime > t2.starttime and t2.endtime<t2.endtime

The query gets heavily skewed up and takes the longest time in the last reducer and @99% runs for about an hours and fails. If I execute it for single day I get the output. I can send the log file if needed

1 REPLY 1
Highlighted

Re: how to optimize this query where the join happens between 2 table

Hi @Sajitha S Based on the question which you have described I understand for the two table on which you are performing join its getting skewed on the join fields.

However try setting this parameters in hive and execute your hive query.

set hive.optimixe.skewjoin=true;

set hive.exec.parallel=true;

set hive.vectorized.execution.enabled = true;

set hive.vectorized.execution.reduce.enabled = true;

set hive.vectorized.execution.reduce.groupby.enabled = true;

ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;

ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS colnames;

Also if possible share the logs as well.

Thanks,

Bala

Don't have an account?
Coming from Hortonworks? Activate your account here