Support Questions

Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Data Analytics (CDA) group hub.

how to optimize this query where the join happens between 2 table

New Contributor

I am trying to join 2 tables and the query that I am using is similar to the below

select * from table t1 join table t2 on

t1.cid=t2.cid and

t1.name = t2.name where t1.starttime > t2.starttime and t2.endtime<t2.endtime

The query gets heavily skewed up and takes the longest time in the last reducer and @99% runs for about an hours and fails. If I execute it for single day I get the output. I can send the log file if needed

1 REPLY 1

Hi @Sajitha S Based on the question which you have described I understand for the two table on which you are performing join its getting skewed on the join fields.

However try setting this parameters in hive and execute your hive query.

set hive.optimixe.skewjoin=true;

set hive.exec.parallel=true;

set hive.vectorized.execution.enabled = true;

set hive.vectorized.execution.reduce.enabled = true;

set hive.vectorized.execution.reduce.groupby.enabled = true;

ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS;

ANALYZE TABLE tablename COMPUTE STATISTICS FOR COLUMNS colnames;

Also if possible share the logs as well.

Thanks,

Bala

Take a Tour of the Community
Don't have an account?
Your experience may be limited. Sign in to explore more.