Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

How to cache the left most table in a left outer join in hive

Highlighted

How to cache the left most table in a left outer join in hive

New Contributor
I have a large table (1Tb of data) that needs to be joined with a smaller table (100k records).


In the above scenario, I am not able to control which table has to be cached into memory.I have tried using MAPJOIN , STREAMTABLE hints and also tried with parameters like conditional task size , small tbale size etc. Since the small table is on the left most side of the join, its not being cached into memory

Is there way to control the table that needs to be cached

Note: I cannot alter the table positions or the code:


SELECT st.id FROM small_table st LEFT JOIN large_table lt        ON st.id = lt.id
set hive.execution.engine=tez; set hive.tez.container.size=4096; set hive.merge.mapredfiles=true; set tez.shuffle-vertex-manager.min-src-fraction=0.25; set tez.shuffle-vertex-manager.max-src-fraction=0.75; set hive.exec.dynamic.partition.mode=nonstrict; set tez.am.resource.memory.mb=3200 ; set tez.am.java.opts=-server -Xmx3200m -Djava.net.preferIPv4Stack=true -XX:+UseNUMA -XX:+UseParallelGC -XX:+UseConcMarkSweepGC ; SET hive.auto.convert.join=true; set hive.auto.convert.join.noconditionaltask.size=288435456;