Spark Cache Behaviour

Hi All,

I am stuck on an unusual problem.

I created two DataFrames:

val spostn = hiveCtx.sql("select ou_id,row_id,name,POSTN_TYPE_CD from src_si_S_POSTN")

val sorgext = hiveCtx.sql("select row_id,name,par_divn_id from src_si_S_ORG_EXT").persist()

sorgext.count()

I persist the second DataFrame because I use it several times in the code.

If I join using this persisted DataFrame as follows:

JOIN 1: val sub_divison_qry = spostn.as("spn").join(sorgext.as("soe"),spostn("OU_ID")===sorgext("ROW_ID") && spostn("POSTN_TYPE_CD")==="Regional Manager")

JOIN 2: val divison_qry = sorgext.as("soe2").join(sub_divison_qry.as("sub_divison_qry"),sorgext("ROW_ID")===sub_divison_qry("PAR_DIVN_ID"))

sub_divison_qry.head returns some data, but divison_qry is empty.

If I create a third DataFrame on the same table (src_si_S_ORG_EXT) and then run JOIN 1 and JOIN 2 with it:

val sorgext2 = hiveCtx.sql("select row_id,name,par_divn_id from src_si_S_ORG_EXT")

JOIN 1: val sub_divison_qry = spostn.as("spn").join(sorgext.as("soe"),spostn("OU_ID")===sorgext("ROW_ID") && spostn("POSTN_TYPE_CD")==="Regional Manager")

JOIN 2: val divison_qry = sorgext2.as("soe2").join(sub_divison_qry.as("sub_divison_qry"),sorgext2("ROW_ID")===sub_divison_qry("PAR_DIVN_ID"))

Both JOIN 1 and JOIN 2 return output.

My understanding is that we cache data so that it can be reused. However, that does not seem to work here: a new DataFrame on the same table appears to be needed when the table is joined more than once.
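
One way to check whether the cache is really the cause is to rewrite both joins so that every column is referenced through its alias rather than through the parent DataFrame. This is only a sketch: the alias "sub", the use of col, and the show(5) call are not part of the code above, and it assumes the empty result is related to how sorgext("ROW_ID") and sub_divison_qry("PAR_DIVN_ID") are resolved when the same DataFrame appears on both sides of the join, which has not been confirmed here.

```scala
import org.apache.spark.sql.functions.col

// Assumes hiveCtx (a HiveContext) and the two source tables from this post already exist.
val spostn  = hiveCtx.sql("select ou_id,row_id,name,POSTN_TYPE_CD from src_si_S_POSTN")
val sorgext = hiveCtx.sql("select row_id,name,par_divn_id from src_si_S_ORG_EXT").persist()

// JOIN 1: columns are looked up by alias ("spn", "soe"), not via spostn(...) / sorgext(...).
val sub_divison_qry = spostn.as("spn").join(
  sorgext.as("soe"),
  col("spn.OU_ID") === col("soe.ROW_ID") &&
    col("spn.POSTN_TYPE_CD") === "Regional Manager")

// JOIN 2: the same persisted DataFrame under a second alias ("soe2").
// "sub" is a hypothetical alias chosen for this sketch.
val divison_qry = sorgext.as("soe2").join(
  sub_divison_qry.as("sub"),
  col("soe2.ROW_ID") === col("sub.PAR_DIVN_ID"))

// Print a few rows to compare against the result obtained with sorgext2.
divison_qry.show(5)
```

If this version returns rows while still using the persisted sorgext, the difference would point at column resolution in the join condition rather than at persist() itself.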

Please help

Regards

Upendra