Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Checkpointing dataframe in Spark 1.6.0

Checkpointing dataframe in Spark 1.6.0

Rising Star

Hello community,

 

In Spark >= 2.0, the following statement should be correct:

 

df2_DF = df1_DF.checkpoint()

But of course if I try to perform checkpointing on a dataframe in Spark 1.6, I get the following exception, as checkpointing wasn't yet implemented:

 

AttributeError: 'DataFrame' object has no attribute 'checkpoint'

Is there a quick way to implement this in Spark 1.6.0 using a workaround?

 

What I've tried until now is (similar to what I saw in a post on Stack Overflow) the following approach, using an intermediate rdd conversion for the original Dataframe df1:

 

...
sc.setCheckpointDir("hdfs:///tmp")
...
df1 = sqlContext.table("<TABLE_NAME>")
...
df1.rdd.checkpoint()
df1.rdd.count

df2 = sqlContext.createDataFrame(df1.rdd, df1.schema)

But I'd like to ask if:

 

- Everybody else is using this approach? Can you pls give feedback if you do?

 

- If there is a different, better approach? What I'm doing in reality is looping over a Dataframe and it gets "heavier" at each loop

 

- If the above is the only option, does it make sense if in the last statement I directly reassign the "should-now-be-checkpointed-rdd" to the original Dataframe? Something like:

 

df1 = sqlContext.createDataFrame(df1.rdd, df1.schema)

- Any other observations/comments?

 

Thanks in advance for any insight or help