Posts: 49
Registered: ‎01-05-2016

Checkpointing dataframe in Spark 1.6.0

Hello community,


In Spark >= 2.0, the following statement should be correct:


df2_DF = df1_DF.checkpoint()

But of course if I try to perform checkpointing on a dataframe in Spark 1.6, I get the following exception, as checkpointing wasn't yet implemented:


AttributeError: 'DataFrame' object has no attribute 'checkpoint'

Is there a quick way to implement this in Spark 1.6.0 using a workaround?


What I've tried until now is (similar to what I saw in a post on Stack Overflow) the following approach, using an intermediate rdd conversion for the original Dataframe df1:


df1 = sqlContext.table("<TABLE_NAME>")

df2 = sqlContext.createDataFrame(df1.rdd, df1.schema)

But I'd like to ask if:


- Everybody else is using this approach? Can you pls give feedback if you do?


- If there is a different, better approach? What I'm doing in reality is looping over a Dataframe and it gets "heavier" at each loop


- If the above is the only option, does it make sense if in the last statement I directly reassign the "should-now-be-checkpointed-rdd" to the original Dataframe? Something like:


df1 = sqlContext.createDataFrame(df1.rdd, df1.schema)

- Any other observations/comments?


Thanks in advance for any insight or help