
Checkpointing dataframe in Spark 1.6.0

Hello community,

 

In Spark >= 2.1 (the release where DataFrame.checkpoint() was added), the following statement should be correct:

 

df2_DF = df1_DF.checkpoint()

But of course if I try to perform checkpointing on a dataframe in Spark 1.6, I get the following exception, as checkpointing wasn't yet implemented:

 

AttributeError: 'DataFrame' object has no attribute 'checkpoint'

Is there a quick way to implement this in Spark 1.6.0 using a workaround?

 

What I've tried so far (similar to what I saw in a post on Stack Overflow) is the following approach, which goes through an intermediate RDD conversion of the original DataFrame df1:

 

...
sc.setCheckpointDir("hdfs:///tmp")
...
df1 = sqlContext.table("<TABLE_NAME>")
...
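rdd1 = df1.rdd       # keep one reference so the same RDD is checkpointed and reused
rdd1.checkpoint()    # only marks the RDD for checkpointing; nothing is written yet
rdd1.count()         # an action (note the parentheses) materializes the checkpoint

df2 = sqlContext.createDataFrame(rdd1, df1.schema)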
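
The workaround above can be wrapped in a small helper. This is only a minimal sketch, assuming the checkpoint directory has already been set with sc.setCheckpointDir(...); the function name checkpoint_via_rdd is hypothetical and not part of the Spark API.

```python
def checkpoint_via_rdd(sql_context, df):
    """Return a new DataFrame backed by a checkpointed copy of df's RDD.

    Hypothetical helper (not a Spark API). Assumes the caller has already
    called sc.setCheckpointDir(...) on the SparkContext.
    """
    rdd = df.rdd          # hold a single reference to the underlying RDD
    rdd.checkpoint()      # mark the RDD for checkpointing (lazy)
    rdd.count()           # run an action so the checkpoint is actually written
    # Rebuild a DataFrame on top of the checkpointed RDD, reusing the schema.
    return sql_context.createDataFrame(rdd, df.schema)
```

It would be called as df2 = checkpoint_via_rdd(sqlContext, df1). Since the RDD is computed once for count() and again when the checkpoint is written, persisting it beforehand may avoid the recomputation.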

But I'd like to ask:

- Is anybody else using this approach? If so, could you please share your feedback?

- Is there a different, better approach? In reality I'm updating a DataFrame inside a loop, and it gets "heavier" with each iteration.

- If the above is the only option, does it make sense to directly reassign the "should-now-be-checkpointed" RDD to the original DataFrame in the last statement? Something like:

 

df1 = sqlContext.createDataFrame(df1.rdd, df1.schema)

- Any other observations/comments?
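
To make the loop scenario concrete, here is a rough sketch of the pattern in question: a DataFrame is transformed repeatedly and rebuilt from a checkpointed RDD every few iterations. The names run_loop, step, n_iter and every are hypothetical placeholders, not my actual job, and the sketch assumes sc.setCheckpointDir(...) has already been called.

```python
def run_loop(sql_context, df, step, n_iter, every=5):
    """Apply step(df) n_iter times, checkpointing every `every` iterations.

    Hypothetical sketch: `step` is any function that takes a DataFrame and
    returns a new one. Assumes sc.setCheckpointDir(...) was called earlier.
    """
    for i in range(1, n_iter + 1):
        df = step(df)                  # the DataFrame gets "heavier" here
        if i % every == 0:
            rdd = df.rdd               # single reference to the RDD
            rdd.checkpoint()           # mark for checkpointing (lazy)
            rdd.count()                # action that writes the checkpoint
            # Reassign the same variable to a DataFrame built on the
            # checkpointed RDD, keeping the original schema.
            df = sql_context.createDataFrame(rdd, df.schema)
    return df
```

Whether a given value of every is a good trade-off between checkpoint I/O and plan size is something I have not measured.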

 

Thanks in advance for any insight or help.

 

 

 

 
