11-12-2018 04:17 AM
In Spark >= 2.0, the following statement should be correct:
df2_DF = df1_DF.checkpoint()
But of course if I try to perform checkpointing on a dataframe in Spark 1.6, I get the following exception, as checkpointing wasn't yet implemented:
AttributeError: 'DataFrame' object has no attribute 'checkpoint'
Is there a quick way to implement this in Spark 1.6.0 using a workaround?
What I've tried until now is (similar to what I saw in a post on Stack Overflow) the following approach, using an intermediate rdd conversion for the original Dataframe df1:
... sc.setCheckpointDir("hdfs:///tmp") ... df1 = sqlContext.table("<TABLE_NAME>") ... df1.rdd.checkpoint() df1.rdd.count df2 = sqlContext.createDataFrame(df1.rdd, df1.schema)
But I'd like to ask if:
- Everybody else is using this approach? Can you pls give feedback if you do?
- If there is a different, better approach? What I'm doing in reality is looping over a Dataframe and it gets "heavier" at each loop
- If the above is the only option, does it make sense if in the last statement I directly reassign the "should-now-be-checkpointed-rdd" to the original Dataframe? Something like:
df1 = sqlContext.createDataFrame(df1.rdd, df1.schema)
- Any other observations/comments?
Thanks in advance for any insight or help