Refresh Dataframe in Spark real-time Streaming without stopping process


Super Guru

@Mark Wallace

Can you please provide more details?

I have a few DataFrames that are loaded from Hive tables before I create my JavaDStream. My Spark Streaming code joins against this reference data to produce the final DataFrame, which is persisted in Hive. Every hour, the reference Hive tables are refreshed with new data by separate Hive jobs that run outside the Spark job. My streaming job therefore needs to refresh the DataFrames hourly, or at some regular interval, to pick up the latest data. Since I cannot stop and restart the streaming job to re-read the Hive tables, I need a way to refresh these DataFrames on the fly.

// This is where I get the messages from kafka

JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream......

// get all hive tables (runs only once when the spark job is submitted, outside of the streaming loop)

DataFrame superbowlDF=hiveContext.table("nfl.superbowl").cache();

DataFrame patriotsDF=hiveContext.table("nfl.patriots").cache();

DataFrame falconsDF=hiveContext.table("nfl.falcons").cache();

// streaming loop - create RDDs for all streaming messages, runs continuously

JavaDStream<team> NFLTeams =

How can I refresh patriotsDF, falconsDF, superbowlDF, etc. while the streaming job is still running? Can another process refresh and reload these DataFrames and share them with this process, so that new messages can join against them seamlessly?

Super Guru
@Mark Wallace

Can you periodically run uncacheTable("table name") followed by cacheTable("table name")? HiveContext inherits both methods from SQLContext. Put the code you have above in a method, add the uncache lines to it, and then simply call that method every hour. I am sorry if I am missing something.
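
One way to "call the method every hour" without stopping the job is to keep each reference dataset behind a holder that a background scheduler reloads. Below is a minimal sketch of that pattern in plain Java. The class name RefreshableReference and its methods are hypothetical, not part of Spark, and the loader is assumed to be whatever re-reads the table (for example, a lambda wrapping the hiveContext.table(...) call from the question).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Hypothetical helper: holds a snapshot of reference data and reloads it on a schedule. */
final class RefreshableReference<T> implements AutoCloseable {
    private final Supplier<T> loader;            // assumption: re-reads the source on every call
    private final AtomicReference<T> current;    // latest successfully loaded snapshot
    private final ScheduledExecutorService scheduler;

    RefreshableReference(Supplier<T> loader) {
        this.loader = loader;
        this.current = new AtomicReference<>(loader.get()); // initial load at startup
        this.scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "reference-refresh");
            t.setDaemon(true); // do not keep the JVM alive just for refreshes
            return t;
        });
    }

    /** Reload every `period` units; the first reload runs after one full period. */
    void scheduleRefresh(long period, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::refreshNow, period, period, unit);
    }

    /** Reload immediately; if the loader throws, the previous snapshot stays in place. */
    void refreshNow() {
        try {
            current.set(loader.get());
        } catch (RuntimeException e) {
            // Swallowed on purpose: an uncaught exception would cancel future scheduled runs.
        }
    }

    /** Returns the most recent snapshot; call this once per batch, not once per job. */
    T get() {
        return current.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

In the streaming job this might be wired up as new RefreshableReference<>(() -> hiveContext.table("nfl.patriots").cache()) followed by scheduleRefresh(1, TimeUnit.HOURS), with each batch calling get() inside the streaming loop so that new messages join against the latest snapshot. Whether the old cached DataFrame also needs an explicit unpersist() depends on memory pressure and is left out of this sketch.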