I have a few DataFrames that I load from Hive tables before creating my JavaDStream. My Spark Streaming code joins the streaming data against this reference data to produce the final DataFrame to persist in Hive. Every hour the reference Hive tables are refreshed with new data by separate Hive jobs running externally (not by the Spark job). This means my streaming job must also refresh its DataFrames hourly (or periodically) to pick up the latest data. Since I cannot stop and restart the streaming job to re-read the Hive tables into DataFrames, I need a way to refresh these DataFrames on the fly.
// This is where I get the messages from kafka
JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream......
// get all hive tables (runs only once when the spark job is submitted, outside of the streaming loop)
// streaming loop - create RDDs for all streaming messages, runs continuously
JavaDStream<team> NFLTeams = messages.map...etc.
How can I refresh patriotsDF, falconsDF, superbowlDF, etc. while the streaming job is still running? Can another process refresh and reload these DataFrames and share them with this job, so that new messages can join against the latest data seamlessly?
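One common pattern for "refresh shared reference data while the consumer keeps running" is to hold the data behind an `AtomicReference` and have a background scheduler swap in a fresh snapshot every hour; each micro-batch reads whichever snapshot is current when it starts. The sketch below is plain Java, independent of Spark: `ReferenceDataHolder`, `loadFromHive`, and the map contents are all hypothetical stand-ins for re-reading the Hive tables into DataFrames.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class ReferenceDataHolder {
    // Latest snapshot of the reference data; readers always see a complete map,
    // never a half-updated one.
    private final AtomicReference<Map<String, String>> snapshot =
            new AtomicReference<>(Map.of());

    // Stand-in for re-reading the Hive table; returns a fresh immutable map.
    private Map<String, String> loadFromHive() {
        return Map.of("NE", "Patriots", "ATL", "Falcons");
    }

    public void refresh() {
        // Atomic swap: batches already holding the old snapshot finish with it;
        // the next batch picks up the new one.
        snapshot.set(loadFromHive());
    }

    public Map<String, String> current() {
        return snapshot.get();
    }

    public static void main(String[] args) {
        ReferenceDataHolder holder = new ReferenceDataHolder();
        holder.refresh(); // initial load when the job is submitted

        // Reload every hour in the background while the streaming loop runs.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(holder::refresh, 1, 1, TimeUnit.HOURS);

        // Each "micro-batch" grabs the current snapshot and joins against it.
        Map<String, String> ref = holder.current();
        System.out.println(ref.get("NE"));

        scheduler.shutdown();
    }
}
```

In the Spark version, `refresh()` would re-create patriotsDF, falconsDF, and superbowlDF from the Hive tables, and the code inside the streaming loop would read the holder's current references instead of captured variables.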
Can you periodically run uncacheTable("table name") followed by cacheTable("table name")? Both methods are inherited by HiveContext from SQLContext. Put the code you have above in a method, add the uncacheTable and cacheTable lines, and then simply call that method every hour. I am sorry if I am missing something.
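The "call the method every hour" part can be driven from inside the streaming loop itself, with no extra thread: record when the tables were last refreshed and, at the top of each micro-batch, reload once an hour has elapsed. A minimal plain-Java sketch of that timing logic follows; `refreshTables` is a hypothetical stand-in for the `uncacheTable`/`cacheTable` calls and DataFrame reloads, which need a live HiveContext.

```java
import java.time.Duration;
import java.time.Instant;

public class HourlyRefresh {
    private static final Duration REFRESH_INTERVAL = Duration.ofHours(1);
    private Instant lastRefresh = Instant.MIN; // forces a refresh on the first batch
    private int refreshCount = 0;

    // Stand-in for: sqlContext.uncacheTable(...); sqlContext.cacheTable(...);
    // followed by re-creating patriotsDF, falconsDF, superbowlDF from Hive.
    private void refreshTables() {
        refreshCount++;
    }

    // Call at the start of every micro-batch (e.g. inside foreachRDD).
    public void maybeRefresh(Instant now) {
        if (Duration.between(lastRefresh, now).compareTo(REFRESH_INTERVAL) >= 0) {
            refreshTables();
            lastRefresh = now;
        }
    }

    public static void main(String[] args) {
        HourlyRefresh r = new HourlyRefresh();
        Instant t0 = Instant.parse("2017-01-01T00:00:00Z");
        r.maybeRefresh(t0);                   // first batch: refreshes
        r.maybeRefresh(t0.plusSeconds(600));  // 10 minutes later: skipped
        r.maybeRefresh(t0.plusSeconds(3601)); // over an hour later: refreshes again
        System.out.println(r.refreshCount);
    }
}
```

Checking inside the batch loop keeps the refresh on the driver thread, which matters because SQLContext/HiveContext operations should run on the driver, not on a separate unmanaged thread.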