
df.cache() is not working on jdbc table

New Contributor

I am creating a DataFrame with the PySpark SQL JDBC reader and want to cache the data read from the JDBC table so I can reuse it in joins and aggregations. After calling df.cache() I do not see any query executed in the RDBMS until I call df.show(), which means the data has not been cached yet. Whenever I use this "cached" df in further joins and unions, a SELECT is executed in the RDBMS each time, which is unexpected and needs to be reduced.

What could be the reason for this behaviour? Is there any other way to cache data in a df?

df = spark.read \
	.format("jdbc")\
	.option("url","---------------------------")\
	.option("driver","com.sap.db.jdbc.Driver")\
	.option("CharSet","iso_1")\
	.option("user","---------------------------")\
	.option("password", "---------------------------")\
	.option("dbtable","(select * from schema.table_name) tmp")\
	.load()
df.cache()

1 ACCEPTED SOLUTION



@Papil Patil

The cache() function is lazy, so in order to see the data cached you have to perform an action that triggers execution of the DAG. For example:

df = spark.read \
	.format("jdbc")\
	.option("url","---------------------------")\
	.option("driver","com.sap.db.jdbc.Driver")\
	.option("CharSet","iso_1")\
	.option("user","---------------------------")\
	.option("password", "---------------------------")\
	.option("dbtable","(select * from schema.table_name) tmp")\
	.load()
df.cache()
# this action triggers the DAG, and the data is cached as a side effect
count = df.count()
# the next action reuses the cached data, so it runs faster
count2 = df.count()

HTH

*** If you found that this answer addressed your question, please take a moment to log in and click the "Accept" link on the answer.