question Re: df.cache() is not working on jdbc table in Archives of Support Questions (Read Only)

df.cache() is not working on jdbc table

papil_patil15 — Wed, 18 Jul 2018 15:08:27 GMT

I am creating a dataframe using pyspark sql jdbc.read(). I want to cache the data read from jdbc table into a df to use it further in joins and agg. By using df.cache() I cannot see any query in rdbms executed for reading data unless I do df.show(). It means that data is not cached yet. Whenever I am using this cached df in further joins and unions, each time a SELECT is executed in rdbms which is not expected and needs to be reduced.

What could be the possible reason for this behaviour. Is there any other way to cache data in df ?

df = spark.read
	.format("jdbc")\
	.option("url","---------------------------")\
	.option("driver","com.sap.db.jdbc.Driver")
	.option("CharSet","iso_1")\
	.option("user","---------------------------")\
	.option("password", "---------------------------")\
	.option("dbtable","(select * from schema.table_name ) tmp ")\
	.load()
df.cache()

Re: df.cache() is not working on jdbc table

sunile_manjee — Thu, 19 Jul 2018 19:52:04 GMT

You are experiencing sparks lazy execution. When you execute your code, nothing in spark has been executed. You need to cache post execution. For example, a easy way to fix/test this is to run a something against your DF... (ie select * from df). store the results in a another DF and cache it thereafter.

Re: df.cache() is not working on jdbc table

falbani — Thu, 19 Jul 2018 20:45:25 GMT

@Papil Patil

cache function is lazy, so in order to see the data cached you should actually perform an action that would trigger the execution of the dag. For example:

df = spark.read
	.format("jdbc")\
	.option("url","---------------------------")\
	.option("driver","com.sap.db.jdbc.Driver")
	.option("CharSet","iso_1")\
	.option("user","---------------------------")\
	.option("password", "---------------------------")\
	.option("dbtable","(select * from schema.table_name ) tmp ")\
	.load()
df.cache()
//this will trigger the dag and you should see data cache 
val count = df.count()
//next time it will just use the data in cache so it should be faster to execute
val count2 = df.count()

HTH

*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.