Created 07-18-2018 08:08 AM
I am creating a DataFrame using the PySpark SQL JDBC reader. I want to cache the data read from the JDBC table into a DataFrame so I can reuse it in later joins and aggregations. After calling df.cache(), I do not see any query executed against the RDBMS until I call df.show(), which means the data has not been cached yet. Whenever I then use this supposedly cached df in further joins and unions, a SELECT is executed against the RDBMS each time, which is unexpected and needs to be reduced.
What could be the reason for this behaviour? Is there another way to cache data in a DataFrame?
df = spark.read.format("jdbc") \
    .option("url", "---------------------------") \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("CharSet", "iso_1") \
    .option("user", "---------------------------") \
    .option("password", "---------------------------") \
    .option("dbtable", "(select * from schema.table_name ) tmp ") \
    .load()

df.cache()
Created 07-19-2018 12:52 PM
You are experiencing Spark's lazy execution. When you run that code, nothing in Spark has actually been executed yet; cache() only marks the DataFrame for caching. You need to trigger an action after calling cache(). An easy way to test this is to run an action against your DataFrame (e.g. a count, or a select * from df query), store the results in another DataFrame, and the cache will be populated as a side effect.
Created 07-19-2018 01:45 PM
The cache() function is lazy, so in order to see the data cached you need to perform an action that triggers execution of the DAG. For example:
df = spark.read.format("jdbc") \
    .option("url", "---------------------------") \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("CharSet", "iso_1") \
    .option("user", "---------------------------") \
    .option("password", "---------------------------") \
    .option("dbtable", "(select * from schema.table_name ) tmp ") \
    .load()

df.cache()

# This action triggers the DAG, and you should see the data cached
count = df.count()

# Next time Spark will just use the data in the cache, so it should be faster
count2 = df.count()
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.