Created 07-18-2018 08:08 AM
I am creating a DataFrame using the PySpark SQL JDBC reader. I want to cache the data read from the JDBC table into a DataFrame so I can reuse it in later joins and aggregations. After calling df.cache(), I do not see any query executed against the RDBMS until I call df.show(), which means the data has not been cached yet. Whenever I then use this supposedly cached df in further joins and unions, a SELECT is executed against the RDBMS each time, which is unexpected and needs to be reduced.
What could be the reason for this behaviour? Is there another way to cache data in a DataFrame?
df = spark.read.format("jdbc") \
    .option("url", "---------------------------") \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("CharSet", "iso_1") \
    .option("user", "---------------------------") \
    .option("password", "---------------------------") \
    .option("dbtable", "(select * from schema.table_name ) tmp ") \
    .load()

df.cache()
Created 07-19-2018 12:52 PM
You are experiencing Spark's lazy execution. When you run that code, nothing in Spark has actually been executed yet; cache() only marks the DataFrame for caching. You need to trigger an action after calling cache(). An easy way to test this is to run an action against your DataFrame (e.g. a count, or a select * from df query), store the results in another DataFrame, and the cache will be populated as a side effect.
Created 07-19-2018 01:45 PM
The cache() function is lazy, so in order to see the data cached you need to perform an action that triggers execution of the DAG. For example:
df = spark.read.format("jdbc") \
    .option("url", "---------------------------") \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("CharSet", "iso_1") \
    .option("user", "---------------------------") \
    .option("password", "---------------------------") \
    .option("dbtable", "(select * from schema.table_name ) tmp ") \
    .load()

df.cache()

# This action triggers the DAG, and you should see the data cached
count = df.count()

# Next time Spark will just use the data in the cache, so it should be faster
count2 = df.count()
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.