df.cache() is not working on jdbc table
Labels: Apache Spark
Created 07-18-2018 08:08 AM
I am creating a DataFrame with pyspark's spark.read.format("jdbc"). I want to cache the data read from the JDBC table into a DataFrame so I can reuse it in later joins and aggregations. After calling df.cache() I do not see any query executed against the RDBMS until I call df.show(), which means the data has not actually been cached yet. Whenever I use this supposedly cached DataFrame in further joins and unions, a SELECT is executed against the RDBMS each time, which is not expected and needs to be reduced.
What could be the reason for this behaviour? Is there another way to cache the data in a DataFrame?
df = spark.read.format("jdbc") \
    .option("url", "---------------------------") \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("CharSet", "iso_1") \
    .option("user", "---------------------------") \
    .option("password", "---------------------------") \
    .option("dbtable", "(select * from schema.table_name) tmp") \
    .load()
df.cache()
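For what it's worth, cache() only marks the DataFrame for caching; nothing is read from the RDBMS until an action runs. A quick way to see this (assuming Spark 2.1+, where DataFrame.storageLevel is available):

print(df.is_cached)     # True after df.cache(), but only means "marked for caching"
print(df.storageLevel)  # the intended storage level, not proof the data is materialized

df.count()              # only an action actually queries the RDBMS and fills the cache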
Created 07-19-2018 12:52 PM
You are experiencing Spark's lazy execution. When you run that code, nothing in Spark has actually been executed yet. You need to trigger an action so the cache gets populated. An easy way to fix/test this is to run something against your DataFrame (e.g. select * from it), store the result in another DataFrame, and cache that.
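A minimal sketch of that suggestion (the view name jdbc_tmp is just illustrative):

# expose the JDBC DataFrame to SQL
df.createOrReplaceTempView("jdbc_tmp")

# run a query against it, cache the result, then force an action
df2 = spark.sql("SELECT * FROM jdbc_tmp")
df2.cache()
df2.count()  # populates the cache; later joins/unions on df2 reuse it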
Created 07-19-2018 01:45 PM
The cache function is lazy, so to actually see the data cached you need to perform an action that triggers execution of the DAG. For example:
df = spark.read.format("jdbc") \
    .option("url", "---------------------------") \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("CharSet", "iso_1") \
    .option("user", "---------------------------") \
    .option("password", "---------------------------") \
    .option("dbtable", "(select * from schema.table_name) tmp") \
    .load()

df.cache()

# this will trigger the DAG and you should see the data cached
count = df.count()

# next time Spark just uses the data in the cache, so it should be faster
count2 = df.count()
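As a side note, persist() lets you choose the storage level explicitly; for DataFrames in Spark 2.x, cache() already defaults to MEMORY_AND_DISK, so the sketch below only illustrates picking a different level:

from pyspark import StorageLevel

# keep partitions in memory only (no disk spill); they are dropped if they don't fit
df.persist(StorageLevel.MEMORY_ONLY)
df.count()  # an action is still required to materialize the cache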
HTH
*** If you found this answer addressed your question, please take a moment to login and click the "accept" link on the answer.
