07-18-2018 08:08 AM
I am creating a DataFrame with the PySpark SQL JDBC reader and want to cache the data read from the JDBC table so I can reuse it in later joins and aggregations. After calling df.cache() I do not see any query executed in the RDBMS until I call df.show(), which means the data has not actually been cached yet. Whenever I then use this "cached" df in further joins and unions, a SELECT is executed against the RDBMS each time, which is unexpected and needs to be reduced. What could be the reason for this behaviour? Is there any other way to cache data in a DataFrame?

df = spark.read\
    .format("jdbc")\
    .option("url", "---------------------------")\
    .option("driver", "com.sap.db.jdbc.Driver")\
    .option("CharSet", "iso_1")\
    .option("user", "---------------------------")\
    .option("password", "---------------------------")\
    .option("dbtable", "(select * from schema.table_name) tmp")\
    .load()
df.cache()
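For context on the behaviour being asked about: in Spark, cache() is lazy and only marks the DataFrame for caching; nothing is read or stored until an action (show(), count(), etc.) runs. The following is a minimal pure-Python sketch (a toy model, not Spark itself, with a hypothetical LazyFrame class) illustrating why no SELECT fires on cache() alone and why a first action is needed to materialize the cache:

```python
class LazyFrame:
    """Toy model of Spark's lazy caching: cache() only sets a flag;
    the expensive read runs on the first action and is then reused."""

    def __init__(self, reader):
        self._reader = reader        # stands in for the JDBC SELECT
        self._cached = False
        self._data = None

    def cache(self):
        self._cached = True          # just a marker, nothing runs yet
        return self

    def _materialize(self):
        if self._cached and self._data is not None:
            return self._data        # served from cache, no re-read
        data = self._reader()        # "hits the database"
        if self._cached:
            self._data = data        # populate cache on first action
        return data

    def count(self):                 # an action: triggers the read
        return len(self._materialize())


reads = []                           # record each simulated SELECT
df = LazyFrame(lambda: reads.append(1) or [1, 2, 3]).cache()

assert reads == []                   # cache() alone: no read yet
assert df.count() == 3               # first action: one read happens
assert df.count() == 3               # second action: served from cache
assert len(reads) == 1               # the source was queried only once
```

In real Spark the analogous pattern is df.cache() followed by an eager action such as df.count() to populate the cache before the joins and unions.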
Labels:
- Apache Spark