07-18-2018 08:08 AM
I am creating a DataFrame with the PySpark SQL JDBC reader and want to cache the data read from the JDBC table so I can reuse it in later joins and aggregations. After calling df.cache() I do not see any query executed in the RDBMS until I call df.show(), which means the data has not actually been cached yet. Whenever I then use this "cached" df in further joins and unions, a SELECT is executed against the RDBMS each time, which is unexpected and needs to be reduced. What could be the reason for this behaviour? Is there any other way to cache data in a DataFrame?

df = spark.read\
    .format("jdbc")\
    .option("url", "---------------------------")\
    .option("driver", "com.sap.db.jdbc.Driver")\
    .option("CharSet", "iso_1")\
    .option("user", "---------------------------")\
    .option("password", "---------------------------")\
    .option("dbtable", "(select * from schema.table_name) tmp")\
    .load()
df.cache()
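For context on the behaviour being asked about: in Spark, cache() is lazy and only marks the DataFrame for caching; nothing is read or stored until an action (show(), count(), etc.) runs. The following is a minimal pure-Python sketch (a toy model, not Spark itself, with a hypothetical LazyFrame class) illustrating why no SELECT fires on cache() alone and why a first action is needed to materialize the cache:

```python
class LazyFrame:
    """Toy model of Spark's lazy caching: cache() only sets a flag;
    the expensive read runs on the first action and is then reused."""

    def __init__(self, reader):
        self._reader = reader        # stands in for the JDBC SELECT
        self._cached = False
        self._data = None

    def cache(self):
        self._cached = True          # just a marker, nothing runs yet
        return self

    def _materialize(self):
        if self._cached and self._data is not None:
            return self._data        # served from cache, no re-read
        data = self._reader()        # "hits the database"
        if self._cached:
            self._data = data        # populate cache on first action
        return data

    def count(self):                 # an action: triggers the read
        return len(self._materialize())


reads = []                           # record each simulated SELECT
df = LazyFrame(lambda: reads.append(1) or [1, 2, 3]).cache()

assert reads == []                   # cache() alone: no read yet
assert df.count() == 3               # first action: one read happens
assert df.count() == 3               # second action: served from cache
assert len(reads) == 1               # the source was queried only once
```

In real Spark the analogous pattern is df.cache() followed by an eager action such as df.count() to populate the cache before the joins and unions.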
Labels:
- Apache Spark