New Contributor
Posts: 1
Registered: 10-30-2018

Spark SQL performance issues with joins even after caching the DataFrame


I am relatively new to Spark, so please pardon me if I am asking something basic.

I am running a heavy batch process with a lot of Spark SQL queries involving joins.


My queries are too big to post here, but I will try to explain what is happening; I hope it makes sense.


I have an order (Hive) table, partitioned by a date column and a pid column.
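For context, the table definition looks roughly like this (the non-partition column names are illustrative; the real table has more columns):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("order-table-ddl")
  .enableHiveSupport()
  .getOrCreate()

// Hypothetical DDL for the order table. The important part is the
// two partition columns, which match the s3 paths in my executor log.
// `order` is backquoted because ORDER is a reserved word.
spark.sql("""
  CREATE TABLE `order` (
    order_id BIGINT,
    amount DOUBLE
  )
  PARTITIONED BY (txdate STRING, pid INT)
  STORED AS PARQUET
""")
```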


In my queries I pass the date value, but I don't pass pid. All the queries join to each other on date and pid.



I am doing the following in my Spark code:

1. Create dataset orderDataset (select * from order where date='2018-10-30')

2. orderDataset.createOrReplaceTempView("orderTable")

3. orderDataset.cache()



Now I am joining orderTable in several queries, at least 4 different ones.
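In code, the three steps above plus one of the joins look roughly like this (`otherTable` and `some_col` are stand-ins for the real tables and columns, which I can't post):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("order-batch")
  .enableHiveSupport()
  .getOrCreate()

// 1. Read only the partitions for the given date (pid is not filtered).
val orderDataset = spark.sql(
  "SELECT * FROM order WHERE date = '2018-10-30'")

// 2. Register the dataset so the SQL queries can reference it.
orderDataset.createOrReplaceTempView("orderTable")

// 3. Mark the dataset for caching. cache() itself is lazy; nothing is
//    materialized until an action runs.
orderDataset.cache()

// One of the (at least 4) queries joining on date and pid.
val result = spark.sql("""
  SELECT o.*, t.some_col
  FROM orderTable o
  JOIN otherTable t
    ON o.date = t.date AND o.pid = t.pid
""")
```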


This order table has 5000 files under its various partitions in S3.


Now the problem is that when I run the code, Spark issues 5000 tasks every time it encounters this table in a join.

18/10/30 21:37:55 INFO TaskSetManager: Finished task 121.0 in stage 14.0 (TID 352) in 8 ms on hostname (executor 3) (123/5000)

18/10/30 21:37:55 INFO TaskSetManager: Finished task 122.0 in stage 14.0 (TID 352) in 8 ms on hostname (executor 3) (124/5000)

18/10/30 21:37:55 INFO TaskSetManager: Finished task 123.0 in stage 14.0 (TID 352) in 8 ms on hostname (executor 3) (125/5000)



In other words, I see the order table's files getting loaded multiple times on the same executor at different times.



I am seeing the log entries below in the executor log many times. My guess is that all the files are loaded each time I use the order table in a join; in my example, all 5000 files are loaded every time the order table is joined.


18/10/30 21:37:46 INFO rdd.HadoopRDD: Input split: s3a://<path>/order/txdate=2015-12-17/pid=1/fil1.parquet
18/10/30 21:37:50 INFO rdd.HadoopRDD: Input split: s3a://<path>/order/txdate=2015-12-17/pid=1/fil2.parquet
18/10/30 21:37:52 INFO rdd.HadoopRDD: Input split: s3a://<path>/order/txdate=2015-12-17/pid=1/fil3.parquet
18/10/30 21:37:54 INFO rdd.HadoopRDD: Input split: s3a://<path>/order/txdate=2015-12-17/pid=1/fil4.parquet




My question is: why is the cache not being used, and why are the files loaded so many times? This is drastically impacting the performance of my application.


I could be doing something fundamentally wrong. Any suggestions would be greatly appreciated.