10-30-2018 04:40 PM
I am relatively new to Spark, so please pardon me if I am asking something basic.
I am running a heavy batch process with lots of Spark SQL queries involving joins.
My queries are too big to post here, but I will try to explain what is happening; I hope it makes sense.
I have an order (Hive) table, partitioned by date and a pid column.
In my queries I pass the date value, but I don't pass pid; all queries join to each other on date and pid.
I am doing the following in my Spark code:
1. Create a dataset orderDataset: select * from order where date='2018-10-30'
Now I am joining this order dataset in several queries, at least 4 different ones.
This order table has 5000 files under different partitions in s3.
Now the problem is that when I run the code, Spark issues 5000 tasks every time it encounters this table in a join:
18/10/30 21:37:55 INFO TaskSetManager: Finished task 121.0 in stage 14.0 (TID 352) in 8 ms on hostname (executor 3) (123/5000)
18/10/30 21:37:55 INFO TaskSetManager: Finished task 122.0 in stage 14.0 (TID 352) in 8 ms on hostname (executor 3) (124/5000)
18/10/30 21:37:55 INFO TaskSetManager: Finished task 123.0 in stage 14.0 (TID 352) in 8 ms on hostname (executor 3) (125/5000)
Meaning, I see the order table's files being loaded multiple times on the same executor at different times.
I am seeing log entries like the ones below in the executor log many times; my guess is that all the files are loaded each time I use the order table in a join. In my case, all 5000 files are loaded each time the order table is joined.
18/10/30 21:37:46 INFO rdd.HadoopRDD: Input split: s3a://<path>/order/txdate=2015-12-17/pid=1/fil1.parquet
18/10/30 21:37:50 INFO rdd.HadoopRDD: Input split: s3a://<path>/order/txdate=2015-12-17/pid=1/fil2.parquet
18/10/30 21:37:52 INFO rdd.HadoopRDD: Input split: s3a://<path>/order/txdate=2015-12-17/pid=1/fil3.parquet
18/10/30 21:37:54 INFO rdd.HadoopRDD: Input split: s3a://<path>/order/txdate=2015-12-17/pid=1/fil4.parquet
My question is: why is the cache not being used, and why are the files loaded many times? This is drastically impacting the performance of my application.
I could be doing something fundamentally wrong. Any suggestions would be greatly appreciated.