Support Questions
Find answers, ask questions, and share your expertise
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Spark sql cache partitions

Spark sql cache partitions

New Contributor



I'm trying to run a SparkSQL query which reads data from a Hive-table, and it fails when I get above a certain threshold. I run the command:

val 500k = spark.sql("""select myid, otherfield, count(*) as cnt from mytable
group by otherfield, myid order by cnt desc limit 500000""").cache();;

with the 500k rows being sort of the magic number. If I go higher the task fails due to the error:

15:02:05 WARN MemoryStore: Not enough space to cache rdd_52_0 in memory! (computed 2046.7 MB GB so far)
15:02:05 INFO MemoryStore: Memory use = 624.7 KB (blocks) + 2043.5 MB (scratch space shared across 1 tasks(s)) = 2044.1 MB. Storage limit = 2.7 GB.
15:02:05 WARN CacheManager: Persisting partition rdd_52_0 to disk instead.
15:17:56 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 24002)
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

In the WebUI under storage I can see the rdd_52_0 with size 3.1 GB after written to disk.

After consulting and I get that the overflow is due to the rdd being to big. In the WebUI only 1 rdd is shown, which is the problem - how can I force the number of cached rdds up after the shuffle?

I have tried repartitioning it with 500k.repartition(100), I have increased the number of shufflePartitions with spark.sessionstate.conf.setConf(SHUFFLE_PARTITIONS, 100) and I have increased both driver and executor memory to 16 GB.

Also where is the 2.7 GB limit coming from?


Thank you!


Re: Spark sql cache partitions


Hi jmoriarty


You are running into the following :


To resolve this you will need to increase the paritions that you are using; by doing that you will be able to decrease your parition size to not exceed 2GB. 


Try submitting your job via the command line with 

--conf spark.sql.shuffle.partitions=2000


View the statistics in the Executors tab on the Spark History server to see what sizes are being showing for the executors. 




Re: Spark sql cache partitions

New Contributor

Hi @Borg


Thank you for your answer.

I have tried your suggestion, with the unfortunate lack of change compared to the things I tried previously. It simply ignores it, and still creates a single rdd that exceeds the 2G limit.


Any way to force it?