Support Questions

Find answers, ask questions, and share your expertise

Spark sql cache partitions

avatar
New Contributor

Hello

 

I'm trying to run a SparkSQL query which reads data from a Hive-table, and it fails when I get above a certain threshold. I run the command:

val 500k = spark.sql("""select myid, otherfield, count(*) as cnt from mytable
group by otherfield, myid order by cnt desc limit 500000""").cache();

500k.show();


with the 500k rows being sort of the magic number. If I go higher the task fails due to the error:

15:02:05 WARN MemoryStore: Not enough space to cache rdd_52_0 in memory! (computed 2046.7 MB GB so far)
15:02:05 INFO MemoryStore: Memory use = 624.7 KB (blocks) + 2043.5 MB (scratch space shared across 1 tasks(s)) = 2044.1 MB. Storage limit = 2.7 GB.
15:02:05 WARN CacheManager: Persisting partition rdd_52_0 to disk instead.
15:17:56 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 24002)
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE


In the WebUI under storage I can see the rdd_52_0 with size 3.1 GB after written to disk.

After consulting https://stackoverflow.com/questions/43644247/understanding-spark-partitioning and https://stackoverflow.com/questions/35863441/why-i-got-the-error-size-exceed-integer-max-value-when-... I get that the overflow is due to the rdd being to big. In the WebUI only 1 rdd is shown, which is the problem - how can I force the number of cached rdds up after the shuffle?

I have tried repartitioning it with 500k.repartition(100), I have increased the number of shufflePartitions with spark.sessionstate.conf.setConf(SHUFFLE_PARTITIONS, 100) and I have increased both driver and executor memory to 16 GB.

Also where is the 2.7 GB limit coming from?

 

Thank you!

2 REPLIES 2

avatar
Rising Star

Hi jmoriarty

 

You are running into the following : https://issues.apache.org/jira/browse/SPARK-5928

 

To resolve this you will need to increase the paritions that you are using; by doing that you will be able to decrease your parition size to not exceed 2GB. 

 

Try submitting your job via the command line with 

--conf spark.sql.shuffle.partitions=2000

 

View the statistics in the Executors tab on the Spark History server to see what sizes are being showing for the executors. 

 

Thanks, 

Jordan  

avatar
New Contributor

Hi @Borg

 

Thank you for your answer.

I have tried your suggestion, with the unfortunate lack of change compared to the things I tried previously. It simply ignores it, and still creates a single rdd that exceeds the 2G limit.

 

Any way to force it?

Thanks