I'm trying to run a SparkSQL query which reads data from a Hive-table, and it fails when I get above a certain threshold. I run the command:
val 500k = spark.sql("""select myid, otherfield, count(*) as cnt from mytable group by otherfield, myid order by cnt desc limit 500000""").cache();
with the 500k rows being sort of the magic number. If I go higher the task fails due to the error:
15:02:05 WARN MemoryStore: Not enough space to cache rdd_52_0 in memory! (computed 2046.7 MB GB so far) 15:02:05 INFO MemoryStore: Memory use = 624.7 KB (blocks) + 2043.5 MB (scratch space shared across 1 tasks(s)) = 2044.1 MB. Storage limit = 2.7 GB. 15:02:05 WARN CacheManager: Persisting partition rdd_52_0 to disk instead. 15:17:56 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 24002) java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
In the WebUI under storage I can see the rdd_52_0 with size 3.1 GB after written to disk.
After consulting https://stackoverflow.com/questions/43644247/understanding-spark-partitioning and https://stackoverflow.com/questions/35863441/why-i-got-the-error-size-exceed-integer-max-value-when-... I get that the overflow is due to the rdd being to big. In the WebUI only 1 rdd is shown, which is the problem - how can I force the number of cached rdds up after the shuffle?
I have tried repartitioning it with 500k.repartition(100), I have increased the number of shufflePartitions with spark.sessionstate.conf.setConf(SHUFFLE_PARTITIONS, 100) and I have increased both driver and executor memory to 16 GB.
Also where is the 2.7 GB limit coming from?
You are running into the following : https://issues.apache.org/jira/browse/SPARK-5928
To resolve this you will need to increase the paritions that you are using; by doing that you will be able to decrease your parition size to not exceed 2GB.
Try submitting your job via the command line with
View the statistics in the Executors tab on the Spark History server to see what sizes are being showing for the executors.
Thank you for your answer.
I have tried your suggestion, with the unfortunate lack of change compared to the things I tried previously. It simply ignores it, and still creates a single rdd that exceeds the 2G limit.
Any way to force it?