Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Spark sql cache partitions

Spark sql cache partitions

New Contributor

Hello

 

I'm trying to run a SparkSQL query which reads data from a Hive-table, and it fails when I get above a certain threshold. I run the command:

val 500k = spark.sql("""select myid, otherfield, count(*) as cnt from mytable
group by otherfield, myid order by cnt desc limit 500000""").cache();

500k.show();


with the 500k rows being sort of the magic number. If I go higher the task fails due to the error:

15:02:05 WARN MemoryStore: Not enough space to cache rdd_52_0 in memory! (computed 2046.7 MB GB so far)
15:02:05 INFO MemoryStore: Memory use = 624.7 KB (blocks) + 2043.5 MB (scratch space shared across 1 tasks(s)) = 2044.1 MB. Storage limit = 2.7 GB.
15:02:05 WARN CacheManager: Persisting partition rdd_52_0 to disk instead.
15:17:56 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 24002)
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE


In the WebUI under storage I can see the rdd_52_0 with size 3.1 GB after written to disk.

After consulting https://stackoverflow.com/questions/43644247/understanding-spark-partitioning and https://stackoverflow.com/questions/35863441/why-i-got-the-error-size-exceed-integer-max-value-when-... I get that the overflow is due to the rdd being to big. In the WebUI only 1 rdd is shown, which is the problem - how can I force the number of cached rdds up after the shuffle?

I have tried repartitioning it with 500k.repartition(100), I have increased the number of shufflePartitions with spark.sessionstate.conf.setConf(SHUFFLE_PARTITIONS, 100) and I have increased both driver and executor memory to 16 GB.

Also where is the 2.7 GB limit coming from?

 

Thank you!

2 REPLIES 2

Re: Spark sql cache partitions

Contributor

Hi jmoriarty

 

You are running into the following : https://issues.apache.org/jira/browse/SPARK-5928

 

To resolve this you will need to increase the paritions that you are using; by doing that you will be able to decrease your parition size to not exceed 2GB. 

 

Try submitting your job via the command line with 

--conf spark.sql.shuffle.partitions=2000

 

View the statistics in the Executors tab on the Spark History server to see what sizes are being showing for the executors. 

 

Thanks, 

Jordan  

Highlighted

Re: Spark sql cache partitions

New Contributor

Hi @Borg

 

Thank you for your answer.

I have tried your suggestion, with the unfortunate lack of change compared to the things I tried previously. It simply ignores it, and still creates a single rdd that exceeds the 2G limit.

 

Any way to force it?

Thanks