
Small dataset, show() works, count() works, but collect/collectAsList uses over 6Gb of RAM and OOMs.

New Contributor

So we have an OOM issue with a reasonably sized dataset created by joining two others.  This data is needed in full by all executors, so the plan was to build it, collect it, and then broadcast it.  It OOM'd during collect.
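For concreteness, the intended pattern looks roughly like this. The file names, header option, and join column are placeholders, not our real schema — just a sketch of join, collect to the driver, then broadcast:

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.List;

public class CollectAndBroadcast {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("collect-and-broadcast")
                .master("local[*]")
                .getOrCreate();

        // Placeholder inputs; the real job joins two existing Datasets.
        Dataset<Row> left = spark.read().option("header", "true").csv("left.csv");
        Dataset<Row> right = spark.read().option("header", "true").csv("right.csv");
        Dataset<Row> joined = left.join(right, "id");

        // collectAsList() pulls every row into driver heap -- this is the step that OOMs.
        List<Row> rows = joined.collectAsList();

        // Ship a read-only copy of the collected rows to every executor.
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
        Broadcast<List<Row>> shared = jsc.broadcast(rows);

        System.out.println("Broadcast " + shared.value().size() + " rows");
        spark.stop();
    }
}
```

(If the data were only ever used on one side of a join, Spark's `functions.broadcast()` hint would avoid the driver-side collect entirely, but here the executors need the full list directly.)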

 

On a development machine I took just the first, smaller dataset, loaded from a 30 MB CSV file with 29,000 rows.  An array column is exploded to give 210,000 rows.  Calling count() works fine.  Calling show() works fine too, although truncated.  At this stage the JVM has allocated 1 GB of RAM.

 

However, calling collect() or collectAsList() makes the JVM allocate the full 6 GB assigned to it and then OOM with "Java heap space".

 

To me it feels as if something is VERY wrong if Spark can't collect 210,000 rows of data (which as a CSV file would amount to around 210 MB) when given 6 GB of RAM.  Especially when it was just able to count() and show() said Dataset in the same process with less than 1 GB of RAM.

 

This is before we convert the generic Dataset&lt;Row&gt; to any serialisable class, such as Dataset&lt;Pojo&gt;.

 

How can it possibly take that much RAM to collect a dataset that fits into well under 1 GB of Spark memory?
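A rough back-of-the-envelope supports this. The column count and average field size below are assumptions, not measurements, and the per-object sizes are typical 64-bit JVM figures — but even with generous padding for boxed fields, the collected rows should land nowhere near 6 GB:

```java
// Back-of-the-envelope estimate of driver heap needed to hold collected
// rows as JVM objects. All sizes below are assumptions, not measurements.
public class CollectOverheadEstimate {

    /** Rough heap bytes for `rows` collected Rows of `fieldsPerRow` String fields. */
    public static long estimatedHeapBytes(long rows, int fieldsPerRow, long avgFieldBytes) {
        long objectHeader = 16;                // typical 64-bit JVM object header
        long reference = 8;                    // one slot per field in the Row's backing array
        long boxedOverhead = objectHeader + 8; // header + bookkeeping per String field
        long perRow = objectHeader             // the Row object itself
                + (long) fieldsPerRow * reference
                + (long) fieldsPerRow * (boxedOverhead + avgFieldBytes * 2); // chars are 2 bytes
        return rows * perRow;
    }

    public static void main(String[] args) {
        long rows = 210_000;   // rows after the explode
        int fields = 20;       // assumed column count
        long fieldBytes = 50;  // assumed average field size in the CSV
        long csvMb = rows * fields * fieldBytes / (1024 * 1024);
        long heapMb = estimatedHeapBytes(rows, fields, fieldBytes) / (1024 * 1024);
        System.out.println("CSV ~" + csvMb + " MB, collected objects ~" + heapMb + " MB");
    }
}
```

Under those assumptions the object-form estimate comes out at only a few times the CSV size — a long way short of the 6 GB the driver actually allocates.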

1 REPLY

Re: Small dataset, show() works, count() works, but collect/collectAsList uses over 6Gb of RAM and OOMs.

New Contributor
So, small update.

The input CSV file is actually only 9 MB.

If I just load it, split one column into an array&lt;string&gt;, and call

show(10000, false)

then Spark allocates the full 6 GB and OOMs.

This can't be normal.