So we have an OOM issue with a reasonably sized dataset that is created by joining two others. This data is needed in full by every executor, so the plan was to build it, collect() it on the driver, and then broadcast it. It OOM'd during the collect.
On a development machine I reproduced this with just the first, smaller dataset, loaded from a 30 MB CSV file with 29,000 rows. An array column is exploded to give 210,000 rows. Calling count() works fine. Calling show() works fine too, albeit with truncated output. At this stage the JVM has allocated 1 GB of RAM.
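For reference, a minimal sketch of the repro. The file name and column names ("items", "item") are placeholders, not the real schema:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.split;

public class CollectOomRepro {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("collect-oom-repro")
                .getOrCreate();

        // ~30 MB CSV, ~29,000 rows; "items" stands in for the real array field,
        // stored here as a comma-delimited string for illustration
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .csv("first_dataset.csv");

        // Exploding the array column yields ~210,000 rows
        Dataset<Row> exploded = raw.withColumn("item",
                explode(split(col("items"), ",")));

        exploded.count();         // fine, JVM at ~1 GB
        exploded.show();          // fine (truncated)
        exploded.collectAsList(); // JVM allocates the full 6 GB, then OOMs
    }
}
```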
However, calling collect() or collectAsList() makes the JVM allocate the full 6 GB assigned to it and then OOM with "Java heap space".
To me it feels as if something is VERY wrong if Spark can't collect 210,000 rows of data (which, as a CSV file, would amount to around 210 MB) when given 6 GB of RAM. Especially when it was just able to count() and show() that same Dataset in the same process with less than 1 GB of RAM.
This is before we convert the generic Dataset&lt;Row&gt; to any serialisable class such as Dataset&lt;Pojo&gt;.
How can it possibly take that much RAM to collect a Dataset that occupies far less than 1 GB in Spark's own memory?