09-14-2015 09:53 AM
In your Spark UI, do you see the job running with a large number of partitions (i.e. a large number of tasks)? If you have only a few partitions, you could be loading all 70G into memory at once. Another possibility is data skew: one huge partition holding 99% of the data alongside many small ones, so when Spark processes that huge partition it loads it all into memory. This can happen when you map to a key-value tuple, e.g. (x, y), and the key (x) is the same for 99% of the records. Check the task sizes in the Spark UI; you will likely see either a small number of tasks overall, or one huge task among many small ones.
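
If it helps, here is a minimal sketch (Scala RDD API) of how you could check for skew and spread the data over more partitions. The SparkContext `sc`, the input path, and the target partition count of 400 are illustrative assumptions, not details from your job:

// A minimal sketch, assuming an existing SparkContext `sc`;
// the input path and the partition count of 400 are placeholders.
val rdd = sc.textFile("hdfs:///data/input")

// Spread the data over more partitions so no single task has to hold
// a large fraction of the 70G in memory at once.
val repartitioned = rdd.repartition(400)

// Count records per partition; one partition with a count far larger than
// the rest indicates skew (one huge task among many small ones).
val sizes = repartitioned
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
sizes.sortBy(- _._2).take(5).foreach { case (idx, n) =>
  println(s"partition $idx has $n records")
}

Note that repartition only helps when the records themselves are spread evenly; if 99% of them share the same key, a groupByKey/reduceByKey on that key will still funnel them into one task, and you would need to rework the key (e.g. add a salt) instead.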