09-14-2015 09:53 AM
In your Spark UI, do you see the job running with a large number of partitions (i.e. a large number of tasks)? If you have only a few partitions, you could be loading all 70G into memory at once. Another possibility is data skew: one huge partition holding 99% of the data alongside many small ones, so when Spark processes that huge partition it loads it all into memory. This can happen when you map to a key-value tuple, e.g. (x, y), and the key (x) is the same for 99% of the records. Check the task sizes in the Spark UI; you will likely see either a small number of tasks overall, or one huge task among many small ones.
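
If it helps, here is a minimal sketch (Scala RDD API) of how you could check for skew and spread the data over more partitions. The SparkContext `sc`, the input path, and the target partition count of 400 are illustrative assumptions, not details from your job:

// A minimal sketch, assuming an existing SparkContext `sc`;
// the input path and the partition count of 400 are placeholders.
val rdd = sc.textFile("hdfs:///data/input")

// Spread the data over more partitions so no single task has to hold
// a large fraction of the 70G in memory at once.
val repartitioned = rdd.repartition(400)

// Count records per partition; one partition with a count far larger than
// the rest indicates skew (one huge task among many small ones).
val sizes = repartitioned
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .collect()
sizes.sortBy(- _._2).take(5).foreach { case (idx, n) =>
  println(s"partition $idx has $n records")
}

Note that repartition only helps when the records themselves are spread evenly; if 99% of them share the same key, a groupByKey/reduceByKey on that key will still funnel them into one task, and you would need to rework the key (e.g. add a salt) instead.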