09-14-2015 09:53 AM
In your Spark UI, do you see the job running with a large number of partitions (i.e. a large number of tasks)? If you have only a few partitions, you could be loading all 70 GB into memory at once. Alternatively, you may have one huge partition holding 99% of the data alongside lots of small ones; when Spark processes that huge partition, it will load all of it into memory. This can happen if you are mapping to a tuple, e.g. (x, y), and the key (x) is the same for 99% of the records. Have a look at the task sizes in your Spark UI: you will likely see either a small number of tasks, or one huge task and many small ones.
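As a rough complement to the Spark UI, here is a minimal sketch of how you might check the partition count and per-partition record counts yourself and repartition if the data is too coarse or skewed. The object name, input path, and the repartition count of 400 are all illustrative assumptions, not from the original job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-check"))

    // Hypothetical input path; replace with your ~70 GB dataset.
    val rdd = sc.textFile("hdfs:///data/input")

    // How many partitions (and therefore tasks) will Spark use?
    println(s"Partitions: ${rdd.partitions.length}")

    // Per-partition record counts: one huge partition with 99% of the
    // records (a skewed key) shows up immediately here.
    val counts = rdd
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()
      .sortBy(-_._2)
    counts.take(5).foreach { case (idx, n) =>
      println(s"partition $idx -> $n records")
    }

    // If there are too few partitions, spread the data out before the heavy
    // work so no single task has to hold most of it in memory.
    val spread = rdd.repartition(400) // illustrative value; tune for your cluster

    sc.stop()
  }
}
```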
09-14-2015 09:01 AM
As mentioned in existing posts, you can run Spark 1.4 and 1.5 on Cloudera 5.4 and it will mostly (if not completely) work. What is Cloudera's stance on supporting this? Will Cloudera provide any Spark support to a company that runs a newer Spark version on Cloudera 5.4?
Labels: Apache Spark