
Spark Cluster - Difficulty in getting spark sessions at the same time.

I have a single-node cluster configured with the Cloudera distribution (CDH/Spark). When my application submits a job to the Spark cluster (via Livy), execution starts and runs for 15 minutes. I have noticed that if I open a spark-shell session during those 15 minutes, I don't get a handle to the Spark session. My guess is that multiple users cannot obtain resources from the Spark cluster at the same time. Could you please suggest what needs to be done to resolve this issue?

Re: Spark Cluster - Difficulty in getting spark sessions at the same time.

With a single node, you lack the resources to run the concurrent containers that YARN apps require.

Note that most YARN-based apps (Spark, MR2) effectively consume at least two containers' worth of resources to execute fully: one container for the Application Master, and another for an Executor (Spark) or Task (MR2).

Your single NodeManager publishes only a limited amount of resources that containers can consume. For example, with 4 vCores and 16 GiB of memory, a single running Spark application that requests 2 vCores for the App Master and 1 vCore for the Executor consumes 3 vCores, leaving just 1 vCore free for other containers. When a second app is submitted, the 2 vCores required for its App Master cannot be allocated, so it waits for resources until the first application releases what it occupies.
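The arithmetic in this example can be sketched as a small simulation. This is a simplified toy model, not how the YARN scheduler is actually implemented: it tracks vCores only and assumes each application needs one Application Master container followed by one Executor container. The class, function, and state names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy model of a single NodeManager; only vCores are tracked."""
    total_vcores: int
    used_vcores: int = 0

    @property
    def free_vcores(self) -> int:
        return self.total_vcores - self.used_vcores

    def try_allocate(self, vcores: int) -> bool:
        """Reserve vCores for one container if enough are free."""
        if vcores > self.free_vcores:
            return False
        self.used_vcores += vcores
        return True


def submit_app(node: Node, am_vcores: int = 2, executor_vcores: int = 1) -> str:
    """Return the state a newly submitted app ends up in (hypothetical labels)."""
    # The Application Master container has to be placed first.
    if not node.try_allocate(am_vcores):
        return "WAITING_FOR_AM"
    # The app only does useful work once an executor container is placed too.
    if not node.try_allocate(executor_vcores):
        return "WAITING_FOR_EXECUTOR"
    return "RUNNING"


node = Node(total_vcores=4)
first = submit_app(node)   # takes 2 + 1 = 3 vCores
second = submit_app(node)  # needs 2 vCores for its AM, but only 1 is free
print(first, second, node.free_vcores)  # RUNNING WAITING_FOR_AM 1
```

In this model the second submission never gets past its Application Master request, which matches the spark-shell session that appears to hang until the Livy job finishes.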

If multi-user access is truly your goal here, then consider adding more nodes to your cluster. This problem does not usually appear in a proper (3-5+ hosts) cluster.

You can also work around the limitation on your 1-node cluster by having the NodeManager advertise a false, larger amount of resources, such as a hundred vCores with a corresponding memory increase. This will let you achieve the concurrency, but it will sacrifice performance and may make your host unresponsive because it accepts more applications than it can handle.
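To get a rough feel for this trade-off, the sketch below estimates how many applications could run at once for a given advertised capacity. It is a back-of-the-envelope model: the per-application figures (3 vCores and 4 GiB) and the 400 GiB inflated memory value are illustrative assumptions, and the function name is hypothetical. On a stock YARN setup the advertised values are, to my knowledge, controlled by `yarn.nodemanager.resource.cpu-vcores` and `yarn.nodemanager.resource.memory-mb`; check how your CDH version exposes them before changing anything.

```python
def max_concurrent_apps(advertised_vcores: int, advertised_memory_mb: int,
                        app_vcores: int = 3, app_memory_mb: int = 4096) -> int:
    """Estimate how many apps fit; the scarcer resource sets the limit.

    app_vcores and app_memory_mb are assumed per-application totals
    (Application Master plus one Executor), chosen for illustration only.
    """
    by_cpu = advertised_vcores // app_vcores
    by_memory = advertised_memory_mb // app_memory_mb
    return min(by_cpu, by_memory)


# Honest figures from the example above: 4 vCores, 16 GiB.
honest = max_concurrent_apps(4, 16 * 1024)

# Inflated figures: 100 vCores and an assumed 400 GiB of memory.
# The host still has only its real CPU and RAM, so all of these
# apps would compete for the same physical hardware.
inflated = max_concurrent_apps(100, 400 * 1024)

print(honest, inflated)  # 1 33
```

The jump from 1 to 33 accepted applications is exactly what makes the workaround risky: the scheduler will admit them all, but the machine underneath has not grown.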