
Spark Deployment and Hardware Provisioning


Is it a good idea to have separate nodes just for Spark and mark them with node labels? I'm not sure if this is best practice. I can see why, since according to http://spark.apache.org/docs/latest/hardware-prov... Spark should be allocated at most 75% of a machine's memory. But with YARN this may not be needed, right?

1 ACCEPTED SOLUTION


If you are running Spark applications on a YARN cluster, then you do not need to directly allocate memory or machines to it.

You can dedicate machines via labels, either for exclusive workloads or to handle heterogeneous hardware better. If there is some application where latency and the ability to respond immediately to spikes in load matter, then dedicated labels work; for example, HBase in interactive applications. If different parts of the cluster have different hardware configurations (for example RAM, GPUs, or SSDs for local storage), then labels help you schedule jobs which need those features so they only execute on those machines.
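As a rough sketch of what that looks like (the label and host names below are made up, and the exact -addToClusterNodeLabels syntax varies slightly between Hadoop versions):

    # register an exclusive "spark" label with the ResourceManager
    yarn rmadmin -addToClusterNodeLabels "spark(exclusive=true)"

    # attach the label to the hosts you want to dedicate
    yarn rmadmin -replaceLabelsOnNode "node01=spark node02=spark"

    # confirm the labels are known to the cluster
    yarn cluster --list-node-labels

A queue still has to be given access to the label (the capacity scheduler's accessible-node-labels and default-node-label-expression settings) before jobs submitted to it will land on those hosts.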

Once you start using labels, the labelled hosts will be underutilized whenever that specific work isn't running; that is the permanent tradeoff.

If you are just running queries on a cluster, and latency isn't so critical that you want to pre-allocate capacity on isolated machines, then using queues is more efficient.

You can also set up queue priorities and pre-emption, so your important Spark queries can actually pre-empt (i.e. kill) ongoing work from lower-priority applications.
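Roughly, preemption is enabled on the ResourceManager and the queue shares and priorities live in the capacity scheduler configuration; the queue name "spark" and the numbers below are made up, and queue priority requires a reasonably recent Hadoop release:

    # yarn-site.xml: turn on the preemption monitor
    #   yarn.resourcemanager.scheduler.monitor.enable = true
    #   yarn.resourcemanager.scheduler.monitor.policies =
    #     org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
    #
    # capacity-scheduler.xml: give the (hypothetical) spark queue a guaranteed share
    # and a higher priority than the default queue
    #   yarn.scheduler.capacity.root.spark.capacity = 40
    #   yarn.scheduler.capacity.root.spark.maximum-capacity = 80
    #   yarn.scheduler.capacity.root.spark.priority = 10

    # push the queue changes to the ResourceManager without a restart
    yarn rmadmin -refreshQueues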

What is important for Spark is having your jobs ask for the memory they really need: Spark likes a lot, and if the Spark JVM/Python code consumes more than was allocated to it in the YARN container requests, the processes may get killed.
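As an illustration only, with made-up sizes, queue name and script (on older Spark releases the overhead setting is spark.yarn.executor.memoryOverhead):

    # ask YARN for the memory the executors really need, plus off-heap headroom
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --queue spark \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 8g \
      --conf spark.executor.memoryOverhead=1g \
      my_job.py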


5 REPLIES

Master Mentor
@Ancil McBarnett

If the cluster is going to be heavily used for Spark, then it is definitely a good idea to allocate dedicated resources to the Spark components. Also, make sure that the Spark client has enough memory too.
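For example (the value and script name are placeholders), in client deploy mode the driver runs on the submitting machine, so it helps to size its heap explicitly:

    # give the driver/client JVM enough heap for planning and collecting results
    spark-submit --master yarn --deploy-mode client --driver-memory 4g my_job.py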


I must disagree. Dedicating machines via labels is not always the right choice. Imagine you give 20 nodes in a 100-node cluster the label "spark", with only spark-queue work able to run on them. When there's no work on that queue, the machines are idle. When there is work in the queue, it will only get run on those 20 nodes.

There's also replication and data locality to consider: if the data you need isn't on one of those 20 nodes, it will have to be read remotely, which can also hurt performance.

You really need to look at the cluster and workload to make a good choice.

Master Mentor

@stevel I agree with you and that's why I did not mention labeling.



Well, just to add: you need one machine for the Spark History Server, or you can colocate it with other master components, e.g. the MapReduce History Server.
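For reference, a rough sketch of that wiring (the HDFS path, host name and port are placeholders; the settings normally live in spark-defaults.conf):

    # spark-defaults.conf: jobs write event logs where the history server reads them
    #   spark.eventLog.enabled            true
    #   spark.eventLog.dir                hdfs:///spark-history
    #   spark.history.fs.logDirectory     hdfs:///spark-history
    #   spark.yarn.historyServer.address  historyhost:18080

    # start the daemon on whichever master node hosts it
    $SPARK_HOME/sbin/start-history-server.sh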