Processor clock speeds have increased dramatically in recent years, but a CPU must be supplied with a large amount of memory bandwidth to use its processing power effectively.
Even a single CPU running a memory-intensive workload, such as a scientific computing application, can be constrained by memory bandwidth.
This problem is amplified on symmetric multiprocessing (SMP) systems, where many processors must compete for bandwidth on the same system bus.
Non-uniform memory access (NUMA) is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to the processor.
NUMA systems have more than one system bus, so they can harness large numbers of processors in a single system by linking several small, cost-effective nodes with a high-performance connection. Each node contains processors and memory, much like a small SMP system.
However, an advanced memory controller allows a node to use memory on all other nodes, creating a single system image.
When a processor accesses memory that does not lie within its own node (remote memory), the data must be transferred over the NUMA connection, which is slower than accessing local memory. Memory access times are not uniform and depend on the location of the memory and the node from which it is accessed, as the technology’s name implies.
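The cost of non-uniform access can be made concrete with a simple weighted average. The sketch below is only an illustrative model, not a measurement: the function name and the latency figures in the example are hypothetical, and real latencies depend on the hardware and the interconnect.

```python
def average_access_latency_ns(local_ns, remote_ns, remote_fraction):
    """Expected latency per memory access when some accesses go to a remote node.

    local_ns and remote_ns are assumed per-access latencies (hypothetical inputs);
    remote_fraction is the share of accesses served by remote memory (0.0 to 1.0).
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Hypothetical figures: 100 ns local, 150 ns remote, half of all accesses remote.
print(average_access_latency_ns(100, 150, 0.5))  # 125.0
```

In this model, every access that moves from remote to local memory lowers the average, which is the motivation for keeping computation and data on the same node.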
Memory Access Overhead
Spark relies heavily on in-memory processing, so CPU time spent waiting on remote memory accesses is a wasteful stall.
The performance of Spark applications can be improved not only by configuring Spark parameters and JVM options, but also by applying operating-system-level optimizations such as CPU affinity, NUMA policy, and hardware performance policy, which take advantage of recent NUMA-capable hardware.
Spark launches executor JVMs, each with many task worker threads, and the operating system then schedules these worker threads across multiple cores. However, the scheduler does not always keep worker threads on the same NUMA node, so several worker threads are often scheduled to distant cores on remote NUMA nodes. Once a thread moves to a core on another NUMA node, it incurs overhead whenever it accesses data in remote memory. Because Spark's compute-intensive workloads, such as machine learning, repeatedly compute over the same RDD in memory, the total remote memory access overhead is not negligible.
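To see whether a process is free to run on more than one NUMA node, the CPUs it is allowed to use can be compared against the node topology. The following is a minimal sketch assuming a Linux system that exposes per-node CPU lists under /sys/devices/system/node in the plain range format (for example "0-3,8-11"); the function names are made up for illustration.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_numa_topology(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]}; empty if the sysfs directory is not present."""
    topology = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. "node0"
        with open(path) as handle:
            topology[int(node_dir[len("node"):])] = parse_cpulist(handle.read())
    return topology


def nodes_spanned(cpus, topology):
    """Return the sorted node ids that own at least one of the given CPUs."""
    allowed = set(cpus)
    return sorted(node for node, node_cpus in topology.items()
                  if allowed & set(node_cpus))
```

On Linux, calling nodes_spanned(os.sched_getaffinity(0), read_numa_topology()) for an unpinned process would typically list every node, meaning the scheduler is free to move its threads across nodes.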
Setting NUMA-aware affinity for tasks is one well-known approach. It means applying NUMA-aware process affinity to the executor JVMs running on the same system; its effect can then be judged by comparing computation performance across multiple Spark workloads.
Setting NUMA-aware locality for executor JVMs, by enabling core bindings when the executor JVMs are launched, achieves better performance in many Spark applications.
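One way to bind a JVM to a node at launch time is to prefix its command line with numactl, whose --cpunodebind and --membind options restrict CPUs and memory allocation to the given node. The helper below is a hypothetical sketch: it only builds the command, and how such a prefix is wired into the executor launch of a particular Spark deployment is an assumption left open here.

```python
def numa_bound_command(node, command):
    """Prefix a command so that numactl runs it with CPUs and memory on one node.

    node is a NUMA node id; command is the original argument list
    (for example the java invocation of an executor, assumed here).
    """
    if node < 0:
        raise ValueError("node must be a non-negative NUMA node id")
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + list(command)


# Illustrative only: bind a JVM to node 1.
print(" ".join(numa_bound_command(1, ["java", "-version"])))
```

Binding both CPUs and memory matters: pinning only the cores still lets the kernel place pages on another node, which would bring back the remote accesses.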
Simultaneous multi-threading and hardware prefetching are effective ways to hide data access latencies, and the additional latency overhead due to accesses to remote memory can be removed by co-locating computations with the data they access on the same socket.
For example, on a machine with 4 sockets of 8 virtual cores each, starting Spark with executor-cores = 32 creates a single executor JVM that spans every NUMA node, so the wasteful stall is still there. A good trick, in the spirit of starting a separate pinned JVM for each NUMA node and having them talk to each other using Akka, is to start 4 workers per machine, each with executor-cores = 8 instead. You can then pin each of these executors to its own node.
This setup will incur more communication overhead, but will likely be a good trade-off. Spark tries to minimize communication between executors, since they are on different machines in the typical case.
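The sizing rule above (one executor per socket, each with that socket's cores) can be sketched as follows. The function names are hypothetical, the plan assumes one NUMA node per socket with node ids starting at 0, and the environment variables shown apply to Spark standalone mode only; other cluster managers are configured differently.

```python
def plan_executors(sockets, cores_per_socket):
    """Return one executor per socket, each sized to the cores of that socket."""
    if sockets < 1 or cores_per_socket < 1:
        raise ValueError("sockets and cores_per_socket must be positive")
    return [{"numa_node": node, "executor_cores": cores_per_socket}
            for node in range(sockets)]


def standalone_worker_env(sockets, cores_per_socket):
    """Standalone-mode settings (spark-env.sh) matching the plan: N workers of M cores."""
    return {
        "SPARK_WORKER_INSTANCES": str(sockets),
        "SPARK_WORKER_CORES": str(cores_per_socket),
    }


# The 4-socket, 8-virtual-core machine from the example above.
for executor in plan_executors(4, 8):
    print(executor)
print(standalone_worker_env(4, 8))
```

The total core count stays at 32; only the grouping changes, so each JVM's threads and heap can stay within a single node.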
How much performance gain is achievable by co-locating data and computations on the same NUMA node for in-memory data analytics with Spark?