Created on 12-18-201610:17 PM - edited 08-17-201907:15 AM
Background
Tungsten became the default in Spark 1.5 and can be enabled in earlier versions by setting spark.sql.tungsten.enabled to true (or disabled in later versions by setting this to false). Even without Tungsten, Spark SQL uses a columnar storage format with Kryo serialization to minimize storage cost.
Goal
The goal of Project Tungsten is to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency (as opposed to network and disk I/O which are considered fast enough).
Scope
Tungsten focuses on the hardware architecture of the platform Spark runs on, including but not limited to JVM, LLVM, GPU, NVRAM, etc.
Optimization Features
Off-Heap Memory Management using binary in-memory data representation aka Tungsten row format and managing memory explicitly,
Cache Locality which is about cache-aware computations with cache-aware layout for high cache hit rates,
Whole-Stage Code Generation (aka CodeGen).
Design Improvements
Tungsten includes specialized in-memory data structures
tuned for the type of operations required by Spark, improved code generation,
and a specialized wire protocol.
Tungsten’s representation is substantially smaller than
objects serialized using Java or even Kryo serializers.
As Tungsten does not depend on Java objects, both on-heap
and off-heap allocations are supported. Not only is the format more compact, serialization times can
be substantially faster than with native serialization. Since Tungsten no longer depends on working with Java
objects, you can use either on-heap (in the JVM) or off-heap storage. If you
use off-heap storage, it is important to leave enough room in your containers
for the off-heap allocations - which you can get an approximate idea for from
the web ui.
Tungsten’s data structures are also created closely in mind
with the kind of processing for which they are used. The classic example of this is with sorting, a common and
expensive operation. The on-wire representation is implemented so that sorting
can be done without having to deserialize the data again.
By
avoiding the memory and GC overhead of regular Java objects, Tungsten is able
to process larger data sets than the same hand-written aggregations.
Benefits
The following Spark jobs will benefit from Tungsten:
Dataframes: Java, Scala, Python, R
SparkSQL queries
Some RDD API programs via general serialization and compression optimizations
Next Steps
In the future Tungsten may make it more feasible to use certain non-JVM libraries. For many simple operations the cost of using BLAS, or similar linear algebra packages, from the JVM is dominated by the cost of copying the data off-heap.