Community Articles

cstanca · ‎12-18-2016

Background

Tungsten became the default in Spark 1.5 and can be enabled in earlier versions by setting spark.sql.tungsten.enabled to true (or disabled in later versions by setting this to false). Even without Tungsten, Spark SQL uses a columnar storage format with Kryo serialization to minimize storage cost.

Goal

The goal of Project Tungsten is to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency (as opposed to network and disk I/O which are considered fast enough).

Scope

Tungsten focuses on the hardware architecture of the platform Spark runs on, including but not limited to JVM, LLVM, GPU, NVRAM, etc.

Optimization Features

Off-Heap Memory Management using binary in-memory data representation aka Tungsten row format and managing memory explicitly,
Cache Locality which is about cache-aware computations with cache-aware layout for high cache hit rates,
Whole-Stage Code Generation (aka CodeGen).

Design Improvements

Tungsten includes specialized in-memory data structures tuned for the type of operations required by Spark, improved code generation, and a specialized wire protocol.
Tungsten’s representation is substantially smaller than objects serialized using Java or even Kryo serializers.
As Tungsten does not depend on Java objects, both on-heap and off-heap allocations are supported. Not only is the format more compact, serialization times can be substantially faster than with native serialization. Since Tungsten no longer depends on working with Java objects, you can use either on-heap (in the JVM) or off-heap storage. If you use off-heap storage, it is important to leave enough room in your containers for the off-heap allocations - which you can get an approximate idea for from the web ui.
Tungsten’s data structures are also created closely in mind with the kind of processing for which they are used. The classic example of this is with sorting, a common and expensive operation. The on-wire representation is implemented so that sorting can be done without having to deserialize the data again.
By avoiding the memory and GC overhead of regular Java objects, Tungsten is able to process larger data sets than the same hand-written aggregations.

Benefits

The following Spark jobs will benefit from Tungsten:

Dataframes: Java, Scala, Python, R
SparkSQL queries
Some RDD API programs via general serialization and compression optimizations

Next Steps

In the future Tungsten may make it more feasible to use certain non-JVM libraries. For many simple operations the cost of using BLAS, or similar linear algebra packages, from the JVM is dominated by the cost of copying the data off-heap.

References:

Project Tungsten: Bringing Apache Spark Closer to Bare Metal
High Performance Spark by Holden Karau; Rachel Warren
Slides: Deep Dive Into Project Tungsten - Josh Rosen
Video: Deep Dive into Project Tungsten Bringing Spark Closer to Bare Metal -Josh Rosen (Databricks)

Cloudera Community

Community Articles

What is Tungsten for Apache Spark?

Apache Spark

Background

Goal

Scope

Optimization Features

Design Improvements

Benefits

Next Steps

References:

Spark Geospatial with Apache Sedona in Cloudera Da...

Parsing Apache Log Files with Spark

Apache Zeppelin (Hive & Spark Demo)

Apache Spark - Apache HBase Connector

HDF 3.1: Executing Apache Spark via ExecuteSparkIn...

Horses for Courses: Apache Spark Streaming and Apa...

HDP 2.6.4 - HDF 3.1: Apache Spark Streaming Inte...

[ANNOUNCE] Updates to the dbt adapters for Apache ...

Data Ingest with Apache Zeppelin + Apache Spark 1....

HDP 2.6.4 - HDF 3.1: Apache Kafka - Apache Spark S...