What is the difference between spark_shuffle & spark2_shuffle in yarn.nodemanager.aux-services?

Explorer

spark_shuffle and spark2_shuffle are entries that can be added to yarn.nodemanager.aux-services so that the YARN NodeManager starts Spark's external shuffle service as an auxiliary service. What is the difference between these two?

1 ACCEPTED SOLUTION

Master Mentor

@allen_chu 
Maybe I didn't understand the question fully, but here are the differences between the two and how to configure each of them correctly.

1. Difference Between spark_shuffle and spark2_shuffle

spark_shuffle

  • Used for Apache Spark 1.x versions.
  • Refers to the shuffle service for older Spark releases that rely on the original shuffle mechanism.
  • Declared in the YARN NodeManager configuration (yarn-site.xml):

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
</property>

spark2_shuffle

  • Introduced for Apache Spark 2.x and later versions.
  • Handles shuffle operations for newer Spark versions, which have an updated shuffle mechanism with better performance and scalability.
  • Declared similarly in yarn-site.xml:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark2_shuffle</value>
</property>

2. Why Two Shuffle Services?

  • Backward Compatibility: spark_shuffle is retained for Spark 1.x jobs to continue running without modifications.
  • Separate Service: spark2_shuffle ensures that jobs running on Spark 2.x+ use an optimized and compatible shuffle service without interfering with Spark 1.x jobs.
  • Upgrade Path: In clusters supporting multiple Spark versions, both shuffle services may coexist to support jobs submitted using Spark 1.x and Spark 2.x simultaneously.

3. Configuration in YARN

To enable the shuffle service for both Spark versions, configure the NodeManager to start both Spark shuffle services alongside mapreduce_shuffle:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle,spark2_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark2_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
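
Note that both entries point at the same class name, org.apache.spark.network.yarn.YarnShuffleService. What keeps the two services separate in HDP-style deployments is that each aux service gets its own classpath, so each loads the yarn-shuffle jar shipped with its own Spark version, and when both run on the same NodeManager they must listen on different ports. The property names below follow the per-aux-service classpath pattern supported by HDP and recent Hadoop releases; the jar locations are only illustrative and depend on your distribution and version:

<!-- Illustrative paths: point each service at the yarn-shuffle jar of its own Spark version -->
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
    <value>/usr/hdp/current/spark-client/aux/*</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark2_shuffle.classpath</name>
    <value>/usr/hdp/current/spark2-client/aux/*</value>
</property>

On the application side, each Spark version then needs spark.shuffle.service.enabled=true and a spark.shuffle.service.port that matches the port its shuffle service listens on.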

 

4. Key Points

  • Use spark_shuffle for jobs running with Spark 1.x.
  • Use spark2_shuffle for jobs running with Spark 2.x or later.

In modern setups, spark2_shuffle is typically the only Spark shuffle service still in use, since Spark 1.x is end of life.
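
For reference, upstream Apache Spark's YARN documentation registers its external shuffle service under the single name spark_shuffle, so a cluster that only runs Spark 2.x/3.x jobs can get by with a minimal yarn-site.xml along these lines (a sketch, assuming the spark-<version>-yarn-shuffle jar is already on the NodeManager classpath):

<!-- Register Spark's external shuffle service next to the MapReduce one -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

Applications opt in with spark.shuffle.service.enabled=true (usually together with spark.dynamicAllocation.enabled=true), and the NodeManagers must be restarted after the change.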

Happy hadooping

Explorer

@Shelton 

Thank you for your reply. This information is very helpful.