What is the difference between spark_shuffle & spark2_shuffle in yarn.nodemanager.aux-services?

Explorer

spark_shuffle and spark2_shuffle are entries that can be added to yarn.nodemanager.aux-services so that the YARN NodeManager starts Spark's external shuffle service as an auxiliary service. What is the difference between these two?

1 ACCEPTED SOLUTION

Master Mentor

@allen_chu 
Maybe I didn't understand the question fully, but here are the differences between the two and how to configure each of them correctly.

1. Difference Between spark_shuffle and spark2_shuffle

spark_shuffle

  • Used for Apache Spark 1.x versions.
  • Refers to the shuffle service for older Spark releases that rely on the original shuffle mechanism.
  • Declared in the YARN NodeManager configuration (yarn-site.xml):

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
</property>

spark2_shuffle

  • Introduced for Apache Spark 2.x and later versions.
  • Handles shuffle operations for newer Spark versions, which have an updated shuffle mechanism with better performance and scalability.
  • Declared similarly in yarn-site.xml:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark2_shuffle</value>
</property>

2. Why Two Shuffle Services?

  • Backward Compatibility: spark_shuffle is retained for Spark 1.x jobs to continue running without modifications.
  • Separate Service: spark2_shuffle ensures that jobs running on Spark 2.x+ use an optimized and compatible shuffle service without interfering with Spark 1.x jobs.
  • Upgrade Path: In clusters supporting multiple Spark versions, both shuffle services may coexist to support jobs submitted using Spark 1.x and Spark 2.x simultaneously.

3. Configuration in YARN

To enable the shuffle service for both Spark versions, configure the NodeManager to start both Spark shuffle services alongside mapreduce_shuffle:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle,spark2_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark2_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
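
Note that both entries point at the same class name, org.apache.spark.network.yarn.YarnShuffleService. What keeps the two services separate in HDP-style deployments is that each aux service gets its own classpath, so each loads the yarn-shuffle jar shipped with its own Spark version, and when both run on the same NodeManager they must listen on different ports. The property names below follow the per-aux-service classpath pattern supported by HDP and recent Hadoop releases; the jar locations are only illustrative and depend on your distribution and version:

<!-- Illustrative paths: point each service at the yarn-shuffle jar of its own Spark version -->
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
    <value>/usr/hdp/current/spark-client/aux/*</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark2_shuffle.classpath</name>
    <value>/usr/hdp/current/spark2-client/aux/*</value>
</property>

On the application side, each Spark version then needs spark.shuffle.service.enabled=true and a spark.shuffle.service.port that matches the port its shuffle service listens on.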

 

4. Key Points

  • Use spark_shuffle for jobs running with Spark 1.x.
  • Use spark2_shuffle for jobs running with Spark 2.x or later.

In modern setups, spark2_shuffle is typically the only Spark shuffle service still in use, since Spark 1.x is end of life.
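
For reference, upstream Apache Spark's YARN documentation registers its external shuffle service under the single name spark_shuffle, so a cluster that only runs Spark 2.x/3.x jobs can get by with a minimal yarn-site.xml along these lines (a sketch, assuming the spark-<version>-yarn-shuffle jar is already on the NodeManager classpath):

<!-- Register Spark's external shuffle service next to the MapReduce one -->
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

Applications opt in with spark.shuffle.service.enabled=true (usually together with spark.dynamicAllocation.enabled=true), and the NodeManagers must be restarted after the change.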

Happy hadooping

Explorer

@Shelton 

Thank you for your reply. This information is very helpful.