spark job shuffle write super slow


76637-untitled2.png

76646-untitled2.png

Why is the Spark shuffle stage so slow for a 1.6 MB shuffle write and a 2.4 MB input? Also, why is the shuffle write happening on only one executor? I am running a 3-node cluster with 8 cores per node.

Please see my code and the Spark UI pictures below.

Code:

<code>// Key each input line by its Elasticsearch index name.
// Empty lines and lines that throw in getESIndexName all fall back to the "" key.
JavaPairRDD<String, String> javaPairRDD = c.mapToPair(new PairFunction<String, String, String>() {
    @Override
    public Tuple2<String, String> call(String arg0) throws Exception {
        try {
            if (org.apache.commons.lang.StringUtils.isEmpty(arg0)) {
                return new Tuple2<String, String>("", "");
            }
            return new Tuple2<String, String>(getESIndexName(arg0), arg0);
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("******* exception in getESIndexName");
        }
        return new Tuple2<String, String>("", "");
    }
});

// Group all lines per index name, then collect the whole result to the driver.
java.util.Map<String, Iterable<String>> map1 = javaPairRDD.groupByKey().collectAsMap();</code>

76636-untitled1.png

2 REPLIES


Hey @pradeep arumalla!
I'm not a specialist in coding or Spark, but have you tried replacing groupByKey with reduceByKey (on the last line)?

And about the executors (--num-executors): how are you launching your job? Is it with spark-submit? Could you share the command with us?
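
To illustrate why that swap can shrink a shuffle: groupByKey sends every record across the network, while reduceByKey first combines values per key inside each partition. The sketch below is plain Java (no Spark) that only counts how many records each strategy would shuffle. The class name, the sample keys, and the count-per-key aggregation are assumptions made up for illustration. It helps only when the values can actually be combined; if you need every original line per key, as in the posted code, the shuffled volume stays the same.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Spark or of the original post.
public class ShuffleVolumeSketch {

    // groupByKey-style: every record leaves its partition unchanged.
    public static int recordsShuffledWithoutCombine(List<List<String>> partitions) {
        int total = 0;
        for (List<String> partition : partitions) {
            total += partition.size();
        }
        return total;
    }

    // reduceByKey-style: each partition first combines its records per key
    // (here into a count), so it ships one record per distinct key.
    public static int recordsShuffledWithCombine(List<List<String>> partitions) {
        int total = 0;
        for (List<String> partition : partitions) {
            Map<String, Integer> combined = new HashMap<>();
            for (String key : partition) {
                combined.merge(key, 1, Integer::sum);
            }
            total += combined.size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Two partitions holding made-up index names as keys.
        List<List<String>> partitions = new ArrayList<>();
        partitions.add(Arrays.asList("logs-a", "logs-a", "logs-a", "logs-b"));
        partitions.add(Arrays.asList("logs-a", "logs-a", "logs-b", "logs-b"));

        System.out.println("without combine: " + recordsShuffledWithoutCombine(partitions)); // 8
        System.out.println("with combine:    " + recordsShuffledWithCombine(partitions));    // 4
    }
}
```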


BTW: here are some links about shuffling 🙂
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-shuffle.html
https://0x0fff.com/spark-architecture-shuffle/

Hope this helps!


This looks like a data skew issue: the keys in your groupByKey are unevenly distributed, which leaves the data unbalanced between partitions. Inspect your key distribution first. If the skew is real, you need to change the key or add a salt to the group-by key so the data is distributed evenly.
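
A minimal sketch of both steps, in plain Java without Spark: count records per key to check for skew, then see how salting changes the partition assignment. It assumes keys are assigned to partitions by a non-negative `hashCode` modulo the partition count, which is how Spark's default `HashPartitioner` works. The class name, the sample key, the `#` separator, the salt count of 8, and the 24 partitions (3 nodes x 8 cores) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Spark or of the original post.
public class SaltingSketch {

    // Non-negative hashCode modulo, mirroring hash partitioning.
    public static int partitionFor(String key, int numPartitions) {
        int mod = key.hashCode() % numPartitions;
        return mod < 0 ? mod + numPartitions : mod;
    }

    // Step 1: inspect the key distribution by counting records per key.
    public static Map<String, Integer> countPerKey(List<String> keys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Step 2: append a salt so one hot key becomes numSalts distinct keys.
    // The salt is derived from the record index to keep the sketch deterministic.
    public static String saltedKey(String key, int recordIndex, int numSalts) {
        return key + "#" + (recordIndex % numSalts);
    }

    // Which partitions receive at least one record for the given keys.
    public static Set<Integer> partitionsUsed(List<String> keys, int numPartitions) {
        Set<Integer> used = new HashSet<>();
        for (String key : keys) {
            used.add(partitionFor(key, numPartitions));
        }
        return used;
    }

    public static void main(String[] args) {
        int numPartitions = 24;
        List<String> plain = new ArrayList<>();
        List<String> salted = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            plain.add("logs-2018");                     // every record shares one key
            salted.add(saltedKey("logs-2018", i, 8));
        }
        System.out.println("records per key:          " + countPerKey(plain));
        System.out.println("partitions used (plain):  " + partitionsUsed(plain, numPartitions).size());
        System.out.println("partitions used (salted): " + partitionsUsed(salted, numPartitions).size());
    }
}
```

With a single hot key, every record hashes to the same partition, so one task does all the work. Salted keys spread across several partitions, at the cost of a second step that strips the salt and merges the partial groups back under the original key.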