
spark job shuffle write super slow


New Contributor

[Spark UI screenshots: 76637-untitled2.png, 76646-untitled2.png]

Why is the Spark shuffle stage so slow for a 1.6 MB shuffle write and 2.4 MB input? Also, why is the shuffle write happening on only one executor? I am running a 3-node cluster with 8 cores each.

Please see my code and Spark UI pictures below

Code:

JavaPairRDD<String, String> javaPairRDD = c.mapToPair(new PairFunction<String, String, String>() {
    @Override
    public Tuple2<String, String> call(String arg0) throws Exception {
        try {
            if (org.apache.commons.lang.StringUtils.isEmpty(arg0)) {
                return new Tuple2<String, String>("", "");
            }
            // Key each record by its target Elasticsearch index name.
            return new Tuple2<String, String>(getESIndexName(arg0), arg0);
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("******* exception in getESIndexName");
        }
        return new Tuple2<String, String>("", "");
    }
});

java.util.Map<String, Iterable<String>> map1 = javaPairRDD.groupByKey().collectAsMap();

[Spark UI screenshot: 76636-untitled1.png]

2 Replies

Re: spark job shuffle write super slow

Hey @pradeep arumalla!
I'm not a specialist in coding or Spark, but have you tried changing your groupByKey to reduceByKey (on the last line)?
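To show why that swap usually helps, here's a minimal plain-Java sketch (no Spark, and not the poster's code) of the difference: reduceByKey merges values per key inside each partition before the shuffle, so only one record per distinct key per partition crosses the network, while groupByKey shuffles every record.

```java
import java.util.*;

public class CombineDemo {
    // Build a "partition" of (key, 1) pairs, as in a simple count job.
    public static List<Map.Entry<String, Integer>> partition(String... keys) {
        List<Map.Entry<String, Integer>> p = new ArrayList<>();
        for (String k : keys) p.add(new AbstractMap.SimpleEntry<>(k, 1));
        return p;
    }

    // Map-side combine: merge values per key within one partition first,
    // the way reduceByKey does before shuffling.
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> part) {
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, Integer> e : part)
            out.merge(e.getKey(), e.getValue(), Integer::sum);
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> p1 = partition("a", "a", "a", "b");
        List<Map.Entry<String, Integer>> p2 = partition("a", "b", "b", "b");

        // groupByKey would shuffle every record: 8 in total.
        int groupByKeyShuffled = p1.size() + p2.size();

        // reduceByKey shuffles one record per distinct key per partition: 4.
        int reduceByKeyShuffled = combine(p1).size() + combine(p2).size();

        System.out.println(groupByKeyShuffled + " vs " + reduceByKeyShuffled);  // 8 vs 4
    }
}
```

The saving here is small, but on skewed real data the record count per key is huge, so the map-side combine cuts shuffle write dramatically.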

And about the executors: how are you launching your job? Is it via spark-submit, and with what --num-executors value? Could you share the command with us?


BTW, here are some links about shuffling :)
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-shuffle.html
https://0x0fff.com/spark-architecture-shuffle/

Hope this helps!

Re: spark job shuffle write super slow

New Contributor

This looks like a data-skew issue: your group-by key is skewed, resulting in unbalanced data between partitions. Inspect your key distribution; if the skew is real, you need to change the key or add a salt to the group-by so the data is distributed evenly.
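To make the salting idea concrete, here's a minimal plain-Java sketch (no Spark; the key name and salt count are made up for illustration): appending a random salt to a hot key splits it into several distinct shuffle keys, so the work spreads across partitions, and a second pass strips the salt to merge the per-salt results back into one answer.

```java
import java.util.*;

public class SaltDemo {
    public static final int SALTS = 4;  // hypothetical: split each hot key into 4 sub-keys

    // Stage 1: salt the key so one hot key maps to SALTS different shuffle keys.
    public static String saltKey(String key, Random rnd) {
        return key + "#" + rnd.nextInt(SALTS);
    }

    // Stage 2: strip the salt to recover the original key when merging results.
    public static String unsaltKey(String salted) {
        return salted.substring(0, salted.lastIndexOf('#'));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        // A skewed dataset: every record has the same (hypothetical) key.
        List<String> keys = Collections.nCopies(1000, "logs-2018-06-01");

        // Count records per salted key -- roughly what each shuffle partition would see.
        Map<String, Integer> perSaltedKey = new TreeMap<>();
        for (String k : keys)
            perSaltedKey.merge(saltKey(k, rnd), 1, Integer::sum);
        System.out.println(perSaltedKey);  // ~250 records under each of the 4 salted keys

        // Merging the salted counts recovers the original total for the real key.
        Map<String, Integer> merged = new HashMap<>();
        for (Map.Entry<String, Integer> e : perSaltedKey.entrySet())
            merged.merge(unsaltKey(e.getKey()), e.getValue(), Integer::sum);
        System.out.println(merged);  // {logs-2018-06-01=1000}
    }
}
```

In Spark you would do the same thing with a salted mapToPair, a reduceByKey/groupByKey on the salted key, then a second aggregation on the unsalted key.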