
spark job shuffle write super slow




Why is the Spark shuffle stage so slow for a 1.6 MB shuffle write and 2.4 MB input? Also, why is the shuffle write happening on only one executor? I am running a 3-node cluster with 8 cores each.

Please see my code and Spark UI pictures below.


<code>// Key each record by the Elasticsearch index name it should go to
JavaPairRDD<String, String> javaPairRDD = c.mapToPair(new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String arg0) throws Exception {
        try {
            if (org.apache.commons.lang.StringUtils.isEmpty(arg0)) {
                return new Tuple2<String, String>("", "");
            }
            return new Tuple2<String, String>(getESIndexName(arg0), arg0);
        } catch (Exception e) {
            System.out.println("******* exception in getESIndexName");
            return new Tuple2<String, String>("", "");
        }
    }
});

java.util.Map<String, Iterable<String>> map1 = javaPairRDD.groupByKey().collectAsMap();</code>



Hey @pradeep arumalla!
I'm not a specialist in coding or Spark, but did you try changing your groupByKey to reduceByKey (on the last line)?
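To see why that swap matters: groupByKey ships every raw value across the network, while reduceByKey combines values per key on each partition before shuffling. This is not Spark code, just a plain-Java sketch of that map-side combine idea; the class and method names (CombineDemo, localCombine) are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombineDemo {
    // Simulates reduceByKey's map-side combine: each "partition" collapses
    // its records to one value per key BEFORE anything is shuffled,
    // so far fewer records cross the network than with groupByKey,
    // which would ship all four raw values below.
    static Map<String, Integer> localCombine(List<String> partitionRecords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : partitionRecords) {
            counts.merge(key, 1, Integer::sum); // combine locally per key
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> partition = List.of("idx-a", "idx-a", "idx-b", "idx-a");
        // 4 input records shrink to 2 shuffle records (one per key)
        System.out.println(localCombine(partition)); // e.g. {idx-a=3, idx-b=1}
    }
}
```

Note that this only helps when the per-key result can be merged incrementally; if you truly need every raw value per key, groupByKey is unavoidable and partitioning becomes the lever instead.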

And about the executors (--num-executors): how are you launching your job? Is it via spark-submit? Could you share the command with us?

BTW, here are some links about shuffling 🙂

Hope this helps!


This looks like a data skew issue, meaning your group-by key is skewed, which leaves the data unbalanced across partitions. Inspect your key distribution; if the skew is real, you need to change the key or add a salt to the groupBy so the data can be distributed evenly.
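The salting idea can be sketched in plain Java (this is not the Spark API, just the key transform you would apply inside mapToPair; SaltDemo, saltKey, unsaltKey, and NUM_SALTS = 8 are all assumed names and values): append a random suffix so one hot key fans out into several sub-keys, aggregate per sub-key, then strip the suffix and merge.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class SaltDemo {
    static final int NUM_SALTS = 8; // assumed fan-out, e.g. one per core

    // Append a random salt so one hot key becomes up to NUM_SALTS
    // sub-keys, letting the shuffle spread it over several partitions.
    static String saltKey(String key, Random rnd) {
        return key + "#" + rnd.nextInt(NUM_SALTS);
    }

    // Strip the salt back off after the per-sub-key aggregation,
    // so results can be merged under the original key.
    static String unsaltKey(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        Map<String, Integer> buckets = new HashMap<>();
        for (int i = 0; i < 1000; i++) {
            buckets.merge(saltKey("hot-index", rnd), 1, Integer::sum);
        }
        // The single hot key now occupies several shuffle buckets
        System.out.println(buckets.size() + " sub-keys instead of 1");
        System.out.println(unsaltKey("hot-index#3")); // hot-index
    }
}
```

The trade-off is a second aggregation pass (per sub-key, then per original key), which is usually cheap compared to funneling the whole hot key through one task.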
