Support Questions

pradeepbill · 06-12-2018

Why is the Spark shuffle stage so slow for a 1.6 MB shuffle write and 2.4 MB of input? Also, why is the shuffle write happening on only one executor? I am running a 3-node cluster with 8 cores per node.

Please see my code and Spark UI pictures below.

Code:

<code>import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Key each record by its Elasticsearch index name; empty or failing records get the key "".
JavaPairRDD<String, String> javaPairRDD = c.mapToPair(new PairFunction<String, String, String>() {
    @Override
    public Tuple2<String, String> call(String arg0) throws Exception {
        try {
            if (org.apache.commons.lang.StringUtils.isEmpty(arg0)) {
                return new Tuple2<String, String>("", "");
            }
            return new Tuple2<String, String>(getESIndexName(arg0), arg0);
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("******* exception in getESIndexName");
        }
        // Fall back to an empty pair if getESIndexName throws.
        return new Tuple2<String, String>("", "");
    }
});

// Group all records by index name and collect the result to the driver.
java.util.Map<String, Iterable<String>> map1 = javaPairRDD.groupByKey().collectAsMap();</code>

vmurakami · 06-12-2018

Hey @pradeep arumalla!
I'm not a specialist in coding or Spark, but did you try replacing groupByKey with reduceByKey (on the last line)?
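
To show why reduceByKey can shuffle less than groupByKey, here is a small plain-Java sketch. It does not use Spark at all: it only simulates two input partitions with made-up keys ("logs-a", "logs-b") and counts how many records would cross the shuffle with and without combining inside each partition first. The class and method names are invented for illustration. In Spark terms the combined version corresponds roughly to mapping every value to 1L and calling reduceByKey(Long::sum). Note that this only helps when the values can be merged (counts, sums, and so on); if every record is needed per key, as in groupByKey().collectAsMap() above, reduceByKey will not shrink the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapSideCombineSketch {
    // groupByKey-style shuffle: every record leaves its partition unchanged.
    static int recordsShuffledWithoutCombine(List<List<String>> partitions) {
        int total = 0;
        for (List<String> partition : partitions) {
            total += partition.size();
        }
        return total;
    }

    // reduceByKey-style: each partition first collapses its records to one count per key.
    static List<Map<String, Long>> combineWithinPartitions(List<List<String>> partitions) {
        List<Map<String, Long>> combined = new ArrayList<>();
        for (List<String> partition : partitions) {
            Map<String, Long> counts = new HashMap<>();
            for (String key : partition) {
                counts.merge(key, 1L, Long::sum);
            }
            combined.add(counts);
        }
        return combined;
    }

    // Only the per-partition partial counts cross the shuffle.
    static int recordsShuffledWithCombine(List<List<String>> partitions) {
        int total = 0;
        for (Map<String, Long> counts : combineWithinPartitions(partitions)) {
            total += counts.size();
        }
        return total;
    }

    // Reduce side: merge the partial counts into the final per-key result.
    static Map<String, Long> finalCounts(List<List<String>> partitions) {
        Map<String, Long> result = new HashMap<>();
        for (Map<String, Long> counts : combineWithinPartitions(partitions)) {
            counts.forEach((key, n) -> result.merge(key, n, Long::sum));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two made-up input partitions holding index-name keys.
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("logs-a", "logs-a", "logs-b"),
            Arrays.asList("logs-a", "logs-a", "logs-a"));

        System.out.println("shuffled without combine: " + recordsShuffledWithoutCombine(partitions));
        System.out.println("shuffled with combine:    " + recordsShuffledWithCombine(partitions));
        System.out.println("final counts:             " + finalCounts(partitions));
    }
}
```

With inputs this small the difference is tiny, but the same effect grows with the number of repeated keys per partition.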

And about the executors (--num-executors): how are you launching your job? Is it with spark-submit? Could you share the command with us?

BTW, here are some links about shuffling 🙂
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-shuffle.html
https://0x0fff.com/spark-architecture-shuffle/

Hope this helps!

linehrr · 04-11-2019

This looks like a data skew issue: the key you group by is skewed, which leaves the data unbalanced across partitions. Inspect your key distribution; if the skew is real, change the key or add a salt to the group-by key so the data is distributed evenly.
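
A rough plain-Java sketch of both steps (checking the key distribution, then salting) might look like this. It does not use Spark: it only mimics hash partitioning by taking a non-negative hashCode modulo the partition count. The key names, the "#" salt format, and all class and method names are made up for illustration. On a real pair RDD, countByKey() is one way to get the per-key record counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SaltedKeySketch {
    // Same idea as hash partitioning: a non-negative hashCode modulo the partition count.
    static int partitionOf(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Records per key; a single dominant key signals skew.
    static Map<String, Integer> keyCounts(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    // Records per partition for a given list of keys.
    static int[] partitionSizes(List<String> keys, int numPartitions) {
        int[] sizes = new int[numPartitions];
        for (String key : keys) {
            sizes[partitionOf(key, numPartitions)]++;
        }
        return sizes;
    }

    // Hypothetical salt format: "<key>#<salt>".
    static String saltKey(String key, int salt) {
        return key + "#" + salt;
    }

    // Strip the salt again so salted groups can be merged back under the original key.
    static String unsaltKey(String salted) {
        return salted.substring(0, salted.lastIndexOf('#'));
    }

    // Round-robin salts keep this sketch reproducible; a random salt is the more common choice.
    static List<String> saltAll(List<String> keys, int saltBuckets) {
        List<String> salted = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            salted.add(saltKey(keys.get(i), i % saltBuckets));
        }
        return salted;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) keys.add("logs-hot");  // made-up skewed key
        for (int i = 0; i < 10; i++) keys.add("logs-rare");   // made-up small key

        System.out.println("records per key: " + keyCounts(keys));
        System.out.println("unsalted partition sizes: "
            + Arrays.toString(partitionSizes(keys, 8)));
        System.out.println("salted partition sizes:   "
            + Arrays.toString(partitionSizes(saltAll(keys, 8), 8)));
    }
}
```

After grouping on the salted key, the salt has to be stripped (unsaltKey here) and the partial groups merged under the original key, so salting adds a second, much smaller merge step.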

Cloudera Community

spark job shuffle write super slow