Created on 06-12-2018 02:13 PM - edited 08-17-2019 07:28 PM
Why is the Spark shuffle stage so slow for a 1.6 MB shuffle write and 2.4 MB input? Also, why is the shuffle write happening on only one executor? I am running a 3-node cluster with 8 cores each.
Please see my code and Spark UI screenshots below.
Code:
<code>
// Map each record to (Elasticsearch index name, record); empty or failing records map to ("", "").
JavaPairRDD<String, String> javaPairRDD = c.mapToPair(new PairFunction<String, String, String>() {
    @Override
    public Tuple2<String, String> call(String arg0) throws Exception {
        try {
            if (org.apache.commons.lang.StringUtils.isEmpty(arg0)) {
                return new Tuple2<String, String>("", "");
            }
            return new Tuple2<String, String>(getESIndexName(arg0), arg0);
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("******* exception in getESIndexName");
        }
        return new Tuple2<String, String>("", "");
    }
});

// Group all records by index name and pull the whole result back to the driver.
java.util.Map<String, Iterable<String>> map1 = javaPairRDD.groupByKey().collectAsMap();
</code>
Created 06-12-2018 03:14 PM
Hey @pradeep arumalla!
I'm not a specialist in coding or Spark, but did you try changing your groupByKey to reduceByKey (on the last line)?
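Something like this might work (just an untested sketch: it assumes the javaPairRDD from your post, and uses aggregateByKey rather than reduceByKey since your values are collected into lists instead of being reduced to a single value; unlike groupByKey, it combines values on the map side before the shuffle):
<code>
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// aggregateByKey combines values per key within each partition before shuffling,
// unlike groupByKey, which ships every record individually to the reducer.
JavaPairRDD<String, List<String>> grouped = javaPairRDD.aggregateByKey(
    new ArrayList<String>(),                              // initial empty list per key
    (list, value) -> { list.add(value); return list; },   // fold one value into the list
    (l1, l2) -> { l1.addAll(l2); return l1; });           // merge partial lists across partitions

Map<String, List<String>> map1 = grouped.collectAsMap();
</code>
That said, if every value ends up in the result anyway, the shuffle volume stays similar; the big win from reduceByKey-style operators comes when the values actually get reduced.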
And about the executors (--num-executors): how are you launching your job? Is it via spark-submit? Could you share the command with us?
BTW, here are some links about shuffling 🙂
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-shuffle.html
https://0x0fff.com/spark-architecture-shuffle/
Hope this helps!
Created 04-11-2019 09:11 PM
This looks like a data skew issue: your groupBy key is skewed, resulting in unbalanced data between partitions. You can inspect your key distribution; if the skew is real, you need to change the key or add a salt to the groupBy so the data is evenly distributed (rough sketch below).
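As a rough illustration (assuming the javaPairRDD from the original post; the names salted and SALT_BUCKETS are made up here, and 10 is an arbitrary salt size):
<code>
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import scala.Tuple2;

// 1. Inspect the key distribution: a few very large counts means skew.
Map<String, Long> keyCounts = javaPairRDD.countByKey();
keyCounts.forEach((k, v) -> System.out.println(k + " -> " + v));

// 2. Salt the key: append a random suffix so one hot key spreads across many partitions.
final int SALT_BUCKETS = 10; // arbitrary; tune to the observed skew
JavaPairRDD<String, String> salted = javaPairRDD.mapToPair(t ->
    new Tuple2<>(t._1() + "#" + ThreadLocalRandom.current().nextInt(SALT_BUCKETS), t._2()));

// 3. Group on the salted key; strip the "#n" suffix and merge per original key afterwards.
</code>
After the heavy aggregation runs on the salted keys, a second, much smaller pass over the original keys merges the per-salt partial results.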