Spark-SQL and skewness

Hi,

I have a question.

I have two tables that I want to join, inserting the result into a third table. The only query in my Scala code is below.

hc.sql("insert overwrite table table3 select a.id, a.col1, b.col2, b.col3 from table1 a left join table2 b on a.id = b.id")

Below are some stats for the two tables:

hive> select year, count(*) from table1 GROUP BY year;

year    count
2012    23647
2014    18132609
2010    14638
2013    312103
2009    110228
2011    23854
2015    44628890

hive> select year, count(*) from table2 GROUP BY year;

year    count
2014    8315513
2015    38881691
2012    29
2013    180718
2011    22

Note: the tables are not partitioned.

The above query is executed using spark-submit (command below), and the execution is highly skewed: the load is uneven across executors, with 2 executors handling 75% of it, so the overall run takes more than 2 hours. The same query in Hive executes in 15 minutes.

spark-submit --class myDriver --master yarn --deploy-mode cluster \
  --driver-memory 15g --driver-cores 3 \
  --num-executors 25 --executor-cores 5 --executor-memory 15g \
  --conf "spark.executor.memory=-XX:+UseG1GC -XX:+PrintFlagsFinal -XX:PermSize=1g -XX:+PrintReferenceGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -Xms10g -Xmx10g -XX:NewRatio=2 -XX:InitiatingHeapOccupancyPercent=30 -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThread=5" \
  --conf spark.yarn.executor.memoryOverhead=2000 \
  --conf spark.sql.shuffle.partitions=50 \
  --conf spark.storage.memoryFraction=0.20 \
  --conf spark.shuffle.memoryFraction=0.60 \
  --conf spark.shuffle.manager=tungsten-sort \
  my.jar
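
For anyone trying to picture the uneven load described above, here is a minimal sketch in plain Scala (no Spark required). It models a shuffle as "row goes to partition hash(key) mod N", with N = 50 taken from spark.sql.shuffle.partitions in the command above. This is a simplification and not Spark's actual hash function, and the key distribution (10,000 distinct ids plus one id repeated 30,000 times) is purely hypothetical, since the real distribution of id in table1 and table2 is not known.

```scala
// Simplified model of a hash-partitioned shuffle. NOT Spark's real
// partitioner; it only illustrates how one hot join key overloads a partition.
object SkewSketch {
  // Map a key to a partition index in [0, numPartitions).
  def partitionOf[K](key: K, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw // keep the index non-negative
  }

  // Count how many rows land in each partition for the given join keys.
  def partitionSizes[K](keys: Seq[K], numPartitions: Int): Map[Int, Int] =
    keys
      .groupBy(k => partitionOf(k, numPartitions))
      .map { case (partition, rows) => partition -> rows.size }

  def main(args: Array[String]): Unit = {
    val numPartitions = 50                       // spark.sql.shuffle.partitions=50
    val distinctIds: Seq[Int] = 1 to 10000       // hypothetical: well-spread ids
    val hotIds: Seq[Int] = Seq.fill(30000)(0)    // hypothetical: one id repeated many times
    val sizes = partitionSizes(distinctIds ++ hotIds, numPartitions)
    val (partition, rows) = sizes.maxBy(_._2)
    println(s"largest partition: $partition with $rows of ${sizes.values.sum} rows")
  }
}
```

In this toy setup every row with the repeated id ends up in the same partition, so a single task processes most of the data while the other 49 partitions hold 200 rows each. Whether something similar is happening here would depend on how the id values are actually distributed.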

I have 15 active nodes, with 6 cores per node available to Spark and ~49 GB of RAM per node.

Any tips on reducing the skew?
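
As a quick sanity check on these numbers, the arithmetic can be written out as a small Scala sketch. The values are copied from the spark-submit command and the cluster description above. The one assumption is that the stage after the shuffle runs one task per shuffle partition, which is how spark.sql.shuffle.partitions is normally understood, but it has not been verified against this job.

```scala
// Back-of-the-envelope arithmetic only; nothing here talks to Spark or YARN.
object SlotMath {
  // From the spark-submit command.
  val numExecutors = 25
  val executorCores = 5
  val shufflePartitions = 50

  // From the cluster description.
  val nodes = 15
  val coresPerNode = 6

  // Concurrent task slots requested from YARN.
  val requestedTaskSlots: Int = numExecutors * executorCores

  // Cores the cluster makes available to Spark.
  val availableCores: Int = nodes * coresPerNode

  // Assumption: one task per shuffle partition in the post-shuffle stage,
  // so no more than this many slots can be busy during the join.
  val maxBusySlotsAfterShuffle: Int = math.min(shufflePartitions, requestedTaskSlots)

  def main(args: Array[String]): Unit = {
    println(s"requested task slots:         $requestedTaskSlots")
    println(s"cores available to Spark:     $availableCores")
    println(s"max busy slots after shuffle: $maxBusySlotsAfterShuffle")
  }
}
```

So the command requests 125 task slots on a cluster with 90 cores available to Spark, and under the stated assumption the join stage can keep at most 50 of those slots busy. Whether YARN enforces the core count depends on its scheduler configuration, which is not shown here.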

Regards

Praveen Khare
