I'm using SparkSQL to build a fact table out of 5 dimensions. I'm facing a performance issue (the job takes several hours to complete), and even after exhaustive googling I see no solution. These are the settings I have tried tuning, but with no success.
sqlContext.sql("set spark.sql.shuffle.partitions=10"); // varied between 10 and 5000
sqlContext.sql("set spark.sql.autoBroadcastJoinThreshold=500000000"); // 500 MB, tried 1 GB
Most of the RDDs are nicely partitioned (500 partitions each); however, the largest dimension (images) is not partitioned at all. Maybe this can lead to a solution? Below is the code I used to build the fact table.