
Spark configuration for best performance

I have the configurations below set for my Spark session, but I am getting the same performance in Hive and Spark when I run any SQL. My understanding is that Spark should be much faster than Hive (which runs on MapReduce), but I am not seeing that. Am I doing something wrong? Am I missing any configuration? Any help is appreciated.


from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('yarn') \
    .appName('instance to person:household') \
    .enableHiveSupport() \
    .config('spark.authenticate', 'true') \
    .config('spark.shuffle.manager', 'sort') \
    .config('spark.shuffle.service.enabled', 'true') \
    .config('spark.dynamicAllocation.enabled', 'true') \
    .config('spark.logConf', 'true') \
    .config('spark.shuffle.blockTransferService', 'nio') \
    .config('spark.sql.broadcastTimeout', '2400') \
    .config('spark.dynamicAllocation.executorIdleTimeout', '2400') \
    .config('spark.dynamicAllocation.minExecutors', '10') \
    .config('spark.driver.cores', '15') \
    .config('spark.executor.memory', '2G') \
    .config('spark.driver.memory', '60G') \
    .config('spark.rdd.compress', 'true') \
    .getOrCreate()
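For comparison, here is a minimal sketch of how such a session is often sized, shifting memory from the driver (which only plans queries and collects small results) to the executors, where the SQL work actually runs. Every value below is an illustrative assumption, not a setting from the post; tune them to your own cluster nodes.

```python
# Hypothetical rebalanced settings -- node sizes are assumptions, not
# taken from the original question; adjust to your cluster.
suggested = {
    # Give memory to the executors, where joins/aggregations execute.
    'spark.executor.memory': '8g',
    'spark.executor.cores': '4',
    # The driver coordinates work; it rarely needs tens of gigabytes.
    'spark.driver.memory': '4g',
    'spark.driver.cores': '2',
    # Dynamic allocation requires the external shuffle service.
    'spark.dynamicAllocation.enabled': 'true',
    'spark.shuffle.service.enabled': 'true',
    'spark.dynamicAllocation.minExecutors': '2',
}

# Applying them to a builder (requires pyspark and YARN access):
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.master('yarn').enableHiveSupport()
# for key, value in suggested.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```

The design idea is that many small executors with a few gigabytes each usually outperform a single heavyweight driver, since executor memory is what feeds shuffles, joins, and caching.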