
Spark configuration for best performance

Hi,

I have the configuration below set for my Spark session. I am getting the same performance in Hive and Spark when I run any SQL. My understanding is that Spark should be much faster than Hive (which runs on MapReduce), but I am not seeing that. Am I doing something wrong? Am I missing any configuration? Any help is appreciated.


from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master('yarn')
    .appName('instance to person:household')
    .enableHiveSupport()
    # authentication, shuffle service, dynamic allocation, and config logging
    .config('spark.authenticate', 'true')
    .config('spark.shuffle.manager', 'sort')
    .config('spark.shuffle.service.enabled', 'true')
    .config('spark.dynamicAllocation.enabled', 'true')
    .config('spark.logConf', 'true')
    .config('spark.shuffle.blockTransferService', 'nio')
    # timeouts, in seconds
    .config('spark.sql.broadcastTimeout', '2400')
    .config('spark.network.timeout', '2400')
    .config('spark.dynamicAllocation.executorIdleTimeout', '2400')
    # minimum executors, driver/executor sizing, RDD compression
    .config('spark.dynamicAllocation.minExecutors', '10')
    .config('spark.driver.cores', '15')
    .config('spark.executor.memory', '2G')
    .config('spark.driver.memory', '60G')
    .config('spark.rdd.compress', 'true')
    .getOrCreate()
)
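
In case it is relevant, this is how I check which settings the running session actually picks up (just a minimal check using the standard getConf() call, nothing specific to my cluster):

# print every configuration key/value the live session resolved,
# so settings that were ignored or overridden are easy to spot
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, '=', value)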
