I am new to PySpark. I am running the PySpark code below from Zeppelin in a CDP environment. When I run the equivalent query from Hive it takes about 2.75 minutes, whereas PySpark takes about 11 minutes.
Hive (running from CDP Hue): select col1,col2,col3 from db.tab1 where col5 = <value>
PySpark:
%pyspark
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName("TestApp").getOrCreate())
spark.conf.set('master', 'yarn')
spark.conf.set('deploy-mode', 'cluster')
spark.conf.set('spark.default.parallelism', '24')
spark.conf.set('spark.executor.memory', '12g')#12g
spark.conf.set('spark.executor.cores', '130') #130
spark.conf.set('spark.executor.containers', '35')
spark.conf.set('spark.driver.memory', '20')
spark.conf.set('spark.checkpoint.compress', 'true')
spark.conf.set('spark.driver.maxResultSize' , '20g')
spark.conf.set('spark.dynamicAllocation.enabled','true')
spark.conf.set('spark.sh.service.enabled', 'true')
spark.conf.set('spark.sql.orc.impl', 'native')
spark.conf.set('spark.sql.hive.convertMetastoreOrc', 'true')
spark.conf.set('spark.sql.broadcastTimeout', '36000')
df1 = sqlContext.sql("select * from db.tab1")
df1.select("col1", "col2","col3").where("col5 = <value>").show(5)
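For comparison, this is the more direct form I would have expected to behave like the Hive query above (same columns, same <value> placeholder; df2 is just an illustrative name):
%pyspark
# Push the column selection and the filter into the SQL itself, mirroring the Hive
# query above. <value> is the same placeholder as before and still has to be filled in.
df2 = spark.sql("select col1, col2, col3 from db.tab1 where col5 = <value>")
df2.show(5)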
In YARN I can see that the PySpark application is not using the custom configuration specified above.
I followed the link below to set the Spark runtime configuration:
https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-me...
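If I understand that answer correctly, the executor number, cores, and memory have to be supplied before the SparkSession (and its SparkContext) is created, for example on the builder rather than with spark.conf.set() afterwards. This is only my reading of it, and the values below are illustrative rather than tuned:
%pyspark
from pyspark.sql import SparkSession

# My reading of the linked answer: executor settings must be in place before the
# SparkContext exists, so they go on the builder (or into the Zeppelin interpreter
# settings / spark-submit flags), not into spark.conf.set() after getOrCreate().
# In Zeppelin a session may already exist, in which case getOrCreate() would ignore
# these and they would have to go into the interpreter settings instead.
spark = (
    SparkSession.builder
    .appName("TestApp")
    .config("spark.executor.memory", "12g")     # per-executor memory
    .config("spark.executor.cores", "4")        # cores per executor (illustrative)
    .config("spark.executor.instances", "35")   # number of executors
    .getOrCreate()
)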
Am I missing something?