Support Questions

PySpark Custom Config not being considered


New Contributor

I am new to PySpark. I am running the PySpark code below from Zeppelin in a CDP environment. When I run the same query from Hive it takes 2.75 minutes, whereas PySpark takes 11 minutes.

Hive (run from CDP Hue): select col1, col2, col3 from db.tab1 where col5 = <value>

 

PySpark: 

%pyspark
from pyspark.sql import SparkSession
spark = (SparkSession.builder.appName("TestApp").getOrCreate())
spark.conf.set('master', 'yarn')
spark.conf.set('deploy-mode', 'cluster')
spark.conf.set('spark.default.parallelism', '24')
spark.conf.set('spark.executor.memory', '12g')#12g
spark.conf.set('spark.executor.cores', '130') #130
spark.conf.set('spark.executor.containers', '35')
spark.conf.set('spark.driver.memory', '20')
spark.conf.set('spark.checkpoint.compress', 'true')
spark.conf.set('spark.driver.maxResultSize' , '20g')
spark.conf.set('spark.dynamicAllocation.enabled','true')
spark.conf.set('spark.sh.service.enabled', 'true')
spark.conf.set('spark.sql.orc.impl', 'native')
spark.conf.set('spark.sql.hive.convertMetastoreOrc', 'true')
spark.conf.set('spark.sql.broadcastTimeout', '36000')

 

df1 = spark.sql("select * from db.tab1")
df1.select("col1", "col2", "col3").where("col5 = <value>").show(5)

In YARN I can see that the PySpark application is not using the custom configuration specified above.

 

I followed the link below to set the Spark run configuration:

https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-me...

 

Am I missing something?


Re: PySpark Custom Config not being considered

Contributor

Hi @SudEl 

 

Please try setting the required parameters (executor memory, cores, and other tuning parameters) in the Spark interpreter settings rather than with spark.conf.set() at runtime. Resource properties such as spark.executor.memory and spark.executor.cores are read when the Spark application is launched, so setting them after the SparkSession has already been created has no effect on the running application.
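
As a sketch, here is roughly what the Zeppelin Spark interpreter properties could look like (the property names are standard Spark settings; the values mirror the original post and are illustrative, not tuned recommendations). Note that spark.executor.containers is not a recognized Spark property (spark.executor.instances controls the executor count), and spark.sh.service.enabled appears to be a typo for spark.shuffle.service.enabled:

```properties
# Zeppelin Spark interpreter properties (Interpreter menu -> spark).
# These are applied when the interpreter launches its Spark application,
# so they reach YARN, unlike runtime spark.conf.set() calls.
spark.master                        yarn
spark.submit.deployMode             cluster
spark.executor.memory               12g
spark.executor.cores                5
spark.executor.instances            35
spark.driver.memory                 20g
spark.driver.maxResultSize          20g
spark.default.parallelism           24
spark.dynamicAllocation.enabled     true
spark.shuffle.service.enabled       true
spark.sql.orc.impl                  native
spark.sql.hive.convertMetastoreOrc  true
```

Also note that spark.executor.cores of 130 in the original snippet would exceed the cores available on a typical node; values around 4-5 cores per executor are common. And with spark.dynamicAllocation.enabled set to true, spark.executor.instances only acts as the initial executor count.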