06-20-2021
06:04 PM
I am new to PySpark. I am running the PySpark code below from Zeppelin in a CDP environment. The same query takes 2.75 minutes from Hive, whereas PySpark takes 11 minutes.

Hive (run from CDP Hue):

```sql
select col1, col2, col3 from db.tab1 where col5 = <value>
```

PySpark:

```python
%pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestApp").getOrCreate()
spark.conf.set('master', 'yarn')
spark.conf.set('deploy-mode', 'cluster')
spark.conf.set('spark.default.parallelism', '24')
spark.conf.set('spark.executor.memory', '12g')  # 12g
spark.conf.set('spark.executor.cores', '130')  # 130
spark.conf.set('spark.executor.containers', '35')
spark.conf.set('spark.driver.memory', '20')
spark.conf.set('spark.checkpoint.compress', 'true')
spark.conf.set('spark.driver.maxResultSize', '20g')
spark.conf.set('spark.dynamicAllocation.enabled', 'true')
spark.conf.set('spark.sh.service.enabled', 'true')
spark.conf.set('spark.sql.orc.impl', 'native')
spark.conf.set('spark.sql.hive.convertMetastoreOrc', 'true')
spark.conf.set('spark.sql.broadcastTimeout', '36000')

df1 = sqlContext.sql("select * from db.tab1")
df1.select("col1", "col2", "col3").where("col5 = <value>").show(5)
```

In YARN I can see that the PySpark application is not using the custom configuration specified above. I followed the link below to set the Spark run configuration: https://stackoverflow.com/questions/37871194/how-to-tune-spark-executor-number-cores-and-executor-memory

Am I missing something?
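One likely cause worth noting: resource-level settings such as executor memory, executor cores, and dynamic allocation are fixed when the application launches, so calling `spark.conf.set()` after `getOrCreate()` has already returned a session will not apply them, which would explain why YARN shows the defaults. A minimal sketch of passing such settings at session-build time instead (the config values here are illustrative, not recommendations; `spark.shuffle.service.enabled` is assumed to be what `spark.sh.service.enabled` was meant to be):

```python
from pyspark.sql import SparkSession

# Resource settings must be supplied before the session exists; once
# getOrCreate() returns an already-running session, conf.set() calls for
# executor/driver sizing are ignored.
spark = (
    SparkSession.builder
    .appName("TestApp")
    .config("spark.executor.memory", "12g")
    .config("spark.executor.cores", "5")               # cores per executor, not a total
    .config("spark.driver.memory", "20g")              # note the unit suffix
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")   # required for dynamic allocation
    .getOrCreate()
)

# Push the filter into the SQL rather than scanning the whole table first.
df1 = spark.sql("select col1, col2, col3 from db.tab1 where col5 = '<value>'")
df1.show(5)
```

In Zeppelin specifically, the interpreter usually creates the session before your paragraph runs, so these settings typically belong in the Spark interpreter configuration (or `spark-submit` flags) rather than in notebook code.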