
Spark performance is not as expected using PySpark; it takes a lot of time to process a single query.


New Contributor

Dear All,

I have a 5-node HDP 3.1.0 cluster with Ambari 2.7.3. I have installed the HDFS, Hive, HBase, and Spark services on the cluster:

ser1.dev.local - HDFS, YARN

ser2.dev.local - Hive

ser3.dev.local - HBase

ser4.dev.local - Zookeeper

ser5.dev.local - Spark

I have two workstations, one for development and another running MongoDB:

cpu1.dev.local - Spark client, Anaconda, python, Jupyter notebook

cpu2.dev.local - MongoDB

I installed the Spark client on my workstation to access HDFS and Spark on the cluster using the following command:

sudo yum install spark2_3_1_0_0_78* 

I copied all configuration files from the cluster nodes to the workstation, and I am able to connect to Spark and retrieve data from the HDFS cluster. The following is the code I am using to connect to MongoDB with PySpark:

import pyspark
from pyspark.sql import SQLContext

# Wrap the chained builder calls in parentheses; without them the
# leading-dot continuation lines are a SyntaxError.
sparkConf = (pyspark.SparkConf()
             .setMaster("spark://ser5.dev.local:7077")
             .setAppName("SparkSr638")
             .setAll([('spark.executor.memory', '16g'),
                      ('spark.executor.cores', '8'),
                      ('spark.cores.max', '32'),
                      ('spark.driver.memory', '16g'),
                      ('spark.driver.maxResultSize', '3g')]))
sparkConf.set("spark.mongodb.input.uri", "mongodb://cpu2.dev.local/gkfrm.DayTime")
sc = pyspark.SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()

As mentioned above, I created the Spark application with the following configuration:

Spark executor memory: 16 GB

Allocated memory: 16 GB

Cores allocated: 24

All other configuration is the default set by HDP and Ambari.

The database I connected to has around 6,000,000 records.

Now my question is: when I run the following code in a Jupyter notebook using PySpark:

 df.collect() 

running the above code takes around 3-4 hours.

Why is it taking so much time to process the data? Is there any configuration that needs tweaking in HDP Spark or YARN?
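For context on why collect() is so slow: it pulls every record from the executors back into the driver process. A rough back-of-the-envelope estimate (assuming a hypothetical average document size of ~1 KB; the real size depends on your data) suggests the result set would exceed the 3g spark.driver.maxResultSize configured above:

```python
# Rough driver-side memory estimate for df.collect() on 6,000,000 records.
# The ~1 KB average document size is an assumption for illustration only.
records = 6_000_000
avg_doc_bytes = 1024
total_gib = records * avg_doc_bytes / 1024**3
print(f"{total_gib:.2f} GiB")  # well above a 3g spark.driver.maxResultSize
```

Because collect() forces the entire result onto the driver, fetching only what is needed (for example df.limit(20).show(), or aggregations that run on the executors) is usually the better pattern.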

Also, just to mention: after installing YARN and HBase, Ambari shows one warning for YARN Timeline Service V2.0 Reader:

ATSv2 HBase Application 
The HBase application reported a 'STARTED' state. Check took 2.253s 

Does Spark performance depend on this warning? How do I resolve it? How should I boost Spark performance? Please help. Thank you in advance.

1 REPLY

Re: Spark performance is not as expected using PySpark; it takes a lot of time to process a single query.

New Contributor

How are you submitting the job? At the CLI, do you run:

spark2-submit --master yarn ...

If not, it will run locally (on one server).
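One quick way to confirm where a job actually runs is to inspect sc.master inside the notebook. A minimal sketch (the master URLs below are illustrative; sc.master simply returns whatever string the SparkContext was built with):

```python
# sc.master reports the master URL the SparkContext was built with.
# "local[*]" means everything runs in a single JVM on one machine;
# "yarn" or "spark://host:7077" means work is distributed to a cluster.
def runs_locally(master_url):
    return master_url.startswith("local")

print(runs_locally("local[*]"))                     # single-machine mode
print(runs_locally("spark://ser5.dev.local:7077"))  # standalone cluster
```

Note that the code in the question sets a standalone master (spark://ser5.dev.local:7077) rather than YARN, so checking which master is in effect is worthwhile either way.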