I have a 5-node HDP 3.1.0 cluster managed with Ambari 2.7.3. I have installed the HDFS, Hive, HBase, and Spark services on the cluster.
ser1.dev.local - HDFS, YARN
ser2.dev.local - Hive
ser3.dev.local - HBase
ser4.dev.local - Zookeeper
ser5.dev.local - Spark
I also have 2 workstations, one for development and one running MongoDB:
cpu1.dev.local - Spark client, Anaconda, python, Jupyter notebook
cpu2.dev.local - MongoDB
I have installed the Spark client on my workstation to access HDFS and Spark on the cluster, using the following command:
sudo yum install spark2_3_1_0_0_78*
I have copied all configuration files from the cluster nodes to the workstation. I am able to connect to Spark and retrieve data from HDFS on the cluster.
The following is the code that I am using to connect to MongoDB with PySpark:
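import pyspark
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
# Point the application at the standalone Spark master on the cluster
sparkConf = pyspark.SparkConf().setMaster("spark://ser5.dev.local:7077")
sparkConf.setAll([('spark.executor.cores', '8'), ('spark.cores.max', '32')])
sc = pyspark.SparkContext(conf = sparkConf)
sqlContext = SQLContext(sc)
# Load the MongoDB collection as a DataFrame through the MongoDB Spark connector
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()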
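The read above does not show where the MongoDB connection settings come from. As a minimal sketch, assuming the property names of the MongoDB Spark Connector 2.x (spark.mongodb.input.*), the default port 27017 on cpu2.dev.local, and placeholder database and collection names (mydb and mycoll are hypothetical), the settings could be assembled like this and handed to SparkConf.setAll(...) before the SparkContext is created:

```python
def mongo_input_conf(host, database, collection, port=27017, partition_size_mb=64):
    """Build SparkConf key/value pairs for reading one MongoDB collection.

    Property names follow the MongoDB Spark Connector 2.x convention
    (assumption: that connector version is the one on the classpath).
    """
    uri = "mongodb://{}:{}/{}.{}".format(host, port, database, collection)
    return [
        ("spark.mongodb.input.uri", uri),
        # Sample-based partitioner: splits the collection into chunks of
        # roughly partition_size_mb so several executors can read in parallel.
        ("spark.mongodb.input.partitioner", "MongoSamplePartitioner"),
        ("spark.mongodb.input.partitionerOptions.partitionSizeMB", str(partition_size_mb)),
    ]


# "mydb" and "mycoll" are placeholders, not names from my setup.
conf_pairs = mongo_input_conf("cpu2.dev.local", "mydb", "mycoll")
for key, value in conf_pairs:
    print(key, "=", value)
```

The partition size is the knob I would expect to matter here, since it influences how many partitions (and therefore parallel tasks) the read is split into.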
As mentioned above, I created the Spark application with the following configuration:
spark executor memory: 16 GB
Allocated memory: 16 GB
All other configuration is the default set by HDP and Ambari.
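To make these numbers concrete, here is a rough sketch of what the settings ask the standalone master for. It assumes the executor count is capped at spark.cores.max divided by spark.executor.cores, and that the workers have enough free cores and memory to grant all of it; the helper name is hypothetical.

```python
def requested_resources(cores_max, executor_cores, executor_memory_gb):
    """Upper bound on executors, task slots and memory the application requests."""
    executors = cores_max // executor_cores
    return {
        "executors": executors,
        "task_slots": executors * executor_cores,
        "total_executor_memory_gb": executors * executor_memory_gb,
    }


# Values taken from the configuration in this post.
print(requested_resources(cores_max=32, executor_cores=8, executor_memory_gb=16))
```

With the values from this post that works out to at most 4 executors, 32 concurrent tasks, and 64 GB of executor memory in total. Those task slots only help if the data is split into enough partitions, since each partition is processed by one task.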
The database I am connecting to has around 6,000,000 records.
Now my question is, when I run the following code in a Jupyter notebook using PySpark:
PySpark takes around 3-4 hours to run the above code.
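To put that run time in perspective, here is a quick back-of-the-envelope calculation of the average throughput, using only the record count and run time quoted in this post:

```python
RECORDS = 6_000_000  # approximate number of records in the database


def records_per_second(records, hours):
    """Average throughput for a job that processes `records` in `hours`."""
    return records / (hours * 3600)


for hours in (3, 4):
    print("{} h -> {:.0f} records/s".format(hours, records_per_second(RECORDS, hours)))
```

That is somewhere between roughly 400 and 550 records per second overall, which seems very low to me for an application that has been given 32 cores.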
Why is it taking so much time to process the data? Is there any configuration in HDP, Spark, or YARN that needs tweaking?
Also, just to mention: after installing YARN and HBase, Ambari is showing one warning for YARN TIMELINE SERVICE V2.0 READER:
ATSv2 HBase Application
The HBase application reported a 'STARTED' state. Check took 2.253s
Does Spark performance depend on this warning? How do I resolve it? How can I boost Spark performance? Please help.
Thank you in advance.