
How do I identify Spark 2.3.1 installed on HDP 3.0 is working properly?


I have installed an HDP 3.0 cluster on 5 nodes and installed Spark 2.3.1 via the Ambari service on one of the nodes. The node where Spark is installed is ser5.dev.local.

I am trying to access this Spark installation from another system that is not part of the cluster, say cpu686.dev.local, using pyspark in a Jupyter notebook. Please see the code below for reference:

import csv
import pyspark
from pyspark import SQLContext

conf = pyspark.SparkConf().setMaster("spark://ser5.dev.local:7077").setAppName("SparkServer1").setAll([('spark.executor.memory', '16g'), ('spark.executor.cores', '8'), ('spark.cores.max', '8'), ('spark.driver.memory', '16g')])
sc = pyspark.SparkContext(conf=conf)
rddFile = sc.textFile("Filterd_data.csv")
rddFile = rddFile.mapPartitions(lambda x: csv.reader(x))
rddFile.collect()

The connection itself seems fine: the Spark context is created using the spark://ser5.dev.local:7077 URL, and the RDD rddFile is created successfully. But when I run rddFile.collect(), it just keeps running with no output and no error. We even tried a CSV file with fewer than 10 records, and the code still kept running.

Is there any way to configure Spark, or somewhere I can get the master URL, so that I can check the running application in Spark? When I click on the Spark UI in Ambari it opens the Spark History Server.


We also tried reading the CSV file from HDFS using the following code:

conf = pyspark.SparkConf().setMaster("spark://ser5.dev.local:7077").setAppName("SparkServer1").setAll([('spark.executor.memory', '16g'), ('spark.executor.cores', '8'), ('spark.cores.max', '8'), ('spark.driver.memory','16g')])
sc = pyspark.SparkContext(conf=conf)
sqlC = SQLContext(sc)
df = sqlC.read.csv("hdfs://ser2.dev.local:8020/UnusualTime/Filterd_data.csv")

The issue remains the same.

Note: I installed Spark using the following documentation:

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.0/installing-spark/content/installing_spark_u...

1 ACCEPTED SOLUTION


@Girish Khole

How did you install the Spark client on the node that is not part of the cluster? There are a few considerations if the node is not managed by Ambari, such as:

1. The Spark client version should be the same as the one in the cluster.

2. You need to make sure all the configuration files for HDFS/YARN/Hive are copied from the cluster.

3. When you launch a client with a spark:// master URL, the job does not run on the cluster; it runs against a standalone master. To test against the cluster you need to use --master yarn (which can be combined with either the client or cluster deploy mode). See the sketch below.
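For example, a minimal sketch of what point 3 could look like from the external node, assuming the cluster's yarn-site.xml, core-site.xml and hdfs-site.xml have been copied into /etc/hadoop/conf on that node (the path is an assumption; use wherever you copied them) and that the same Spark 2.3.1 client is installed there:

import os

# Must be set before the SparkContext (and its JVM) is created, so the client
# can find the cluster's ResourceManager and NameNode addresses.
os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"

import pyspark

# "yarn" submits the application to the cluster instead of looking for a
# standalone master at spark://ser5.dev.local:7077.
conf = pyspark.SparkConf().setMaster("yarn").setAppName("SparkServer1")
sc = pyspark.SparkContext(conf=conf)

rdd = sc.textFile("hdfs://ser2.dev.local:8020/UnusualTime/Filterd_data.csv")
print(rdd.take(5))

Submitted this way, the application should show up in the YARN ResourceManager UI rather than only in the Spark History Server.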

HTH


3 REPLIES



@Girish Khole, did the above help?


Thank you very much @Felix Albani. I copied yarn-site.xml, core-site.xml, and hdfs-site.xml to the standalone Spark instance, started Spark on HDP, and the connection was established successfully. The issue is resolved. Thanks.
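For anyone who lands on the same problem: once the YARN config files are in place and the context is created with the yarn master, a quick sanity check (under the same assumptions as the sketch above) that the job really runs on the cluster is to look at the YARN application id Spark reports:

# The YARN application id assigned to this Spark job; the same id should be
# visible in the YARN ResourceManager UI and, after completion, in the Spark History Server.
print(sc.applicationId)

# Small end-to-end check: read a few lines from HDFS and bring them back to the driver.
print(sc.textFile("hdfs://ser2.dev.local:8020/UnusualTime/Filterd_data.csv").take(3))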