
Configuring Spark to be executed remotely [Cloudera CDH 6.0.1]


Hello everyone. After installing and configuring the cluster with Cloudera CDH 6.0.1 (HBase, HDFS, Spark, YARN, ZooKeeper), I cannot run a Spark job from a remote machine. Basically, I have a PySpark script on my laptop and I want to execute it on the remote cluster.

 

On my laptop, I downloaded the Spark 2.2.0 sources. Then, I logged in to the remote machine on which I had executed cloudera-manager-installer.bin and copied all the files from /etc/spark/conf/ into spark/conf on my laptop. Afterwards, still on my laptop, I set SPARK_HOME to the full path of the Spark folder (the variable is set in spark-env.sh).
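From what I have read, spark-submit with --master yarn finds the cluster through the HADOOP_CONF_DIR (or YARN_CONF_DIR) environment variable, so spark-env.sh should also contain something along these lines (the path is just a placeholder for wherever the copied configuration files live on my laptop):

export HADOOP_CONF_DIR=/home/simon/spark/conf
export YARN_CONF_DIR=$HADOOP_CONF_DIR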

 

To test Spark, I run:

bin/spark-submit --master yarn pi.py
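(If I understand correctly, the default deploy mode is client, i.e. the driver runs on my laptop and only the executors run on the cluster, so this should be equivalent to:)

bin/spark-submit --master yarn --deploy-mode client pi.py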

 

However, the terminal prints the following output (note that the script does not write any files to HDFS; it simply computes Pi):

 

ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: Permission denied: user=simon, access=WRITE, inode="/user":hdfs:supergroup:drwxr-xr-x

It seems that my script is trying to write to /user (on HDFS), where it is not supposed to read or write anything, because it only needs to compute Pi and print the value in the terminal.
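From what I have read, Spark on YARN stages the job files under the submitting user's HDFS home directory (/user/simon in my case), so that directory has to exist and be writable by the user. A possible fix (which I have not tried yet) would be to create it as the hdfs superuser:

sudo -u hdfs hdfs dfs -mkdir -p /user/simon
sudo -u hdfs hdfs dfs -chown simon /user/simon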

 

The following is the Python script:

 

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession


if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()

 

I also looked at this link, but I still got stuck when it comes to running the Python script.

 

EDIT 1: I have solved the HDFS permission problem by unchecking 'Check HDFS Permissions' (an HDFS setting in the Cloudera UI). Unfortunately, now it keeps printing

 

INFO yarn.Client: Application report for application_1542027447234_0030 (state: ACCEPTED)
INFO yarn.Client: Application report for application_1542027447234_0030 (state: ACCEPTED)
INFO yarn.Client: Application report for application_1542027447234_0030 (state: ACCEPTED)
INFO yarn.Client: Application report for application_1542027447234_0030 (state: ACCEPTED)
INFO yarn.Client: Application report for application_1542027447234_0030 (state: ACCEPTED)
INFO yarn.Client: Application report for application_1542027447234_0030 (state: ACCEPTED)
INFO yarn.Client: Application report for application_1542027447234_0030 (state: ACCEPTED)

 

From my understanding, this problem happens when there are not enough resources, or when there is a problem in one of the configuration files, such as core-site.xml, hdfs-site.xml, mapred-site.xml or yarn-site.xml. There are no other applications running, so it is most probably a configuration problem. Which settings should I set in core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml? What I did was copy the files from /etc/spark/conf/ into spark/conf on my laptop. However, people have posted their configuration files in different forums and they are all different, so I do not understand which settings I should put in those files. Moreover, I looked into the log file and it just says

 

ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
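If it helps, from what I have read the state of the application and of the NodeManagers can be checked with the stock YARN CLI (nothing specific to my setup):

yarn node -list -all
yarn application -status application_1542027447234_0030

Also, the ResourceManager web UI (port 8088 by default) shows the available memory and vcores per node; if those are zero, the AM container can never be allocated and the application stays in ACCEPTED.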

 

Any idea why the application stays stuck in the ACCEPTED state?

 

Thanks
