
Run SparkR or an R package on my Cloudera 5.9 Spark

Expert Contributor

Hi,

 

I have a 3-node cluster running Cloudera 5.9 on CentOS 6.7. I need to connect my R packages (running on my laptop) to Spark, which runs in cluster mode on Hadoop.

 

However, if I try to connect my local R to the Hadoop Spark through sparklyr's spark_connect, it gives an error, because it searches for the Spark home on the laptop itself.

 

I googled and found that we can install SparkR and use R with Spark. However, I have a few questions:

 

  1. I have downloaded the tar file from https://amplab-extras.github.io/SparkR-pkg/. Do I just copy it directly to my Linux server and install it?
  2. Do I have to stop/delete my existing Spark, which is NOT standalone and uses YARN (i.e., it runs in cluster mode)? Or can SparkR just run on top of it if I install it on the server?
  3. Or do I have to run Spark in standalone mode (get the Spark gateways running and start the master/slaves using the scripts) and install the package from the Linux command line on top of that?
  4. If it gets installed, will I be able to access it through the CM UI?

Please help; I am new to this and really need guidance.

 

Thanks,

Shilpa

1 ACCEPTED SOLUTION

Expert Contributor

Thanks for the reply @srowen

 

The best way to install R and then install SparkR on top of it is described here: http://blog.clairvoyantsoft.com/2016/11/installing-sparkr-on-a-hadoop-cluster/

 

I was able to install them by following this link. It is really useful and up to date.

 

Thanks,

Shilpa


13 REPLIES

Master Collaborator

Generally speaking, you will need to have connectivity from your laptop to at least one machine in the cluster (the gateway), and have some local configuration for sparklyr that indicates where the cluster is. I haven't tried this with sparklyr, but for other R-Hadoop libraries like rhdfs, it means having a copy of the HADOOP_CONF_DIR files from the cluster locally. It also means you probably need the same version of Spark binaries locally as are on the cluster. This is challenging.

 

You may be better off running sparklyr directly on the edge/gateway node of the cluster. See https://blog.cloudera.com/blog/2016/09/introducing-sparklyr-an-r-interface-for-apache-spark/. Instead of installing Spark, point sparklyr to a non-local master like "yarn-client" to use the cluster.
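For illustration, a minimal sketch of that setup on a gateway node might look like the following (the parcel and configuration paths are assumptions based on a typical CDH install; adjust them to your layout):

library(sparklyr)

# Run this on a cluster gateway/edge node, where the cluster's own Spark
# binaries and Hadoop client configuration are already present.
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark")  # assumed parcel path
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")                # assumed config path

# "yarn-client" submits work to the cluster instead of starting a local Spark
sc <- spark_connect(master = "yarn-client", version = "1.6.0")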

 

SparkR is also something you can try to get working. You would probably need to use an upstream SparkR version that matches the CDH Spark you're using (1.x vs 2.x) and then just try to run ./bin/sparkR from its distribution.
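If you go the SparkR route, a minimal session against a matching upstream Spark 1.6 distribution would look roughly like this (the yarn-client master is an assumption, in line with the advice above; the function names are the Spark 1.x SparkR API):

# Launch via ./bin/sparkR from the unpacked distribution, or load the
# bundled SparkR package from R. Spark 1.x uses sparkR.init().
library(SparkR)
sc <- sparkR.init(master = "yarn-client")    # submit to YARN, not local mode
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)  # small built-in R data set
head(df)
sparkR.stop()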

 

Standalone mode isn't supported. None of these (SparkR, sparklyr) are supported by Cloudera, and so they have no relationship to CM. You should not modify your existing Spark service, and you shouldn't have to.

Expert Contributor

Hi @srowen,

 

Thanks for your reply. 

 

Regarding Sparklyr: 

I already went to the link you mentioned; it gives an example of how to connect to your local Spark, which I have been able to do. However, if I try to connect to my remote Spark cluster running on Cloudera, it gives an error.

 

library(sparklyr)

sc <- spark_connect(master = "spark://lnxmasternode01.centralus.cloudapp.azure.com:7077",
                    spark_home = "hdfs://40.122.210.251:8020/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark",
                    version = "1.6.0")

 

ERROR:

Created default hadoop bin directory under: C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hadoop
Error in start_shell(master = master, spark_home = spark_home, spark_version = version,  :
  SPARK_HOME directory 'hdfs://40.122.210.251:8020/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark' not found
In addition: Warning messages:
1: In dir.create(hivePath, recursive = TRUE) :
  cannot create dir 'C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020', reason 'Invalid argument'
2: In dir.create(hadoopBinPath, recursive = TRUE) :
  cannot create dir 'C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020', reason 'Invalid argument'
3: In file.create(to[okay]) :
  cannot create file 'C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hadoop\bin\winutils.exe', reason 'Invalid argument'
4: running command '"C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hadoop\bin\winutils.exe" chmod 777 "C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hive"' had status 127

 

Now, regarding SparkR:

 

My Spark version is 1.6.0. As I said, I have downloaded the SparkR package from https://amplab-extras.github.io/SparkR-pkg/. Do you think it is an old package and I should search for a newer one?

 

Once I have the package, do I just untar it on the NameNode, go to the bin directory, and execute it? Is that it?

 

To install R under the Spark home, I got the EPEL RPM and then tried to install R using yum, but it gives an error. I even tried some other RPMs, but they give errors too. Using the --skip-broken option does not work either. Please help:

[root@LnxMasterNode01 spark]# rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
Retrieving http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
warning: /var/tmp/rpm-tmp.XuRVi8: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Preparing...                ########################################### [100%]
   1:epel-release           ########################################### [100%]

 

[root@LnxMasterNode01 spark]# yum install R
Loaded plugins: fastestmirror, security
Setting up Install Process
Loading mirror speeds from cached hostfile
* epel: ftp.osuosl.org
Resolving Dependencies
--> Running transaction check
---> Package R.i686 0:2.13.0-2.el6.rf will be updated
---> Package R.x86_64 0:3.3.2-2.el5 will be an update
--> Processing Dependency: libRmath-devel = 3.3.2-2.el5 for package: R-3.3.2-2.el5.x86_64
--> Processing Dependency: R-devel = 3.3.2-2.el5 for package: R-3.3.2-2.el5.x86_64
--> Running transaction check
---> Package R-devel.x86_64 0:3.3.2-2.el5 will be installed
--> Processing Dependency: R-core-devel = 3.3.2-2.el5 for package: R-devel-3.3.2-2.el5.x86_64
---> Package libRmath-devel.x86_64 0:3.3.2-2.el5 will be installed
--> Processing Dependency: libRmath = 3.3.2-2.el5 for package: libRmath-devel-3.3.2-2.el5.x86_64
--> Running transaction check
---> Package R-core-devel.x86_64 0:3.3.2-2.el5 will be installed
--> Processing Dependency: R-core = 3.3.2-2.el5 for package: R-core-devel-3.3.2-2.el5.x86_64

...

 --> Processing Dependency: libgssapi.so.2()(64bit) for package: libRmath-3.3.2-2.el5.x86_64
---> Package ppl.x86_64 0:0.10.2-11.el6 will be installed
---> Package texlive-texmf-errata-dvips.noarch 0:2007-7.1.el6 will be installed
---> Package texlive-texmf-errata-fonts.noarch 0:2007-7.1.el6 will be installed
--> Finished Dependency Resolution
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
Requires: libtk8.4.so()(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
Requires: libtcl8.4.so()(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
Requires: libgssapi.so.2(libgssapi_CITI_2)(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
Requires: libRblas.so()(64bit)
Error: Package: libRmath-3.3.2-2.el5.x86_64 (epel)
Requires: libgssapi.so.2(libgssapi_CITI_2)(64bit)
Error: Package: libRmath-3.3.2-2.el5.x86_64 (epel)
Requires: libgssapi.so.2()(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
Requires: libgssapi.so.2()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
[root@LnxMasterNode01 spark]# 

 

I followed this link http://www.jason-french.com/blog/2013/03/11/installing-r-in-linux/ and it suggests the same. Looking forward to your reply. Am I doing something wrong here?

 

Thanks,

Shilpa

Expert Contributor

Hi @srowen

 

The issue related to installing R using the EPEL RPM is resolved.

 

 

I guess I previously installed the wrong EPEL release package on this machine. To resolve it, I did:

[root@LnxMasterNode01 spark]# yum clean all
[root@LnxMasterNode01 spark]# yum install epel-release
[root@LnxMasterNode01 spark]# yum install R

 

Now I am able to run R; however, I cannot see it in my Spark home directory, nor does spark/bin have sparkR.

 

[root@LnxMasterNode01 spark]# ll
total 36276
drwxr-xr-x 3 root root     4096 Oct 21 05:00 assembly
drwxr-xr-x 2 root root     4096 Oct 21 05:00 bin
drwxr-xr-x 2 root root     4096 Oct 21 05:00 cloudera
lrwxrwxrwx 1 root root       15 Nov 25 16:01 conf -> /etc/spark/conf
-rw-r--r-- 1 root root    12232 Jan  4 16:20 epel-release-5-4.noarch.rpm
drwxr-xr-x 3 root root     4096 Oct 21 05:00 examples
drwxr-xr-x 2 root root     4096 Oct 21 05:08 lib
-rw-r--r-- 1 root root    17352 Oct 21 05:00 LICENSE
drwxr-xr-x 2 root root     4096 Jan  2 18:09 logs
-rw-r--r-- 1 root root    23529 Oct 21 05:00 NOTICE
drwxr-xr-x 6 root root     4096 Oct 21 05:00 python
-rw-r--r-- 1 root root 37053596 Jan  4 17:16 R-2.13.0-2.el6.rf.i686.rpm
-rw-r--r-- 1 root root        0 Oct 21 05:00 RELEASE
drwxr-xr-x 2 root root     4096 Oct 21 05:00 sbin
lrwxrwxrwx 1 root root       19 Nov 25 16:01 work -> /var/run/spark/work
[root@LnxMasterNode01 spark]#

 

Is this the same as SparkR? Please guide.

 

[root@LnxMasterNode01 ~]# R

R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> q()

 

My other question, about sparklyr, still stands as before. Please guide.

Master Collaborator

Generally, you won't be able to run R on your laptop/workstation and connect it remotely to the cluster. It's possible, but would require more setup and configuration, so I would avoid this deployment for now. Instead, run R on a cluster gateway node.

 

You are using a standalone master, which isn't supported anyway. You would want to use YARN.

 

Although you should be able to use your own copy of SparkR 1.6 with the cluster, I don't know if it works. It's not supported. sparklyr is another option, which at least is supported by RStudio.
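Putting that together, a corrected version of the earlier spark_connect call, run on a gateway node rather than the laptop, would look roughly like this (the local parcel path is taken from the error output above; treat it as an assumption about where Spark lives on that node):

library(sparklyr)

# spark_home must be a local filesystem path on the machine running R,
# not an hdfs:// URL, and the master should be YARN rather than a
# standalone spark:// URL on this cluster.
sc <- spark_connect(master     = "yarn-client",
                    spark_home = "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark",
                    version    = "1.6.0")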


Explorer

http://blog.clairvoyantsoft.com/2016/11/installing-sparkr-on-a-hadoop-cluster/

 

worked fine for me as well. Just a few things I had to do extra:

 

1. In the testing section, when I typed sparkR, it errored out. It seems you'll have to create links for that to work. In my case I had a CDH parcel installation, so I created the two links below, and it worked fine thereafter:

 

# cp /usr/bin/sparkR /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/bin/
# rm /usr/bin/sparkR
# cd /etc/alternatives/
# ln -s /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/bin/sparkR sparkR
# cd /usr/bin
# ln -s /etc/alternatives/sparkR sparkR
# sparkR
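(For context: CDH appears to route client commands through the alternatives system, which is why the working chain above is /usr/bin/sparkR -> /etc/alternatives/sparkR -> the parcel's bin/sparkR.)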

avatar
Explorer

Can you provide a source for the link you posted, as it no longer works? Part of the issue with the internet in general, I guess 🙂

Master Collaborator

Explorer

Thanks!  That works 🙂