Created on 01-03-2017 10:41 AM - edited 09-16-2022 03:52 AM
Hi,
I have a 3-node cluster running Cloudera 5.9 on CentOS 6.7. I need to connect the R packages running on my laptop to Spark running in cluster mode on Hadoop.
However, when I try to connect my local R to the cluster's Spark through a sparklyr connect, it gives an error, because it looks for the Spark home on the laptop itself.
I googled and found that we can install SparkR and use R with Spark. However, I have a few questions about this.
Please help; I am new to this and really need guidance.
Thanks,
Shilpa
Created 01-04-2017 04:30 AM
Generally speaking, you will need to have connectivity from your laptop to at least one machine in the cluster (the gateway), and have some local configuration for sparklyr that indicates where the cluster is. I haven't tried this with sparklyr, but for other R-Hadoop libraries like rhdfs, it means having a copy of the HADOOP_CONF_DIR files from the cluster locally. It also means you probably need the same version of Spark binaries locally as are on the cluster. This is challenging.
You may be better off running sparklyr directly on the edge/gateway node of the cluster. See https://blog.cloudera.com/blog/2016/09/introducing-sparklyr-an-r-interface-for-apache-spark/ Instead of installing Spark, point it to a non-local master like "yarn-client" to use the cluster.
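For example, a minimal sketch (untested here; it assumes R runs on the gateway node, that Spark is the CDH 5.9 parcel at the usual path, and that the Hadoop client configs live in /etc/hadoop/conf):
library(sparklyr)
# Let Spark find the cluster configuration (assumed default location).
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")
# Use the cluster's own Spark install as spark_home, and YARN as the master
# instead of local mode or a standalone master.
sc <- spark_connect(master = "yarn-client",
                    spark_home = "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark",
                    version = "1.6.0")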
SparkR is also something you can try to get working. You would probably need to use an upstream SparkR version that matches the CDH Spark you're using (1.x vs 2.x) and then just try to run ./bin/sparkR from its distribution.
Standalone mode isn't supported. Neither SparkR nor sparklyr is supported by Cloudera, so they have no relationship to Cloudera Manager. You should not modify your existing Spark service, and shouldn't have to.
Created on 01-04-2017 09:31 AM - edited 01-04-2017 04:03 PM
Hi @srowen,
Thanks for your reply.
Regarding Sparklyr:
I already went to the link you mentioned; it gives an example of how to connect to your local Spark, which I have been able to do. However, when I try to connect to my remote Spark cluster running on Cloudera, it gives an error:
library(sparklyr)
sc <- spark_connect(master = "spark://lnxmasternode01.centralus.cloudapp.azure.com:7077",
spark_home = "hdfs://40.122.210.251:8020/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark",
version = "1.6.0")
ERROR:
Created default hadoop bin directory under: C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hadoop
Error in start_shell(master = master, spark_home = spark_home, spark_version = version, :
SPARK_HOME directory 'hdfs://40.122.210.251:8020/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/spark' not found
In addition: Warning messages:
1: In dir.create(hivePath, recursive = TRUE) :
cannot create dir 'C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020', reason 'Invalid argument'
2: In dir.create(hadoopBinPath, recursive = TRUE) :
cannot create dir 'C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020', reason 'Invalid argument'
3: In file.create(to[okay]) :
cannot create file 'C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hadoop\bin\winutils.exe', reason 'Invalid argument'
4: running command '"C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hadoop\bin\winutils.exe" chmod 777 "C:\Users\diegot\Desktop\hdfs:\40.122.210.251:8020\opt\cloudera\parcels\CDH-5.9.0-1.cdh5.9.0.p0.23\lib\spark\tmp\hive"' had status 127
Now, regarding SparkR:
My Spark version is 1.6.0. As I said, I have downloaded the SparkR package from https://amplab-extras.github.io/SparkR-pkg/ Do you think it is an old package and I should look for a newer one?
Once I have the package, do I just untar it on the NameNode, go to the bin directory, and execute it? Is that it?
To install R in the Spark home, I got the EPEL RPM and then tried to install R using yum, but it gives an error. I tried some other RPMs as well, but they give errors too. Using the --skip-broken option does not work either. Please help:
[root@LnxMasterNode01 spark]# rpm -Uvh http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
Retrieving http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
warning: /var/tmp/rpm-tmp.XuRVi8: Header V3 RSA/SHA256 Signature, key ID 0608b895: NOKEY
Preparing... ########################################### [100%]
1:epel-release ########################################### [100%]
[root@LnxMasterNode01 spark]# yum install R
Loaded plugins: fastestmirror, security
Setting up Install Process
Loading mirror speeds from cached hostfile
* epel: ftp.osuosl.org
Resolving Dependencies
--> Running transaction check
---> Package R.i686 0:2.13.0-2.el6.rf will be updated
---> Package R.x86_64 0:3.3.2-2.el5 will be an update
--> Processing Dependency: libRmath-devel = 3.3.2-2.el5 for package: R-3.3.2-2.el5.x86_64
--> Processing Dependency: R-devel = 3.3.2-2.el5 for package: R-3.3.2-2.el5.x86_64
--> Running transaction check
---> Package R-devel.x86_64 0:3.3.2-2.el5 will be installed
--> Processing Dependency: R-core-devel = 3.3.2-2.el5 for package: R-devel-3.3.2-2.el5.x86_64
---> Package libRmath-devel.x86_64 0:3.3.2-2.el5 will be installed
--> Processing Dependency: libRmath = 3.3.2-2.el5 for package: libRmath-devel-3.3.2-2.el5.x86_64
--> Running transaction check
---> Package R-core-devel.x86_64 0:3.3.2-2.el5 will be installed
--> Processing Dependency: R-core = 3.3.2-2.el5 for package: R-core-devel-3.3.2-2.el5.x86_64.
.
.
--> Processing Dependency: libgssapi.so.2()(64bit) for package: libRmath-3.3.2-2.el5.x86_64
---> Package ppl.x86_64 0:0.10.2-11.el6 will be installed
---> Package texlive-texmf-errata-dvips.noarch 0:2007-7.1.el6 will be installed
---> Package texlive-texmf-errata-fonts.noarch 0:2007-7.1.el6 will be installed
--> Finished Dependency Resolution
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
Requires: libtk8.4.so()(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
Requires: libtcl8.4.so()(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
Requires: libgssapi.so.2(libgssapi_CITI_2)(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
Requires: libRblas.so()(64bit)
Error: Package: libRmath-3.3.2-2.el5.x86_64 (epel)
Requires: libgssapi.so.2(libgssapi_CITI_2)(64bit)
Error: Package: libRmath-3.3.2-2.el5.x86_64 (epel)
Requires: libgssapi.so.2()(64bit)
Error: Package: R-core-3.3.2-2.el5.x86_64 (epel)
Requires: libgssapi.so.2()(64bit)
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
[root@LnxMasterNode01 spark]#
I followed this link http://www.jason-french.com/blog/2013/03/11/installing-r-in-linux/ and it suggests the same. Looking forward to your reply. Am I doing something wrong here?
Thanks,
Shilpa
Created on 01-04-2017 04:57 PM - edited 01-04-2017 04:58 PM
Hi @srowen
The issue related to installing R using the EPEL RPM is resolved.
I guess I had previously installed the wrong EPEL release package on this machine. To resolve it, I did:
[root@LnxMasterNode01 spark]# yum clean all
[root@LnxMasterNode01 spark]# yum install epel-release
[root@LnxMasterNode01 spark]# yum install R
Now I am able to run R; however, I cannot see it in my Spark home directory, nor does spark/bin contain sparkR.
[root@LnxMasterNode01 spark]# ll
total 36276
drwxr-xr-x 3 root root 4096 Oct 21 05:00 assembly
drwxr-xr-x 2 root root 4096 Oct 21 05:00 bin
drwxr-xr-x 2 root root 4096 Oct 21 05:00 cloudera
lrwxrwxrwx 1 root root 15 Nov 25 16:01 conf -> /etc/spark/conf
-rw-r--r-- 1 root root 12232 Jan 4 16:20 epel-release-5-4.noarch.rpm
drwxr-xr-x 3 root root 4096 Oct 21 05:00 examples
drwxr-xr-x 2 root root 4096 Oct 21 05:08 lib
-rw-r--r-- 1 root root 17352 Oct 21 05:00 LICENSE
drwxr-xr-x 2 root root 4096 Jan 2 18:09 logs
-rw-r--r-- 1 root root 23529 Oct 21 05:00 NOTICE
drwxr-xr-x 6 root root 4096 Oct 21 05:00 python
-rw-r--r-- 1 root root 37053596 Jan 4 17:16 R-2.13.0-2.el6.rf.i686.rpm
-rw-r--r-- 1 root root 0 Oct 21 05:00 RELEASE
drwxr-xr-x 2 root root 4096 Oct 21 05:00 sbin
lrwxrwxrwx 1 root root 19 Nov 25 16:01 work -> /var/run/spark/work
[root@LnxMasterNode01 spark]#
Is it the same as SparkR? Please guide.
[root@LnxMasterNode01 ~]# R
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> q()
My other question, about sparklyr, still stands as before. Please guide.
Created 01-06-2017 03:26 AM
Generally, you won't be able to run R on your laptop/workstation and connect it remotely to the cluster. It's possible, but would require more setup and configuration, so I would avoid this deployment for now. Instead, run R on a cluster gateway node.
You are using a standalone master, which isn't supported anyway. You would want to use YARN.
Although you should be able to use your own copy of SparkR 1.6 with the cluster, I don't know if it works. It's not supported. sparklyr is another option, which at least is supported by RStudio.
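If you try SparkR 1.6, the session setup would look roughly like this (a sketch only, not verified against CDH; the path is a hypothetical location for an upstream Spark 1.6 distribution you unpacked yourself, since the spark/bin listing you posted shows the CDH parcel ships no sparkR):
# Plain R on the gateway node; SPARK_HOME must be a local directory,
# here an assumed upstream Spark 1.6 unpack location.
Sys.setenv(SPARK_HOME = "/opt/spark-1.6.0-bin-hadoop2.6")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

sc <- sparkR.init(master = "yarn-client")    # Spark 1.6 API
sqlContext <- sparkRSQL.init(sc)
df <- createDataFrame(sqlContext, faithful)  # built-in R data set as a smoke test
head(df)
sparkR.stop()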
Created 01-11-2017 04:20 PM
Thanks for the reply @srowen
The best way to install R and then install SparkR on top of it is here: http://blog.clairvoyantsoft.com/2016/11/installing-sparkr-on-a-hadoop-cluster/
I was able to install them by following this link. It is really useful and up to date.
Thanks,
Shilpa
Created 01-25-2017 01:26 PM
http://blog.clairvoyantsoft.com/2016/11/installing-sparkr-on-a-hadoop-cluster/
worked fine for me as well. Just a few extra things I had to do:
1. In the testing section, when I typed sparkR, it errored out. It seems you have to create links for that to work. In my case I had a CDH parcel installation, so I created the two links below, and it worked fine thereafter:
# cp /usr/bin/sparkR /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/bin/
# rm /usr/bin/sparkR
# cd /etc/alternatives/
# ln -s /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/bin/sparkR sparkR
# cd /usr/bin
# ln -s /etc/alternatives/sparkR sparkR
# sparkR
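Once sparkR launches, a quick sanity check (assuming that, as in stock Spark 1.6, the shell pre-creates sc and sqlContext):
df <- createDataFrame(sqlContext, faithful)  # distribute a built-in R data frame
head(df)                                     # pull a few rows back to the driver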
Created 04-24-2018 02:50 PM
Can you provide the source for the link you posted, as it no longer works...part of the issue with the internet in general, I guess 🙂
Created 04-25-2018 08:49 AM
Created 04-25-2018 03:11 PM
Thanks! That works 🙂