failure on spark_connect() for sparklyr R package use on a Cloudera CDH 5.10.0 Hadoop cluster

New Contributor

 

Hi folks,
 
I am trying to get the sparklyr R package to work with Spark on a local Linux cluster. It works fine under Spark on my laptop, but now I want to parallelize R code on an actual cluster. I posted the message below as a new issue on the sparklyr GitHub page over a week ago, and I got one reply, which, alas, did not succeed in solving my problem. So I am posting again here, in the hope that somebody reading this can provide guidance. I need to get R working in Spark on our cluster, and I am eager to get going if I can get out of the gate, but first I have to get past this spark_connect() problem. Please see below.
 
   - Ron Taylor
   Pacific Northwest National Laboratory
   
%%%%%%%%%%%%%
 
This is my earlier message, posted on the sparklyr GitHub page:
 
Hello folks,
I am trying to use sparklyr for the first time on a Hadoop cluster. I have used R with sparklyr on a "local" copy of Spark on my Mac laptop, but this is the first time that I am trying to run it as a "yarn-client" on a true cluster, to actually get some parallelization out of sparklyr use.
 
We have a small Linux cluster at our lab running Cloudera CDH 5.10.0. When I try to do the spark_connect() from an R session started on the command line on the Hadoop cluster's name (master) node, I get the same message as in an earlier, closed issue on the sparklyr GitHub site.
 
That is, my error message is:
 
"Failed while connecting to sparklyr to port (8880) for sessionid (2423): Gateway in port (8880) did not respond."
 
I am thus reopening that issue here, since I still need help even after reading that older GitHub issue (#394).
 
At the bottom is the record of my R session on the Hadoop cluster's name node, with all the details that I can think of printed out to the screen.
 
I note that the version of Spark used by CDH is 1.6.0, which is different from what is in spark_home_dir() (1.6.2). I cannot seem to change spark_home_dir() by setting SPARK_HOME to the Spark location used by the CDH distribution; spark_home_dir() does not get altered by my setting of SPARK_HOME (as you can see below). So one question (perhaps the critical question?) is: how do I force sparklyr to connect to the Spark version being used by the CDH distribution?
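For reference, I understand that spark_connect() also accepts a spark_home argument, which should take precedence over whatever spark_home_dir() reports. A minimal sketch of what I have in mind (the parcel path is our cluster's, and the exact lib/spark layout under it is an assumption on my part):
 
library(sparklyr)
 
# Passing spark_home explicitly, rather than relying on spark_home_dir(),
# should point sparklyr at the cluster's own Spark install.
# (Paths below are assumptions specific to our CDH parcel.)
sc <- spark_connect(
  master     = "yarn-client",
  version    = "1.6.0",
  spark_home = "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark"
)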
 
The 5.10.0 distribution has Spark 1.6.0, not 1.6.2.
 
So I am trying to tell the sparklyr code to use the Spark 1.6.0 distribution, which is located here:
 
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-shell
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark
 
and so I was trying to set SPARK_HOME as follows:
 
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41")
 
Sys.getenv("SPARK_HOME")
[1] "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
 
However! I note that part of the error message (see bottom) says that the correct path to spark-submit was used:
"Path: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit"
 
So maybe sparklyr is indeed accessing the Spark 1.6.0 distribution in the cluster as it should, and the problem lies elsewhere?
 
One other note: there was an earlier version of sparklyr installed by support here on the Hadoop name node. I have bypassed that and installed the latest version of sparklyr (0.5.1) into
 
/people/rtaylor/Rpackages
 
as you can see below.
 
I would very much appreciate some guidance to get me over this initial hurdle.
  • Ron Taylor
    Pacific Northwest National Laboratory

%%%%%%%%%%%%%%%%%%%%%%%
 
Here is the captured screen output from the R session with my failed spark_connect() call (the R session was run at the Linux command prompt on the Hadoop cluster's name node):
 
[rtaylor@bigdatann Rwork]$ R
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
Type 'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
 
ls()
character(0)
 
(.packages())
[1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
[7] "base"
 
install.packages("sparklyr", lib="/people/rtaylor/Rpackages/")
--- Please select a CRAN mirror for use in this session ---
trying URL 'https://cran.cnr.berkeley.edu/src/contrib/sparklyr_0.5.2.tar.gz'
Content type 'application/x-gzip' length 732806 bytes (715 KB)
==================================================
downloaded 715 KB
* installing *source* package ‘sparklyr’ ...
** package ‘sparklyr’ successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (sparklyr)
The downloaded source packages are in
‘/tmp/RtmpQUB4IE/downloaded_packages’
 
library(sparklyr, lib.loc="/people/rtaylor/Rpackages/")
 
(.packages())
[1] "sparklyr" "stats" "graphics" "grDevices" "utils" "datasets"
[7] "methods" "base"
 
 
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Workstation release 6.4 (Santiago)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sparklyr_0.5.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 withr_1.0.2 digest_0.6.12 dplyr_0.5.0
[5] rprojroot_1.2 assertthat_0.1 rappdirs_0.3.1 R6_2.2.0
[9] jsonlite_1.2 DBI_0.5-1 backports_1.0.5 magrittr_1.5
[13] httr_1.2.1 config_0.2 tools_3.3.2 parallel_3.3.2
[17] yaml_2.1.14 base64enc_0.1-3 tcltk_3.3.2 tibble_1.2
 
Sys.getenv("JAVA_HOME")
[1] "/usr/java/latest"
 
spark_installed_versions()
  spark hadoop                       dir
1 1.6.2    2.6 spark-1.6.2-bin-hadoop2.6
 
spark_home_dir()
[1] "/people/rtaylor/.cache/spark/spark-1.6.2-bin-hadoop2.6"
R.home(component = "home")
[1] "/share/apps/R/3.3.2/lib64/R"
 
path.expand("~")
[1] "/people/rtaylor"
 
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41")
 
Sys.getenv("SPARK_HOME")
[1] "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
 
spark_home_dir()
[1] "/people/rtaylor/.cache/spark/spark-1.6.2-bin-hadoop2.6"
 
ls()
character(0)
 
config <- spark_config()
 
ls()
[1] "config"
 
 
And, finally, here is the actual spark_connect() command where I see the failure:
 
sc <- spark_connect(master = "yarn-client", config = config, version = "1.6.0")
Error in force(code) :
Failed while connecting to sparklyr to port (8880) for sessionid (2423): Gateway in port (8880) did not respond.
Path: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit
Parameters: --class, sparklyr.Backend, --jars, '/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/spark-csv_2.11-1.3.0.jar','/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/commons-csv-1.1.jar','/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/univocity-parsers-1.5.1.jar', '/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 2423
---- Output Log ----
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: No such file or directory
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: exec: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: cannot execute: No such file or directory
---- Error Log ----
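Looking at that output log, spark-submit resolves spark-class relative to the SPARK_HOME I set, and there is no bin/spark-class directly under the parcel root. A quick check along those lines that I can run from R (just base R file tests; that the parcel's Spark actually lives under lib/spark is my assumption here):
 
# Does the SPARK_HOME I set actually contain the bin/spark-class
# script that spark-submit is trying to exec?
parcel <- "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
file.exists(file.path(parcel, "bin", "spark-class"))                  # FALSE, per the log
file.exists(file.path(parcel, "lib", "spark", "bin", "spark-class"))  # presumably TRUE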
 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 
And below you can see the results of a similar R session, which I tried today after hearing back from Mr. Ariga via GitHub on my posted issue and trying what he suggested. Again, it failed. I still need help.

---------- Forwarded message ----------
From: Ronald Taylor
Date: Sun, Mar 19, 2017 at 5:23 PM
Subject: Re: [rstudio/sparklyr] problem with spark_connect() using sparklyr on a Cloudera CDH 5.10.0 Hadoop cluster (#534)
To: rstudio/sparklyr <reply+00525e71db98e9f7adda4ea9e8bcc457e983afabaf2725ba92cf0000000114e3a7e692a169ce0ca76ebc@reply.git...>

Hi Aki,

Thanks for the guidance, but  I still cannot get spark_connect() to work. Very disappointing.

You can see the screen output for my connect attempts below. Also, I checked out the Cloudera web page that you listed, but I don't see anything there that usefully supplements your email to me.

And so I am still stuck. Can you (or anybody else on the list) think of anything else I can try? Spark 1.6.0 is running fine on the Cloudera cluster that I am trying to use, according to the Cloudera Manager. So spark_connect() *should* work, but it is not.
  - Ron

Screen output:

>
> config <- spark_config()
>

> spark_home <- "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
>
> spark_home
[1] "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
>
>
> spark_version <- "1.6.0"
>
> spark_version
[1] "1.6.0"
>
 
> sc <- spark_connect(master = "yarn-client", config = config, version = spark_version, spark_home=spark_home)
Error in force(code) :
  Failed while connecting to sparklyr to port (8880) for sessionid (8451): Gateway in port (8880) did not respond.
    Path: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit
    Parameters: --class, sparklyr.Backend, --jars, '/people/rtaylor/Rpackages/sparklyr/java/spark-csv_2.11-1.3.0.jar','/people/rtaylor/Rpackages/sparklyr/java/commons-csv-1.1.jar','/people/rtaylor/Rpackages/sparklyr/java/univocity-parsers-1.5.1.jar', '/people/rtaylor/Rpackages/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 8451


---- Output Log ----
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: No such file or directory
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: exec: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: cannot execute: No such file or directory

---- Error Log ----
>


Another attempt, after I set some config values, as suggested:


>
> config$spark.driver.cores<- 4
>
> config$spark.executor.cores<- 4
>
> config$executor.memory <- "4G"
>
> config
$sparklyr.cores.local
[1] 16

$spark.sql.shuffle.partitions.local
[1] 16

$spark.env.SPARK_LOCAL_IP.local
[1] "127.0.0.1"

$sparklyr.csv.embedded
[1] "^1.*"

$`sparklyr.shell.driver-class-path`
[1] ""

$spark.driver.cores
[1] 4

$spark.executor.cores
[1] 4

$executor.memory
[1] "4G"

attr(,"config")
[1] "default"
attr(,"file")
[1] "/people/rtaylor/Rpackages/sparklyr/conf/config-template.yml"
>
>
> spark_version
[1] "1.6.0"
>
> spark_home
[1] "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
>
> spark_connect(master = "yarn-client", config = config, version = spark_version, spark_home=spark_home)
Error in force(code) :
  Failed while connecting to sparklyr to port (8880) for sessionid (533): Gateway in port (8880) did not respond.
    Path: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit
    Parameters: --class, sparklyr.Backend, --jars, '/people/rtaylor/Rpackages/sparklyr/java/spark-csv_2.11-1.3.0.jar','/people/rtaylor/Rpackages/sparklyr/java/commons-csv-1.1.jar','/people/rtaylor/Rpackages/sparklyr/java/univocity-parsers-1.5.1.jar', '/people/rtaylor/Rpackages/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 533


---- Output Log ----
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: No such file or directory
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: exec: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: cannot execute: No such file or directory

---- Error Log ----
>

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 
Guidance that I got from Mr. Ariga in response to my GitHub posting:

On Fri, Mar 17, 2017 at 6:34 AM, Aki Ariga wrote:

I ran Spark 1.6.2 on CDH 5.10 as follows:

config <- spark_config()
config$spark.driver.cores   <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
spark_home <- "/opt/cloudera/parcels/CDH/lib/spark"
spark_version <- "1.6.2"
#spark_home <- "/opt/cloudera/parcels/SPARK2/lib/spark2"
#spark_version <- "2.0.0"
sc <- spark_connect(master="yarn-client", version=spark_version, config=config, spark_home=spark_home)

See also http://blog.cloudera.com/blog/2017/02/analyzing-us-flight-data-on-amazon-s3-with-sparklyr-and-apache...



 %%%%%%%%
 

Re: failure on spark_connect() for sparklyr R package use on a Cloudera CDH 5.10.0 Hadoop cluster

Cloudera Employee

 

Hi, I'm Aki. Thanks for your post.
 
According to your output message, your spark_home setting is wrong:
 
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: No such file or directory
`spark-class` should be found at `/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/bin/spark-class`. So, you should set `spark_home` to
 
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark
 
not
 
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib
 
or you can just use
 
/opt/cloudera/parcels/CDH/lib/spark
 
So, you can set the environment variable as follows:
 
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark")

 

Or, could you try copying and pasting the following code? As far as I can see from your output, you didn't try my code exactly as given. I don't know whether the difference in Spark patch versions would affect the connection, but I suspect you can replace spark_version with "1.6.0".

 

config <- spark_config()
config$spark.driver.cores   <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
spark_home <- "/opt/cloudera/parcels/CDH/lib/spark"
spark_version <- "1.6.2"
sc <- spark_connect(master="yarn-client", version=spark_version, config=config, spark_home=spark_home)
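If the connection succeeds, you can confirm which Spark build sparklyr actually picked up and then shut the session down cleanly (a minimal follow-up sketch, assuming the sc from the code above):
 
# Confirm the connected Spark build, then close the gateway.
spark_version(sc)     # should report the 1.6.x build from the CDH parcel
spark_disconnect(sc)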