failure on spark_connect() for sparklyr R package use on a Cloudera CDH 5.10.0 Hadoop cluster



Hi folks,
I am trying to get the sparklyr R package to work with Spark on a local Linux cluster. It works fine under Spark on my laptop, but now I want to parallelize R code on an actual cluster. I posted the message below as a new issue on the sparklyr page at GitHub over a week ago, and I got one reply, which, alas, did not succeed in solving my problem. So I am posting again here, in the hope that somebody reading this can provide guidance. I need to get R working in Spark on our cluster. I am eager to get going, if I can get out of the gate, but I have to get past this spark_connect() problem. Please see below.
   - Ron Taylor
   Pacific Northwest National Laboratory
This is my earlier msg posted at the sparklyr page at github:
Hello folks,
I am trying to use sparklyr for the first time on a Hadoop cluster. I have used R with sparklyr on a "local" copy of Spark on my Mac laptop, but this is the first time that I am trying to run it as a "yarn-client" on a true cluster, to actually get some parallelization out of sparklyr use.
We have a small Linux cluster at our lab running Cloudera CDH 5.10.0. When I try to do the spark_connect() from an R session started on a command line on the Hadoop cluster's name (master) node, I get the same msg as in an earlier closed issue on the sparklyr github site.
That is, my error msg is:
"Failed while connecting to sparklyr to port (8880) for sessionid (2423): Gateway in port (8880) did not respond."
I am thus reopening that issue here, since I still need help even after reading that older github issue (#394).
At bottom is the record of my R session on the Hadoop cluster's name node, with all the details that I can think of printed out to the screen.
I note that the version of Spark used by CDH is 1.6.0, which is different than what is in spark_home_dir (1.6.2). I cannot seem to change the spark_home_dir by setting SPARK_HOME to the Spark location used by the CDH distribution. spark_home_dir does not get altered by my setting of SPARK_HOME (as you can see below). So one question (perhaps the critical question?) is: how do I force sparklyr to connect to the Spark version being used by the CDH distribution?
The 5.10.0 distribution has Spark 1.6.0, not 1.6.2.
So I am trying to tell sparklyr code to use the Spark 1.6.0 distribution that is located here:
and so I was trying to set SPARK_HOME as follows:
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41")
[1] "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
However! I note that part of the error msg (see bottom) says that the correct path was used to spark-submit:
"Path: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit"
So maybe sparklyr is indeed accessing the Spark 1.6.0 distribution as it should in the cluster, and the problem lies elsewhere??
One other note: there was an earlier version of sparklyr installed by support here on the Hadoop name node. I have bypassed that, installed the latest version of sparklyr (0.5.1) into
as you can see below.
Would very much appreciate some guidance to get me over this initial hurdle.
   - Ron Taylor
   Pacific Northwest National Laboratory

Here is the captured screen output from the R session with my failed spark_connect() call  (with the R session being done at the Linux command prompt on the Hadoop cluster namenode):
[rtaylor@bigdatann Rwork]$ R
R version 3.3.2 (2016-10-31) -- "Sincere Pumpkin Patch"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
[1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
[7] "base"
install.packages("sparklyr", lib="/people/rtaylor/Rpackages/")
--- Please select a CRAN mirror for use in this session ---
trying URL ''
Content type 'application/x-gzip' length 732806 bytes (715 KB)
downloaded 715 KB
* installing *source* package ‘sparklyr’ ...
** package ‘sparklyr’ successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (sparklyr)
The downloaded source packages are in
library(sparklyr, lib.loc="/people/rtaylor/Rpackages/")
[1] "sparklyr" "stats" "graphics" "grDevices" "utils" "datasets"
[7] "methods" "base"
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Workstation release 6.4 (Santiago)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sparklyr_0.5.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 withr_1.0.2 digest_0.6.12 dplyr_0.5.0
[5] rprojroot_1.2 assertthat_0.1 rappdirs_0.3.1 R6_2.2.0
[9] jsonlite_1.2 DBI_0.5-1 backports_1.0.5 magrittr_1.5
[13] httr_1.2.1 config_0.2 tools_3.3.2 parallel_3.3.2
[17] yaml_2.1.14 base64enc_0.1-3 tcltk_3.3.2 tibble_1.2
[1] "/usr/java/latest"
spark hadoop dir
1 1.6.2 2.6 spark-1.6.2-bin-hadoop2.6
[1] "/people/rtaylor/.cache/spark/spark-1.6.2-bin-hadoop2.6"
R.home(component = "home")
[1] "/share/apps/R/3.3.2/lib64/R"
[1] "/people/rtaylor"
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41")
[1] "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
[1] "/people/rtaylor/.cache/spark/spark-1.6.2-bin-hadoop2.6"
config <- spark_config()
[1] "config"
And, finally, here is the actual spark_connect() command where I see the failure:
sc <- spark_connect(master = "yarn-client", config = config, version = "1.6.0")
Error in force(code) :
Failed while connecting to sparklyr to port (8880) for sessionid (2423): Gateway in port (8880) did not respond.
Path: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit
Parameters: --class, sparklyr.Backend, --jars, '/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/spark-csv_2.11-1.3.0.jar','/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/commons-csv-1.1.jar','/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/univocity-parsers-1.5.1.jar', '/share/apps/R/3.3.2/lib64/R/library/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 2423
---- Output Log ----
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: No such file or directory
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: exec: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: cannot execute: No such file or directory
---- Error Log ----
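
A note on reading the output log above: both lines name the file that the launcher script tried to run and could not find. The sketch below uses base R only. It pulls that path out of the two log lines and also shows the directory the launcher script itself was run from. The variable names are illustrative, and treating the launcher's own directory as the place where spark-class lives is an inference from the log text, not something confirmed on this cluster.

```r
# The two lines from the "Output Log" section, copied verbatim.
log_lines <- c(
  "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: No such file or directory",
  "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: exec: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: cannot execute: No such file or directory"
)

# Path the shell reported as missing (identical in both lines).
pattern <- "^.*: line [0-9]+: (?:exec: )?(.+?): (?:cannot execute: )?No such file or directory$"
missing_path <- unique(sub(pattern, "\\1", log_lines, perl = TRUE))

# Script that actually ran: everything before ": line 27: ...".
launcher <- sub(": line [0-9]+: .*$", "", log_lines[1])

# Two levels up from the launcher (bin/ -> Spark directory). Assumption:
# spark-class sits next to the real spark-submit script.
suggested_home <- dirname(dirname(launcher))

cat("missing file   :", missing_path, "\n")
cat("launcher script:", launcher, "\n")
cat("candidate home :", suggested_home, "\n")
```

The missing file is looked up under the directory passed as the Spark home, while the script that ran sits under lib/spark inside that directory, which hints that the Spark home may need to point one level deeper.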
And below you can see the results of a similar R session, which I tried today after hearing back from Mr. Ariga via GitHub on my posted issue and trying what he suggested. Again, I failed. I still need help.

---------- Forwarded message ----------
From: Ronald Taylor
Date: Sun, Mar 19, 2017 at 5:23 PM
Subject: Re: [rstudio/sparklyr] problem with spark_connect() using sparklyr on a Cloudera CDH 5.10.0 Hadoop cluster (#534)
To: rstudio/sparklyr <reply+00525e71db98e9f7adda4ea9e8bcc457e983afabaf2725ba92cf0000000114e3a7e692a169ce0ca76ebc@reply.git...>

Hi Aki,

Thanks for the guidance, but I still cannot get spark_connect() to work. Very disappointing.

You can see the screen output for my connect attempts below. Also, I checked out the Cloudera web page that you listed, but I don't see anything there that usefully supplements your email to me.

And so I am still stuck. Can you (or anybody else on the list) think of anything else I can try? Spark 1.6.0 is running fine on the Cloudera cluster that I am trying to use, according to the Cloudera Manager. So the spark_connect() *should* work, but does not.
  - Ron

Screen output:

> config <- spark_config()

> spark_home <- "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
> spark_home
[1] "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
> spark_version <- "1.6.0"
> spark_version
[1] "1.6.0"
> sc <- spark_connect(master = "yarn-client", config = config, version = spark_version, spark_home=spark_home)
Error in force(code) :
  Failed while connecting to sparklyr to port (8880) for sessionid (8451): Gateway in port (8880) did not respond.
    Path: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit
    Parameters: --class, sparklyr.Backend, --jars, '/people/rtaylor/Rpackages/sparklyr/java/spark-csv_2.11-1.3.0.jar','/people/rtaylor/Rpackages/sparklyr/java/commons-csv-1.1.jar','/people/rtaylor/Rpackages/sparklyr/java/univocity-parsers-1.5.1.jar', '/people/rtaylor/Rpackages/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 8451

---- Output Log ----
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: No such file or directory
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: exec: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: cannot execute: No such file or directory

---- Error Log ----

Another attempt, after I set some config values, as suggested:

> config$spark.driver.cores<- 4
> config$spark.executor.cores<- 4
> config$executor.memory <- "4G"
> config
[1] 16

[1] 16

[1] ""

[1] "^1.*"

[1] ""

[1] 4

[1] 4

[1] "4G"

[1] "default"
[1] "/people/rtaylor/Rpackages/sparklyr/conf/config-template.yml"
> spark_version
[1] "1.6.0"
> spark_home
[1] "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41"
> spark_connect(master = "yarn-client", config = config, version = spark_version, spark_home=spark_home)
Error in force(code) :
  Failed while connecting to sparklyr to port (8880) for sessionid (533): Gateway in port (8880) did not respond.
    Path: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-submit
    Parameters: --class, sparklyr.Backend, --jars, '/people/rtaylor/Rpackages/sparklyr/java/spark-csv_2.11-1.3.0.jar','/people/rtaylor/Rpackages/sparklyr/java/commons-csv-1.1.jar','/people/rtaylor/Rpackages/sparklyr/java/univocity-parsers-1.5.1.jar', '/people/rtaylor/Rpackages/sparklyr/java/sparklyr-1.6-2.10.jar', 8880, 533

---- Output Log ----
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: No such file or directory
/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: exec: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: cannot execute: No such file or directory

---- Error Log ----
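
One detail in the attempt above: the memory setting was entered as config$executor.memory, whereas the suggested code uses config$spark.executor.memory. This would not explain the spark-class error in the log, but it may be worth catching. Below is a small sketch for spotting such names. It uses a plain list as a stand-in for the object returned by spark_config(), and unprefixed_keys() is a hypothetical helper. The assumption that every relevant key starts with "spark." or "sparklyr." is a heuristic based on the names seen in this thread.

```r
# Stand-in for the result of spark_config(): a plain named list holding
# only the three settings typed in the session above.
config <- list()
config$spark.driver.cores   <- 4
config$spark.executor.cores <- 4
config$executor.memory      <- "4G"   # as typed; lacks the "spark." prefix

# Hypothetical helper: return the names that do not start with
# "spark." or "sparklyr." (a heuristic, not a sparklyr rule).
unprefixed_keys <- function(config) {
  keys <- names(config)
  keys[!grepl("^(spark|sparklyr)\\.", keys)]
}

print(unprefixed_keys(config))

# Replace the entry with the name used in the suggested code.
config$executor.memory       <- NULL
config$spark.executor.memory <- "4G"
print(unprefixed_keys(config))
```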

Guidance that I got from Mr. Ariga in response to my GitHub posting:

On Fri, Mar 17, 2017 at 6:34 AM, Aki Ariga wrote:

I ran Spark 1.6.2 on CDH 5.10 as follows:

config <- spark_config()
config$spark.driver.cores   <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
spark_home <- "/opt/cloudera/parcels/CDH/lib/spark"
spark_version <- "1.6.2"
#spark_home <- "/opt/cloudera/parcels/SPARK2/lib/spark2"
#spark_version <- "2.0.0"
sc <- spark_connect(master="yarn-client", version=spark_version, config=config, spark_home=spark_home)

See also


Cloudera Employee


Hi, I'm Aki. Thanks for your posting.


Judging from your output message, your `spark_home` setting is wrong.


/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/../lib/spark/bin/spark-submit: line 27: /opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/bin/spark-class: No such file or directory


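`spark-class` should be found at `/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark/bin/spark-class`. So `spark_home` should point to the `lib/spark` directory inside the parcel, not to the parcel root that you passed, or you can use just the `/opt/cloudera/parcels/CDH/lib/spark` path from my example.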

So, you can set an environment variable as follows:

Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark")


Or could you try copying and pasting the following code? As far as I can see from your output, it seems you didn't try my code. I don't know whether the difference in Spark patch version would affect the connection, but I guess you may be able to set spark_version to "1.6.0".


config <- spark_config()
config$spark.driver.cores   <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
spark_home <- "/opt/cloudera/parcels/CDH/lib/spark"
spark_version <- "1.6.2"
sc <- spark_connect(master="yarn-client", version=spark_version, config=config, spark_home=spark_home)
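
Before calling spark_connect() again, the candidate spark_home values mentioned in this thread can be checked for the two launcher scripts named in the error output. This is a minimal sketch in base R. check_spark_home() is a hypothetical helper, not a sparklyr function, and it only tests that the files exist, not that Spark will actually start.

```r
# Hypothetical helper: does this directory contain bin/spark-submit and
# bin/spark-class, the two scripts that appear in the error output?
check_spark_home <- function(spark_home) {
  scripts <- c("spark-submit", "spark-class")
  found <- file.exists(file.path(spark_home, "bin", scripts))
  names(found) <- scripts
  found
}

# Candidate locations taken from this thread.
candidates <- c(
  "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41",
  "/opt/cloudera/parcels/CDH-5.10.0-1.cdh5.10.0.p0.41/lib/spark",
  "/opt/cloudera/parcels/CDH/lib/spark"
)

for (dir in candidates) {
  ok <- check_spark_home(dir)
  status <- if (all(ok)) {
    "both scripts found"
  } else {
    paste("missing:", paste(names(ok)[!ok], collapse = ", "))
  }
  cat(dir, "->", status, "\n")
}
```

A directory that reports both scripts as found is a reasonable value to pass as spark_home. One that reports spark-class as missing would be expected to fail in the same way as the attempts shown earlier.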