Community Articles
Find and share helpful community-sourced technical articles
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here. Want to know more about what has changed? Check out the Community News blog.
Labels (1)

HDP Sandbox comes pre-installed with SparkR and R .

First let's setup the R Studio on the HDP Sandbox:

To download and install RStudio Server open a terminal window and execute the commands corresponding to get the 64-bit version

wget https://download2.rstudio.org/rstudio-server-rhel-0.99.893-x86_64.rpm

sudo yum install --nogpgcheck rstudio-server-rhel-0.99.893-x86_64.rpm

sudo rstudio-server verify-installation

sudo rstudio-server stop

sudo rstudio-server start

Now setup a local user on the hdp sandbox, to access the rstudio:

useradd alex

passwd xxxx

Next launch web-browser and point to : http://sandbox.hortonworks.com:8787/

Login using the local account created earlier above.

3135-screen-shot-2016-04-01-at-124012-pm.png

Next lets initialize the sparkr in "yarn-client" mode

Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client/")


.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths()))


library(SparkR)


sc <- SparkR::sparkR.init(master = "yarn-client") 

Run the code in rstudio , you should get the following output:

> 
> 
> Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client/")
> 
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"),.libPaths()))
> 
> library(SparkR)

Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:base’:

    colnames, colnames<-, intersect, rank, rbind, sample, subset, summary, table, transform

> sc <- SparkR::sparkR.init(master = "yarn-client")
Launching java with spark-submit command /usr/hdp/current/spark-client//bin/spark-submit   sparkr-shell /tmp/RtmpVvKWS8/backend_port38582cab538c 
16/04/01 16:28:00 INFO SparkContext: Running Spark version 1.6.0
16/04/01 16:28:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/01 16:28:01 INFO SecurityManager: Changing view acls to: azeltov
16/04/01 16:28:01 INFO SecurityManager: Changing modify acls to: azeltov
16/04/01 16:28:01 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(azeltov); users with modify permissions: Set(azeltov)
16/04/01 16:28:02 INFO Utils: Successfully started service 'sparkDriver' on port 51539.
16/04/01 16:28:02 INFO Slf4jLogger: Slf4jLogger started
16/04/01 16:28:02 INFO Remoting: Starting remoting
16/04/01 16:28:03 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@192.168.197.154:54056]
16/04/01 16:28:03 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 54056.
16/04/01 16:28:03 INFO SparkEnv: Registering MapOutputTracker
16/04/01 16:28:03 INFO SparkEnv: Registering BlockManagerMaster
16/04/01 16:28:03 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-533d709d-f345-4d26-8046-0ae68a009d13
16/04/01 16:28:03 INFO MemoryStore: MemoryStore started with capacity 511.5 MB
16/04/01 16:28:03 INFO SparkEnv: Registering OutputCommitCoordinator
16/04/01 16:28:03 INFO Server: jetty-8.y.z-SNAPSHOT
16/04/01 16:28:03 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
16/04/01 16:28:03 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/04/01 16:28:03 INFO SparkUI: Started SparkUI at http://192.168.197.154:4040
spark.yarn.driver.memoryOverhead is set but does not apply in client mode.
16/04/01 16:28:04 INFO TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
16/04/01 16:28:04 INFO RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/192.168.197.154:8050
16/04/01 16:28:05 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
16/04/01 16:28:05 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/04/01 16:28:05 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (5120 MB per container)
16/04/01 16:28:05 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
16/04/01 16:28:05 INFO Client: Setting up container launch context for our AM
16/04/01 16:28:05 INFO Client: Setting up the launch environment for our AM container
16/04/01 16:28:05 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/04/01 16:28:05 INFO Client: Preparing resources for our AM container
16/04/01 16:28:05 INFO Client: Using the spark assembly jar on HDFS because you are using HDP, defaultSparkAssembly:hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/04/01 16:28:05 INFO Client: Source and destination file systems are the same. Not copying hdfs://sandbox.hortonworks.com:8020/hdp/apps/2.4.0.0-169/spark/spark-hdp-assembly.jar
16/04/01 16:28:06 INFO Client: Uploading resource file:/tmp/spark-974da35e-9ded-485a-9a13-0fd9a020dbf0/__spark_conf__5714702140855539619.zip -> hdfs://sandbox.hortonworks.com:8020/user/azeltov/.sparkStaging/application_1459455786854_0007/__spark_conf__5714702140855539619.zip
16/04/01 16:28:06 INFO SecurityManager: Changing view acls to: azeltov
16/04/01 16:28:06 INFO SecurityManager: Changing modify acls to: azeltov
16/04/01 16:28:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(azeltov); users with modify permissions: Set(azeltov)
16/04/01 16:28:06 INFO Client: Submitting application 7 to ResourceManager
16/04/01 16:28:06 INFO YarnClientImpl: Submitted application application_1459455786854_0007
16/04/01 16:28:06 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1459455786854_0007 and attemptId None
16/04/01 16:28:07 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:07 INFO Client: 
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: N/A
  ApplicationMaster RPC port: -1
  queue: default
  start time: 1459528086330
  final status: UNDEFINED
  tracking URL: http://sandbox.hortonworks.com:8088/proxy/application_1459455786854_0007/
  user: azeltov
16/04/01 16:28:08 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:09 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:10 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:11 INFO Client: Application report for application_1459455786854_0007 (state: ACCEPTED)
16/04/01 16:28:12 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
16/04/01 16:28:12 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> sandbox.hortonworks.com, PROXY_URI_BASES -> http://sandbox.hortonworks.com:8088/proxy/application_1459455786854_0007), /proxy/application_1459455786854_0007
16/04/01 16:28:12 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
16/04/01 16:28:12 INFO Client: Application report for application_1459455786854_0007 (state: RUNNING)
16/04/01 16:28:12 INFO Client: 
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: 192.168.197.154
  ApplicationMaster RPC port: 0
  queue: default
  start time: 1459528086330
  final status: UNDEFINED
  tracking URL: http://sandbox.hortonworks.com:8088/proxy/application_1459455786854_0007/
  user: azeltov
16/04/01 16:28:12 INFO YarnClientSchedulerBackend: Application application_1459455786854_0007 has started running.
16/04/01 16:28:12 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35844.
16/04/01 16:28:12 INFO NettyBlockTransferService: Server created on 35844
16/04/01 16:28:12 INFO BlockManagerMaster: Trying to register BlockManager
16/04/01 16:28:12 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.197.154:35844 with 511.5 MB RAM, BlockManagerId(driver, 192.168.197.154, 35844)
16/04/01 16:28:12 INFO BlockManagerMaster: Registered BlockManager
16/04/01 16:28:12 INFO EventLoggingListener: Logging events to hdfs:///spark-history/application_1459455786854_0007
16/04/01 16:28:18 INFO YarnClientSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (sandbox.hortonworks.com:43282) with ID 1
16/04/01 16:28:18 INFO BlockManagerMasterEndpoint: Registering block manager sandbox.hortonworks.com:58360 with 511.5 MB RAM, BlockManagerId(1, sandbox.hortonworks.com, 58360)
16/04/01 16:28:33 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)

Validate that the sparkUI is started, look for this message in your output like the one above:

16/04/01 16:28:03 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/04/01 16:28:03 INFO SparkUI: Started SparkUI at http://192.168.197.154:4040

3134-screen-shot-2016-04-01-at-123937-pm.png

Next lets try running some sparkR code:

sqlContext <-sparkRSQL.init(sc)

path <-file.path("file:///usr/hdp/2.4.0.0-169/spark/examples/src/main/resources/people.json")

peopleDF <-jsonFile(sqlContext, path)

printSchema(peopleDF)
You should get this output:
> printSchema(peopleDF)
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)

Now lets try to do some basic dataframe analysis 
# Register this DataFrame as a table.
registerTempTable(peopleDF, "people")
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagersLocalDF <- collect(teenagers)
print(teenagersLocalDF)
8,019 Views
Comments
New Contributor

Hi @azeltov,

Could you run R-Studio remote, i.e. on your laptop and connect to a cluster?

Cheers,

Christian

New Contributor

I installed HDP2.4 for Virtual box and it is telling me that R is not found. Are you sure that R is preinstalled? What is the fix for this?

New Contributor

This did not work. When I verify my installation it gives me: Unable to find an installation of R on the system (which R didn't return valid output); Unable to locate R binary by scanning standard locations.

What to do?

It seems the new version of Sandbox does not have R pre-installed. Its an easy installation procedure :

sudo yum install -y epel-release 
sudo yum update -y 
sudo yum install -y R

It seems the new version of Sandbox does not have R pre-installed. Its an easy installation procedure:

sudo yum install -y epel-release 
sudo yum update -y 
sudo yum install -y R
Super Collaborator

HI @azeltov, I am trying to install R-studio on Hortonworks sandbox 2.5, running through the exception in verify installation step:

initctl: Unable to connect to Upstart: Failed to connect to socket /com/ubuntu/upstart: Connection refused

I have tried starting, stopping rstudio server, it shows the same message.

PS: Since it is a docker container, 8787 port is not opened so I have configured /etc/rstudio/rserver.conf to use port 9000.

Don't have an account?
Coming from Hortonworks? Activate your account here
Version history
Revision #:
2 of 2
Last update:
‎08-17-2019 12:57 PM
Updated by:
 
Contributors
Top Kudoed Authors