
In Data Science, R is commonly used for analytics and data exploration.

When moving to a Hadoop architecture and a connected data platform, a common question is: what happens to my existing R scripts?

You can transition smoothly to Hadoop using the rHadoopClient package for R, which lets you read files from HDFS and get the data back into a data frame in R.

To enable this, first install the R package (note the name is rHadoopClient, matching the calls below):

install.packages("rHadoopClient")

If the package is no longer available from your CRAN mirror (it has been moved to the CRAN archive), you can also download the source tarball directly with wget:

wget https://cran.r-project.org/src/contrib/Archive/rHadoopClient/rHadoopClient_0.2.tar.gz

and then install it from the local file (repos = NULL tells install.packages to treat the argument as a file path rather than a CRAN package name):

install.packages("/path/to/rHadoopClient_0.2.tar.gz", repos = NULL, type = "source")

Now you can read a file from HDFS using rHadoopClient:

rHadoopClient::read.hdfs("/path/to/data.csv")
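A slightly fuller sketch of this step, assuming read.hdfs parses the CSV contents into a data frame (the variable name df is illustrative, and the machine running R is assumed to have access to the cluster's HDFS client tools):

```r
# Read a CSV stored in HDFS into an R object
df <- rHadoopClient::read.hdfs("/path/to/data.csv")

# From here the usual base R exploration calls apply,
# just as they would for a locally loaded data frame
str(df)
head(df)
summary(df)
```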

That's all you need to get started.

This lets you point the file-read steps in your R scripts at HDFS while running the rest of your analysis code exactly as you are used to.
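For example, migrating an existing script can be a one-line change at the read step, leaving the downstream analysis untouched (the paths, column names, and model formula below are purely illustrative):

```r
# Before: reading from the local filesystem
# df <- read.csv("/local/path/to/data.csv")

# After: reading the same data from HDFS
df <- rHadoopClient::read.hdfs("/path/to/data.csv")

# Downstream analysis code stays exactly the same,
# e.g. a simple linear model over hypothetical columns x and y
model <- lm(y ~ x, data = df)
summary(model)
```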

Comments
Not applicable

Hi Vasilis, does this method you have outlined only work when R is installed on an Edge node of the HDP cluster (i.e. R and HDFS are colocated)? I'm exploring how R (say installed in a workstation) can connect to HDFS running on a separate/remote server(s), in which case, I'm unsure how to define the connection details to Hadoop. Are you able to assist?

Version history
Revision #: 1 of 1
Last update: 05-18-2016 07:32 PM
Updated by: vnv
 