Community Articles

Find and share helpful community-sourced technical articles.
avatar
Guru

In Data Science, R is commonly used for analytics and data exploration.

When moving to a Hadoop architecture and a connected data platform a big question is what happens to my already existing R scripts?

You can transition nicely to Hadoop using the rHadoop package for R which allows you to read from hdfs and get data back into a dataframe in R.

To enable this you first need to get the R package:

install.packages("rhadoop")

You can also wget the package

wget https://cran.r-project.org/src/contrib/Archive/rHadoopClient/rHadoopClient_0.2.tar.gz

and then

install.packages("/path/to/package")

now you can read a file in using the rHadoopClient:

rHadoopClient::read.hdfs("/path/to/data.csv")

That's all you need to get started.

This allows you to change your file read steps in your R scripts to point to HDFS and still run your R scripts as you are used to doing.

14,871 Views
0 Kudos
Comments
avatar
New Contributor

Hi Vasilis, does this method you have outlined only work when R is installed on an Edge node of the HDP cluster (i.e. R and HDFS are colocated)? I'm exploring how R (say installed in a workstation) can connect to HDFS running on a separate/remote server(s), in which case, I'm unsure how to define the connection details to Hadoop. Are you able to assist?