
RHadoop on CDH5.3, rmr2 and rhdfs packages are not available for R 3.2.1

Explorer

Hi,

 

I would like to try RHadoop on CDH 5.3, which means I need to install the "rmr2" and "rhdfs" packages in R. I was surprised to find that these two packages are not available for the current version of R (3.2.1) installed on CentOS.

 

What should I do? I need these two packages urgently!

 

Best regards.

1 ACCEPTED SOLUTION

Master Collaborator

I haven't tried those packages in a while (not since R 3.2, at least), but I know they haven't been updated recently: https://github.com/RevolutionAnalytics/rhdfs/releases. It wouldn't surprise me if they're no longer maintained, especially since Revolution Analytics is probably shifting gears now that they're part of Microsoft. I don't know for sure; it's really a question for Revolution or those open source projects.

 

Not sure if it helps, but here's a way you could use local Hadoop binaries to read from HDFS and then just pipe the result into R.

 

Edit your ~/.Renviron to set up Hadoop env variables. For me on my Mac it's:

 

HADOOP_CMD=/usr/local/Cellar/hadoop/2.7.1/bin/hadoop
HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.1/
HADOOP_CONF_DIR=/Users/srowen/Documents/Cloudera/hadoop-conf/

 

where hadoop-conf is a copy of the config directory from my cluster.
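
If you want to double-check that R is picking those up, here's a quick sanity check from a fresh R session (plain base R, nothing RHadoop-specific):

# Values set in ~/.Renviron should show up in any new R session
Sys.getenv(c("HADOOP_CMD", "HADOOP_HOME", "HADOOP_CONF_DIR"))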

 

Then in R something like:

 

data <- read.csv(pipe("hdfs dfs -cat /user/sowen/data/part-*"), header=FALSE)

You get the idea.
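
Writing back out should work the same way in principle, though I haven't tried this exact line and the output path here is made up ("hdfs dfs -put -" reads from stdin):

# Hypothetical output path; "-put -" streams stdin up to HDFS
write.csv(data, pipe("hdfs dfs -put - /user/sowen/output/result.csv"), row.names = FALSE)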

 

For rmr2: I'd suggest you don't really want to run MapReduce. 🙂 It's pretty easy to trigger an R script from, say, a Spark task and parallelize a bunch of R scripts across the cluster with its "pipe" command. That's roughly what rmr2 helps you do. You still have to set up R across the cluster.
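
The R end of that is nothing special: a script that reads records from stdin and writes results to stdout, which is what Spark's pipe() (or Hadoop Streaming, for that matter) would invoke. A toy sketch, with made-up "processing":

#!/usr/bin/env Rscript
# Each input record arrives as one line on stdin; emit one result line per record.
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- strsplit(line, ",")[[1]]
  cat(sum(as.numeric(fields)), "\n", sep = "")   # placeholder logic: sum the fields
}
close(con)

Something like rdd.pipe("Rscript worker.R") then runs that script against each partition; as noted, R (and the script) still has to be present on every node.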

 

There's also SparkR on the way, but still pretty green.


REPLIES


Explorer

Thank you so much "Sowen" for your suggestions!

New Contributor
Sowen, I am new to R and its configuration. I am working on a project where I need to import some data from HDFS (a Cloudera CDH cluster) into my R/RStudio environment running on Windows.
I have R/RStudio installed on Windows, and while trying to install the rhdfs package I found that it requires setting the HADOOP_CMD environment variable to point to the Hadoop binaries.

My Hadoop cluster is running on Linux. Any suggestions on how I can set this HADOOP_CMD variable to point to the Hadoop binaries from my Windows R environment? Thank you!