RHadoop on CDH5.3, rmr2 and rhdfs packages are not available for R 3.2.1

Explorer

Hi,

 

I would like to try RHadoop on CDH 5.3, which means I need to install the "rmr2" and "rhdfs" packages in R. However, I was surprised to find that these two packages are not available for the recent version of R (3.2.1) installed on my CentOS machine.

 

What should I do? I need these two packages urgently!

 

Best regards.

1 ACCEPTED SOLUTION

Master Collaborator

I haven't tried those packages in a while (not since R 3.2 at least), but I know they haven't been updated in some time: https://github.com/RevolutionAnalytics/rhdfs/releases. It wouldn't surprise me if they're no longer maintained, especially given that Revolution is probably shifting gears now that they're part of MSFT. I don't know; it's really a question for Revo or those open source projects.

 

Not sure if it helps, but here's a way you could use local Hadoop binaries to read from HDFS and then just pipe the result into R.

 

Edit your ~/.Renviron to set up Hadoop env variables. For me on my Mac it's:

 

HADOOP_CMD=/usr/local/Cellar/hadoop/2.7.1/bin/hadoop
HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.1/
HADOOP_CONF_DIR=/Users/srowen/Documents/Cloudera/hadoop-conf/

 

where hadoop-conf is a copy of the config directory from my cluster.
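
If it helps, here's a quick sanity check from a fresh R session that those variables are being picked up (the paths are just whatever you set in your own ~/.Renviron):

Sys.getenv(c("HADOOP_CMD", "HADOOP_HOME", "HADOOP_CONF_DIR"))
system2(Sys.getenv("HADOOP_CMD"), "version")   # should print your Hadoop client version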

 

Then in R something like:

 

data <- read.csv(pipe("hdfs dfs -cat /user/sowen/data/part-*"), header=FALSE)

You get the idea.
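
The same trick works in the other direction too, if you want to push results back; a rough sketch (the output path here is just a made-up example, and "hdfs dfs -put -" reads from stdin):

out <- pipe("hdfs dfs -put -f - /user/sowen/data-out/results.csv", open = "w")
write.csv(data, out, row.names = FALSE)   # stream the data frame straight into HDFS
close(out)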

 

For rmr2: I'd suggest you don't really want to run MapReduce. 🙂 It's pretty easy to trigger an R script from, say, a Spark task and parallelize a bunch of R scripts across the cluster with Spark's "pipe" command. That's roughly what rmr2 helps you do. You still have to set up R across the cluster.
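
For what it's worth, the R end of that pattern is just a script that reads records on stdin and writes results on stdout, which is what Spark's pipe() would run against each partition. A hypothetical worker, only to show the shape of it:

#!/usr/bin/env Rscript
# toy worker: one CSV record per input line in, one result per line out
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  fields <- as.numeric(strsplit(line, ",")[[1]])
  cat(sum(fields), "\n", sep = "")   # placeholder "work": sum the record's fields
}
close(con)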

 

There's also SparkR on the way, but still pretty green.


3 REPLIES


Explorer

Thank you so much, Sowen, for your suggestions!

New Contributor
Sowen, I am new to R and its configuration. I am working on a project where I need to import some data from HDFS (a Cloudera CDH cluster) into my R/RStudio environment running on Windows. I have R/RStudio installed on Windows, and while trying to install the rhdfs package I found that it requires setting the HADOOP_CMD environment variable to point to the Hadoop binaries.

My Hadoop cluster is running on Linux. Any suggestions on how I can set this HADOOP_CMD variable to point to the Hadoop binaries from my Windows R environment? Thank you!