Created on 09-02-2015 03:10 PM - edited 09-16-2022 02:39 AM
Hi,
I would like to try RHadoop on CDH 5.3, so I need to install the "rmr2" and "rhdfs" packages in R, but I was surprised to find that these two packages are not available for the current version of R (3.2.1) installed on CentOS.
What should I do? I need these two packages urgently!
Best regards.
Created 09-02-2015 03:27 PM
I haven't tried those packages recently (not since R 3.2 at least), but I know they haven't been updated in a while: https://github.com/RevolutionAnalytics/rhdfs/releases It wouldn't surprise me if they're no longer maintained, especially since Revolution is probably shifting gears now that they're part of MSFT. I don't know; it's really a question for Revo or those open source projects.
Not sure if it helps, but here's a way you could use local Hadoop binaries to read from HDFS and then just pipe the result into R.
Edit your ~/.Renviron to set up Hadoop env variables. For me on my Mac it's:
HADOOP_CMD=/usr/local/Cellar/hadoop/2.7.1/bin/hadoop
HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.1/
HADOOP_CONF_DIR=/Users/srowen/Documents/Cloudera/hadoop-conf/
where hadoop-conf is a copy of the config directory from my cluster.
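If it helps, a quick way to confirm R picked up those variables after restarting it (assuming the same variable names as above):

# Sanity check that ~/.Renviron was read when R started
Sys.getenv(c("HADOOP_CMD", "HADOOP_HOME", "HADOOP_CONF_DIR"))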
Then in R something like:
data <- read.csv(pipe("hdfs dfs -cat /user/sowen/data/part-*"), header=FALSE)
You get the idea.
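Writing results back works the same way; here's a rough sketch (the output path is just an example, adjust to yours) that streams a data frame into HDFS by piping into "hdfs dfs -put -", which reads from stdin:

# Example only: write 'data' back to HDFS through a pipe connection
out <- pipe("hdfs dfs -put - /user/sowen/output/result.csv", "w")
write.csv(data, out, row.names=FALSE)
close(out)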
For rmr2: I'd suggest you don't really want to run MapReduce. 🙂 It's pretty easy to trigger an R script from, say, a Spark task and parallelize a bunch of R scripts across the cluster with its "pipe" command. That's roughly what rmr2 helps you do. You still have to set up R across the cluster.
There's also SparkR on the way, but still pretty green.
Created 09-10-2015 02:12 AM
Thank you so much, "Sowen", for your suggestions!