I have configured a 4-node cluster on CDH 5.3 and now I want to install RHadoop on it, but I couldn't find documentation on how to do that.
Can you provide some details?
I don't think there's anything special to know, beyond what's documented in the RHadoop subprojects. So it's not something that we ship, support or document separately. I have set up the rhadoop libraries with CDH and it's straightforward.
It's really a set of client-side libraries that you install into *R*, not *Hadoop*. However, to run rmr2 you will need R installed locally on all of your Hadoop cluster nodes, since it runs MapReduce jobs that execute R scripts.
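A quick way to confirm that each node has what it needs is to check which commands are on the PATH. This is only a minimal sketch: `missing_commands` is a hypothetical helper name, and the list of commands to check (R, Rscript, hadoop) is an assumption about what your jobs will call.

```shell
# Hypothetical helper: print each named command that is not on PATH.
# Returns non-zero if anything is missing.
missing_commands() {
    status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            status=1
        fi
    done
    return "$status"
}

# Run on each cluster node, for example:
# missing_commands R Rscript hadoop || echo "install the commands listed above on $(hostname)"
```

Running it on every node (by hand or through whatever remote shell tooling you already use) should show quickly whether any node is missing R.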
I recall that you have to install a bunch of other R packages before installing the rhdfs/rhbase/plyrmr libraries, and I found this in my notes as the set of prerequisites:
install.packages(c("Rcpp", "RJSONIO", "bitops", "digest",
"functional", "reshape2", "stringr", "plyr", "caTools", "rJava",
"dplyr", "R.methodsS3", "Hmisc"))
I am working on installing R and RStudio on CDH-5.3.2, but I hit an issue when installing rmr2:
install.packages("/home/ec2-user/R/rmr2_3.3.1.tar.gz", repos = NULL, type="source")
[javac] /tmp/RtmpVvuf0G/R.INSTALL341d4f985503/rmr2/src/hbase-io/src/java/com/dappervision/hbase/mapred/TypedBytesTableInputFormatBase.java:164: error: cannot find symbol
[javac] String regionLocation = table.getRegionLocation(startKeys[startPos]).
[javac] symbol: method getServerAddress()
[javac] location: class HRegionLocation
In the source code, line 164 is like this:
String regionLocation = table.getRegionLocation(startKeys[startPos]).
I searched the API and could not find a getServerAddress() method on HRegionLocation.
I downloaded rmr2 from this link: https://github.com/RevolutionAnalytics/RHadoop/wiki (following these instructions: https://ashokharnal.wordpress.com/2014/01/16/installing-r-rhadoop-and-rstudio-over-cloudera-hadoop-e...). So the issue could be that this tar.gz file is meant for CDH-4.
Do you know where I can download the source code for CDH-5?
I suspect it is because the rmr2 integration code is compatible with an older version of HBase than what is shipped in CDH 5.3.
The link you cited returns a 404 for me, but it seems to me that you are in fact using the latest rmr2 and building from source, which is the right thing to do. I have installed rmr2 on CDH 5.2 before. There aren't special versions you need to find.
I dug out my notes to myself on how I installed several of these libs before. Maybe they help? For example I installed them differently with R CMD. Of course you may wish to use later and more recent versions of these libraries than what's mentioned in the notes.
Basically you just...
export HADOOP_CMD=`which hadoop`
R
...
library(plyrmr)
and go to it.

HOW TO

Copy packages
rmr2_3.1.0.tar.gz
rhdfs_1.0.8.tar.gz
plyrmr_0.2.0.tar.gz
to nodes at, say, /tmp.

For each node:
export HADOOP_CMD=`which hadoop`
export HADOOP_STREAMING=`ls /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming-*.jar`

As root, install R:
yum install R
This installs version 3.0.2 on my cluster.

Run R to install some dependencies:
R --vanilla

Once in R:
install.packages(c("Rcpp", "RJSONIO", "bitops", "digest", "functional", "reshape2", "stringr", "plyr", "caTools", "rJava", "dplyr", "R.methodsS3", "Hmisc"))
(choose a mirror that's local when you are prompted)

Install packages, back on the command line:
R CMD INSTALL /tmp/rmr2_3.1.0.tar.gz
R CMD INSTALL /tmp/rhdfs_1.0.8.tar.gz
R CMD INSTALL /tmp/plyrmr_0.2.0.tar.gz
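One thing to watch in the HADOOP_STREAMING line from the notes: if the glob matches no file or more than one, the variable ends up empty or holding several paths. A small guard can catch that. This is a sketch only, and `find_streaming_jar` is a hypothetical helper name, not part of RHadoop or CDH.

```shell
# Hypothetical helper: print the single hadoop-streaming-*.jar found in a
# directory, or fail if none or several are found.
find_streaming_jar() {
    dir=$1
    count=0
    jar=
    for f in "$dir"/hadoop-streaming-*.jar; do
        [ -e "$f" ] || continue      # skip the unexpanded pattern when nothing matches
        count=$((count + 1))
        jar=$f
    done
    if [ "$count" -ne 1 ]; then
        echo "expected exactly one hadoop-streaming jar in $dir, found $count" >&2
        return 1
    fi
    printf '%s\n' "$jar"
}

# Example use, with the directory from the notes above:
# HADOOP_STREAMING=$(find_streaming_jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce) && export HADOOP_STREAMING
```

If your parcels live somewhere else, pass that lib/hadoop-mapreduce directory instead.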
The rmr2, rhdfs and plyrmr versions I downloaded are:
-rw-r--r-- 1 ec2-user ec2-user 28287 Apr 10 18:24 plyrmr_0.6.0.tar.gz
-rw-r--r-- 1 ec2-user ec2-user 25105 Apr 10 18:24 rhdfs_1.0.8.tar.gz
-rw-r--r-- 1 ec2-user ec2-user 63087 Apr 10 18:24 rmr2_3.3.1.tar.gz
And my 5-node cluster in EC2 runs:
[root@ip-172-30-2-9 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.5 (Santiago)
These directions are good.
But when I try to install
R CMD INSTALL /tmp/rmr2_3.1.0.tar.gz
R CMD INSTALL /tmp/rhdfs_1.0.8.tar.gz
R CMD INSTALL /tmp/plyrmr_0.2.0.tar.gz
I get an error:
[root@hostname:username]# R CMD INSTALL rmr2_3.1.0.tar.gz
Error in getOctD(x, offset, len) : invalid octal digit
I have tried different repos but I'm at a loss.
Any thoughts would be helpful.
I suspect it's some issue with the version of tar you have on your system? BSD vs GNU? Just a guess. That, or maybe a corrupted file? The latest rmr2 archive uncompressed OK for me on OS X. https://github.com/RevolutionAnalytics/rmr2/releases
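To rule out a corrupted or incomplete download before retrying, you could test the archive directly. This is a rough sketch; `check_tarball` is a hypothetical helper name, and it only checks that the file is valid gzip and that tar can list its contents.

```shell
# Hypothetical helper: succeed only if the file is a readable
# gzip-compressed tar archive.
check_tarball() {
    if ! gzip -t "$1" 2>/dev/null; then
        echo "$1: not a valid gzip file" >&2
        return 1
    fi
    # Decompress to stdout and let tar list the entries without extracting.
    if ! gzip -dc "$1" | tar -tf - >/dev/null 2>&1; then
        echo "$1: gzip is fine but tar could not list the archive" >&2
        return 1
    fi
}

# Example use:
# check_tarball rmr2_3.1.0.tar.gz && R CMD INSTALL rmr2_3.1.0.tar.gz
```

If the check fails, downloading the file again is the first thing to try. `tar --version` will also tell you which tar implementation is on the machine.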
When I try to install rmr2_3.3.1.tar.gz on CDH 5.7.4, I get the following error. Can you help?
Thank you very much,
build_linux.sh: line 163: [: missing `]'
Using /ec2_oth/cloudera/parcels/CDH-5.7.4-1.cdh5.7.4.p0.2 as hadoop home
Using /ec2_oth/cloudera/parcels/CDH-5.7.4-1.cdh5.7.4.p0.2/lib/hbase as hbase home
Copying libs into local build directory
ls: cannot access /ec2_oth/cloudera/parcels/CDH-5.7.4-1.cdh5.7.4.p0.2/hadoop-*-core.jar: No such file or directory
Cannot find hadoop-streaming jar in hadoop homei
cp: cannot stat `build/dist/*': No such file or directory
can't build hbase IO classes, skipping
installing to /usr/lib64/R/library/rmr2/libs
** byte-compile and prepare package for lazy loading
Warning in library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘quickcheck’
Note: no visible binding for '<<-' assignment to '.Last'
Note: no visible binding for '<<-' assignment to '.Last'
*** installing help indices
converting help for package ‘rmr2’
finding HTML links ... done
** building package indices
** testing if installed package can be loaded
Warning: S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found
* DONE (rmr2)