I have a 3-node cluster running Cloudera 5.9 on CentOS 6.7. I need to connect the R packages running on my laptop to Spark running in cluster mode on Hadoop.
However, when I try to connect my local R session to the cluster's Spark through sparklyr, it gives an error, because it looks for a Spark home on the laptop itself.
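As a minimal sketch of the usual workaround: sparklyr generally needs to run on a machine that has both a local Spark installation and the Hadoop client configuration, typically a cluster gateway (edge) node rather than a laptop. The paths below are assumptions for a default CDH parcel layout and must be adjusted to your installation.

```r
# Sketch only, assuming R runs on a cluster edge node, not the laptop.
library(sparklyr)

# Assumed CDH parcel path for Spark; verify on your cluster.
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark")
# Assumed location of the Hadoop client configuration.
Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")

# Connect to Spark on YARN instead of a local Spark instance.
sc <- spark_connect(master = "yarn-client")
```

Running this from a laptop without the cluster's client configs is what triggers the "Spark home not found" style errors described above.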
I googled and found that we can install SparkR and use R with Spark. However, I have a few questions regarding this.
Please help; I am new to this and really need guidance.
Thanks for the reply, @srowen.
The best way to install R and then install SparkR on top of it is described here: http://blog.clairvoyantsoft.com/2016/11/installing-sparkr-on-a-hadoop-cluster/
I was able to install them by following this link. It is really useful and up to date.
This link opens fine: http://site.clairvoyantsoft.com/installing-sparkr-on-a-hadoop-cluster/
but in Step f under Installation, the link https://github.com/apache/spark/archive/ is not working as expected.
Can you please provide the correct location? We are using CDH 6.3.3, and the Spark version is 2.4.0.
Please replace steps f to j with what @singh101 suggested in one of the comments above:
https://community.cloudera.com/t5/Support-Questions/Run-SparkR-or-R-package-on-my-Cloudera-5-9-Spark... . The idea is that we use the binaries from the CDH parcel instead of downloading them from upstream.
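A rough sketch of that idea (the paths are assumptions for a default parcel layout, not the exact steps from the linked comment): point SPARK_HOME at the Spark shipped in the CDH parcel rather than at an upstream tarball, so the build matches the cluster.

```shell
# Assumed default CDH parcel location; verify with your Cloudera Manager setup.
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
# Assumed Hadoop client configuration directory on the node.
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Confirm that the parcel's Spark (not a downloaded one) is being picked up.
"$SPARK_HOME/bin/spark-submit" --version
```

This keeps SparkR and the cluster on the same Spark build, which avoids the version-mismatch issues that downloading from the upstream archive can cause.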
On a side note: CDP Base provides SparkR out of the box (in case you plan to upgrade in the near future).