Created 09-20-2015 02:44 PM
Please accept my sincere salutations.
I installed a Cloudera 5.3 virtual machine, where I also installed RStudio. I ran some tasks in RStudio on a small dataset locally, and now I would like to repeat the same tasks on a large dataset (that is, Big Data). I can't do that locally, so I need to do it in the cloud.
Could you advise me on the best way to do this job in the cloud? Do Cloudera or RStudio allow tasks to be moved to the cloud? What is the best free simulator (if one exists)?
Thank you for your suggestions!!
Created 09-21-2015 06:20 PM
Hi RMG,
There is a package called RHadoop which will allow you to write R programs that run in a Hadoop cluster. For more information, see: RHadoop
Before moving right to a cloud-based solution, I would suggest trying out RHadoop on your VM first. The best option, and the tool I use most, is Cloudera's Quickstart VM. You can download your preferred version of the VM and run it on your laptop. This is essentially a single-node cluster; while you can't use it for Big Data processing, it is a great way to get some experience with RHadoop. I would recommend doing this as a learning step prior to creating a cluster in a public (or private) cloud.
To install RHadoop, google for instructions. I found these instructions, but have not tested them: RHadoop Installation in Cloudera Quickstart VM
Finally, you can spin up a Hadoop cluster in your favorite public cloud and follow the same link above to install RHadoop in the cluster.
If you are unfamiliar with using public clouds, it is best to do some reading first. Take care when running your cluster: fees add up quickly if you leave your VMs (instances) running for a long time. Monitor the charges and shut down the VMs when they are not in use.
Also see Cloudera Director, Cloudera Live, and the Cloudera Demo tutorial here: http://www.cloudera.com/content/cloudera/en/products-and-services/cloudera-live.html
HTH
Created 09-21-2015 10:28 PM
Thank you for your suggestions, Sue. That is exactly what I did: I installed RHadoop on Cloudera, along with R and RStudio, for my analysis (single-node cluster). I completed this step successfully, and now I am looking for a free public cloud that lets me run my code on a large dataset. But what I have found doesn't meet my needs, because none of the clouds I found are free.
What can I do, please?
Created 09-22-2015 04:36 AM
That's great, RMG. I don't know of a free 'simulator', so I can't help you there. However, cloud vendors do provide trial offers, such as Google for Google Cloud Platform. There is currently a 60-day/$300 trial offer here: https://cloud.google.com/free-trial/ You may also find similar Amazon AWS trials available.
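To get a feel for how far a trial credit like this stretches, here is a minimal sketch in Python. The $300 credit and 60-day window come from the offer mentioned above; the node count and the hourly rate are purely hypothetical placeholders, so check the vendor's actual pricing before relying on any number.

```python
# Rough estimate of how long a trial credit lasts for a small cluster.
# The credit and trial length come from the offer quoted above; the node
# count and hourly rate are hypothetical placeholders, not real prices.

def hours_of_cluster_time(credit_usd, nodes, usd_per_node_hour):
    """Return how many hours the whole cluster can run before the credit is spent."""
    if nodes <= 0 or usd_per_node_hour <= 0:
        raise ValueError("nodes and usd_per_node_hour must be positive")
    return credit_usd / (nodes * usd_per_node_hour)

TRIAL_CREDIT_USD = 300.0   # from the offer quoted above
TRIAL_DAYS = 60            # from the offer quoted above
NODES = 4                  # assumption: a small 4-node cluster
USD_PER_NODE_HOUR = 0.25   # assumption: placeholder rate, not a quoted price

hours = hours_of_cluster_time(TRIAL_CREDIT_USD, NODES, USD_PER_NODE_HOUR)
hours_per_day = hours / TRIAL_DAYS

print(f"Cluster hours covered by the credit: {hours:.0f}")
print(f"Average hours per day over the trial: {hours_per_day:.1f}")
```

Under these assumed numbers the credit would be gone well before the trial ends if the cluster were left running around the clock, which is why shutting instances down between sessions matters.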
Created 07-07-2016 07:55 AM
I would recommend the SparkR package, which works similarly to the dplyr package. I find it a lot easier to use than RHadoop, which is still based on MapReduce under the hood. The big data community is moving rapidly towards Spark. For more information about SparkR, please see the Cloudera community post below.
https://community.cloudera.com/t5/Data-Science-and-Machine/Spark-R-in-Cloudera-5-3-0/td-p/37706
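For readers unfamiliar with the MapReduce model mentioned above, here is a small conceptual sketch of its three stages (map, shuffle, reduce) as a word count. It is plain Python running in a single process, not the RHadoop or SparkR API; all function names are made up for illustration. On a real cluster the same stages are distributed across many nodes.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record and yield (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    """Sum the counts collected for one word."""
    return sum(values)

# Tiny in-memory input standing in for files stored in HDFS.
lines = ["big data with R", "R on Hadoop", "big clusters"]

counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

With MapReduce-based tools you write the mapper and reducer yourself, much like `word_mapper` and `sum_reducer` here, whereas a dplyr-style interface expresses the same job as a group-and-count over a data frame.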