In Data Science, R is commonly used for analytics and data exploration.
When moving to a Hadoop architecture and a connected data platform, a big question is: what happens to my existing R scripts?
You can transition nicely to Hadoop using the RHadoop packages for R, which allow you to read from HDFS and get the data back into a data frame in R.
To enable this, you first need to get the R package:
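The original command is not shown here; a minimal sketch, assuming the rhdfs package from the RHadoop project installed from a locally downloaded tarball (the version number and the rJava dependency are assumptions), would be:

```r
# Sketch: install rhdfs from a downloaded source tarball.
# Package version is an assumption -- use whatever tarball you downloaded.
install.packages("rJava")   # rhdfs depends on rJava
install.packages("rhdfs_1.0.8.tar.gz", repos = NULL, type = "source")
```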
You can also wget the package:
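For example (the URL and version below are assumptions; the RHadoop tarballs are published on the RevolutionAnalytics GitHub):

```shell
# Download the rhdfs tarball (URL and version are assumptions) and install it
wget https://github.com/RevolutionAnalytics/rhdfs/raw/master/build/rhdfs_1.0.8.tar.gz
R CMD INSTALL rhdfs_1.0.8.tar.gz
```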
Now you can read a file in using the rHadoopClient:
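The original snippet is missing here; a sketch using the rhdfs API (the environment variable value and file path are placeholders for your cluster, and `HADOOP_CMD` must point at your `hadoop` binary):

```r
# Sketch: read a CSV from HDFS into an R data frame via rhdfs.
Sys.setenv(HADOOP_CMD = "/usr/bin/hadoop")  # placeholder path
library(rhdfs)
hdfs.init()

# hdfs.line.reader streams lines from the HDFS file
reader <- hdfs.line.reader("/user/me/data.csv")  # placeholder path
lines  <- reader$read()
reader$close()

# Parse the lines into a data frame just as you would a local file
df <- read.csv(textConnection(lines))
```

From here on, `df` behaves like any other data frame, so downstream analysis code does not need to change.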
That's all you need to get started.
This lets you repoint the file-read steps in your R scripts at HDFS and still run the rest of your scripts as you are used to doing.
Hi Vasilis, does this method you have outlined only work when R is installed on an Edge node of the HDP cluster (i.e. R and HDFS are colocated)? I'm exploring how R (say installed in a workstation) can connect to HDFS running on a separate/remote server(s), in which case, I'm unsure how to define the connection details to Hadoop. Are you able to assist?