New Contributor
Posts: 1
Registered: ‎05-11-2017

How to pull data from Hadoop into R

Community, 

I have an application server running RHEL and RStudio Server Pro that I manage.

All I need to do for now is use R to read and write data on our Hadoop cluster so I can process it further on my application server. I found a lot of information online about running R on the Hadoop nodes themselves, but that is not what I am looking for right now.

Any ideas? What drivers and packages do I need to install to make this happen?

Thank you.

Cloudera Employee
Posts: 1
Registered: ‎07-10-2017

Re: How to pull data from Hadoop into R

Hello,

I understand that you would like to interact with data on the cluster from within R.

One idea is to use HttpFS with the R curl package (see the sketch after the list below):

https://cran.r-project.org/web/packages/curl/vignettes/intro.html

Apache Hadoop HttpFS is a service that provides HTTP access to HDFS.

HttpFS has a REST HTTP API supporting all HDFS filesystem operations (both read and write).

Common HttpFS use cases are:

  • Read and write data in HDFS using HTTP utilities (such as curl or wget) and HTTP libraries from languages other than Java.
  • Transfer data between HDFS clusters running different versions of Hadoop (overcoming RPC versioning issues), for example using Hadoop DistCp.
  • Access HDFS from behind a firewall. Reaching WebHDFS directly via the NameNode web UI port (default 50070) requires access to every data host in the cluster, because WebHDFS redirects clients to the DataNode port (default 50075). When the cluster is behind a firewall, Cloudera recommends the HttpFS server instead: it acts as a gateway and is the only system that needs to send and receive data through the firewall.
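
To make this concrete, here is a minimal sketch of reading and writing a CSV file over HttpFS from R using the curl package. The host "httpfs-host", the port 14000 (the HttpFS default), the /tmp paths, and the user name are placeholders you would replace for your environment, and it assumes the cluster accepts simple (user.name) authentication; a Kerberized cluster would additionally need SPNEGO authentication.

    library(curl)

    user <- "myuser"                              # placeholder HDFS user
    base <- "http://httpfs-host:14000/webhdfs/v1" # HttpFS listens on 14000 by default

    # Read a CSV file from HDFS (op=OPEN) straight into a data frame
    read_url <- paste0(base, "/tmp/example.csv?op=OPEN&user.name=", user)
    res <- curl_fetch_memory(read_url)
    df  <- read.csv(text = rawToChar(res$content))

    # Write the data frame back to HDFS (op=CREATE); HttpFS accepts the file
    # body in the same request when data=true is set and the content type
    # is application/octet-stream
    csv_text  <- paste(capture.output(write.csv(df, row.names = FALSE)),
                       collapse = "\n")
    write_url <- paste0(base, "/tmp/out.csv?op=CREATE&overwrite=true",
                        "&data=true&user.name=", user)
    h <- new_handle()
    handle_setopt(h, customrequest = "PUT", postfields = csv_text)
    handle_setheaders(h, "Content-Type" = "application/octet-stream")
    put_res <- curl_fetch_memory(write_url, handle = h)
    stopifnot(put_res$status_code == 201)         # HDFS returns 201 Created on success
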

A more turnkey alternative would be the Cloudera Data Science Workbench. Have you given it a try?

www.cloudera.com/products/data-science-and-engineering/data-science-workbench.html

Cheers!

Manuel
