HDFS can be accessed from R by specifying the NameNode host directly (hdfs://<hostname>:<port>/user/test), but that URL breaks when the NameNode fails over to its standby. Instead, set the Hadoop configuration directory in R so the HDFS client picks up the NameNode and failover configuration and the cluster can be addressed by its logical HA nameservice.
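
The failover routing comes from the HA properties in hdfs-site.xml inside that configuration directory. Below is a minimal sketch of the relevant properties, assuming the nameservice is named HDFS-HA to match the URL used later; the NameNode IDs, hostnames, and port are illustrative placeholders, not values from this article.

<!-- hdfs-site.xml: minimal HA sketch; NameNode IDs, hostnames, and port are illustrative -->
<property>
  <name>dfs.nameservices</name>
  <value>HDFS-HA</value>
</property>
<property>
  <name>dfs.ha.namenodes.HDFS-HA</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.HDFS-HA.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.HDFS-HA.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.HDFS-HA</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

With these settings in place, clients resolve hdfs://HDFS-HA/... to whichever NameNode is currently active.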

Set the Spark home and the Hadoop configuration directories in the R environment as shown below:

# set up SPARK_HOME
Sys.setenv(SPARK_HOME = "/usr/hdp/current/spark-client")
# set up the Hadoop/YARN config dirs so the HDFS client picks up the HA settings
Sys.setenv(YARN_CONF_DIR = "/usr/hdp/current/hadoop-client/conf")
Sys.setenv(HADOOP_CONF_DIR = "/usr/hdp/current/hadoop-client/conf")

# load the SparkR package shipped with Spark
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# initialize SparkR (spark-csv is pulled in as an optional package)
sc <- sparkR.init(sparkPackages = "com.databricks:spark-csv_2.11:1.0.3")
sqlContext <- sparkRSQL.init(sc)

# read a JSON file using the HA nameservice instead of a single NameNode host
people <- read.df(sqlContext, "hdfs://HDFS-HA/users/people.json", "json")
head(people)
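
Once the DataFrame is loaded through the logical nameservice, it behaves like any other SparkR DataFrame. As a short follow-up sketch, you can query it with Spark SQL and write results back through the same HA URL; the age column and the output path here are assumptions for illustration, not values from the article.

# register the DataFrame as a temp table so it can be queried with SQL
registerTempTable(people, "people")
# filter with Spark SQL (the "age" column is assumed for illustration)
adults <- sql(sqlContext, "SELECT * FROM people WHERE age > 21")
head(adults)
# write the result back via the HA nameservice (output path is illustrative)
write.df(adults, "hdfs://HDFS-HA/users/adults_json", "json")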