
How to change SparkR working directory? How to find default SparkR working directory on HDFS?

Contributor

Hello,

I know these questions seem very basic, but there appears to be a discrepancy between the HDFS structure in my SparkR session and what I see in Ambari. In SparkR, the default working directory is "/usr/hdp/2.4.0.0-169/spark". But in Ambari I don't see /usr, only /user, which does contain a /spark directory, but that just contains an empty /.sparkStaging directory.

I have tried to change the working directory with setwd(), but if I just pass a directory path as a string, e.g. "/user/", it throws the error "cannot change working directory". The only directory I seem to be able to change to is /tmp.

I could include more details, but I think I am missing something basic here, which will probably solve lots of other questions. Help please?

Thanks

Aidan

1 ACCEPTED SOLUTION

Contributor

Hello,

Thanks everyone. As it turned out, some Ambari features were in maintenance mode, which meant there actually was a discrepancy between the discoverable folder structures. Turning off maintenance mode and rebooting did the trick!

Thanks

Aidan


6 REPLIES

Super Guru

@Aidan Condron

I believe the working directory will be on the local filesystem, i.e. under /usr. The /user directory, however, is an HDFS location that holds each user's home directory, and Spark uses it as a staging area. You need to point setwd() to a local path instead of an HDFS path.
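
As a quick illustration (a minimal sketch; the HDFS file path below is hypothetical), setwd() only changes the local R working directory, while HDFS content is read through the SparkR API:

> setwd("/tmp")                                   # local filesystem path: works
> # setwd("/user/hdfs")                           # HDFS path: fails, it is not on the local filesystem
> sqlContext <- sparkRSQL.init(sc)
> df <- read.df(sqlContext, "/user/hdfs/somefile.txt", "text")   # HDFS path, read via SparkR
> head(df)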

Super Guru

Below is an example I used to access HDFS from SparkR.

-bash-4.1$ hadoop fs -ls /user/hdfs/passwd
-rw-r--r--   3 hdfs hdfs       2296 2016-06-09 16:29 /user/hdfs/passwd
-bash-4.1$
-bash-4.1$ SparkR
> sqlContext <- sparkRSQL.init(sc)
> people <- read.df(sqlContext,"/user/hdfs/passwd", "text")
> head(people)

If you created the Hive table in a non-default location, then use the command below to see the underlying HDFS location.

hive> desc extended tablename;
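
Once you know the underlying location, you can also read the table's files directly from SparkR (a sketch; the path and storage format below are hypothetical):

> sqlContext <- sparkRSQL.init(sc)
> df <- read.df(sqlContext, "/custom/hive/warehouse/mytable", "parquet")   # hypothetical location and format
> head(df)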

To access Hive through SparkR:

-bash-4.1$ SparkR

> hiveContext <- sparkRHive.init(sc) 
> sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") 
> sql(hiveContext, "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries can be expressed in HiveQL. 
> results <- sql(hiveContext, "FROM src SELECT key, value")
# results is now a DataFrame
> head(results)
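
If the table's data lives outside the default warehouse directory, one option (a sketch; the LOCATION path below is hypothetical) is to register it as an external table and query it the same way:

> sql(hiveContext, "CREATE EXTERNAL TABLE IF NOT EXISTS src_ext (key INT, value STRING) LOCATION '/custom/hdfs/path'")
> results <- sql(hiveContext, "SELECT key, value FROM src_ext")
> head(results)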

Super Guru

Hi @Aidan Condron, did you try the above steps?

Master Mentor

@Aidan Condron

You can change other elements of the default configuration by modifying spark-env.sh; see the sketch after this list. The settings you can change include:

SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
SPARK_WORKER_CORES, to set the number of cores to use on this machine
SPARK_WORKER_MEMORY, to set how much memory to use (for example, 1000m or 2g)
SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT, to use non-default ports
SPARK_WORKER_INSTANCES, to set the number of worker processes per node
SPARK_WORKER_DIR, to set the working directory of worker processes
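
A minimal sketch of what such a spark-env.sh might look like (the values are illustrative only, not recommendations):

export SPARK_MASTER_WEBUI_PORT=8081          # non-default master web UI port
export SPARK_WORKER_CORES=4                  # cores each worker may use
export SPARK_WORKER_MEMORY=2g                # memory each worker may use
export SPARK_WORKER_INSTANCES=1              # worker processes per node
export SPARK_WORKER_DIR=/var/spark/work      # working directory of worker processes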

Contributor

Thanks guys, but those answers aren't quite on point. I suppose the real question is how to access HDFS through SparkR. For example, I know Hive tables are accessible, but if they are not in the default /apps/warehouse/ location, how do I find and read them? Thanks a million!
