Created 06-09-2016 02:54 PM
Hello,
I know these questions seem very basic, but there seems to be a discrepancy between the HDFS structure in my SparkR session and what I see in Ambari. In SparkR, the default working directory is "/usr/hdp/2.4.0.0-169/spark". But in Ambari I don't see /usr, only /user, which does contain a /spark directory, but that just contains an empty /.sparkStaging directory.
I have tried to change the working directory with setwd(), but if I pass a directory path as a string, e.g. "/user/", it throws the error "cannot change working directory". I can only seem to change to /tmp.
I could include more details, but I think I am missing something basic here, which will probably solve lots of other questions. Help please?
Thanks
Aidan
Created 06-09-2016 03:04 PM
I believe the working directory is on the local filesystem, i.e. under /usr. The /user directory, however, is an HDFS location that holds each user's home directory, and Spark uses it as a staging area. You need to point setwd() to a local path instead of an HDFS path.
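For example (a minimal sketch, assuming the SparkR shell on HDP 2.4 / Spark 1.6 where sc already exists; the HDFS file path below is made up):
> setwd("/tmp")                      # setwd() only touches the local Linux filesystem of the node running R
> setwd("/user/hdfs")                # fails: /user is an HDFS path, not a local directory
> sqlContext <- sparkRSQL.init(sc)   # HDFS is reached through Spark, not through setwd()
> df <- read.df(sqlContext, "/user/hdfs/somefile.txt", "text")   # hypothetical HDFS file
> head(df)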
Created 06-11-2016 12:42 PM
Below is the example I used to access HDFS from SparkR.
-bash-4.1$ hadoop fs -ls /user/hdfs/passwd
-rw-r--r--   3 hdfs hdfs   2296 2016-06-09 16:29 /user/hdfs/passwd
-bash-4.1$
-bash-4.1$ SparkR
> sqlContext <- sparkRSQL.init(sc)
> people <- read.df(sqlContext, "/user/hdfs/passwd", "text")
> head(people)
If you created the Hive table in a non-default location, use the command below to see the underlying HDFS location.
hive> desc extended tablename;
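For example, if that output reports a location such as /apps/custom/warehouse/mytable (the path and storage format here are made up) and the table data is stored as Parquet, the path can be read directly from SparkR (a sketch, assuming sqlContext from sparkRSQL.init(sc)):
> df <- read.df(sqlContext, "/apps/custom/warehouse/mytable", "parquet")   # HDFS location reported by desc extended
> head(df)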
To access Hive through SparkR:
-bash-4.1$ SparkR
> hiveContext <- sparkRHive.init(sc)
> sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
> sql(hiveContext, "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
# Queries can be expressed in HiveQL.
> results <- sql(hiveContext, "FROM src SELECT key, value")
# results is now a DataFrame
> head(results)
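The same hiveContext also works for tables stored outside the default warehouse location, since the Hive metastore already tracks where their data lives; a plain query is enough (tablename below is a placeholder):
> results <- sql(hiveContext, "SELECT * FROM tablename LIMIT 10")
> head(results)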
Created 06-13-2016 11:57 AM
Hi @Aidan Condron, did you try the above steps?
Created 06-09-2016 03:10 PM
You can change other elements of the default configuration by modifying spark-env.sh, including the following:
SPARK_MASTER_PORT / SPARK_MASTER_WEBUI_PORT, to use non-default ports
SPARK_WORKER_CORES, to set the number of cores to use on this machine
SPARK_WORKER_MEMORY, to set how much memory to use (for example 1000MB, 2GB)
SPARK_WORKER_PORT / SPARK_WORKER_WEBUI_PORT
SPARK_WORKER_INSTANCES, to set the number of worker processes per node
SPARK_WORKER_DIR, to set the working directory of worker processes
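If you prefer to stay inside SparkR, some equivalent Spark properties can also be passed when the context is created (a sketch for Spark 1.x SparkR; the master and values shown are arbitrary examples):
> sc <- sparkR.init(master = "yarn-client", sparkEnvir = list(spark.executor.memory = "2g", spark.executor.cores = "2"))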
Created 06-10-2016 11:24 AM
Thanks guys, but those answers aren't quite on point. I suppose the real question is how to access HDFS through SparkR. For example, I know Hive tables are accessible, but if they are not in the default /apps/warehouse/ location, how do I find and read them? Thanks a million!
Created 06-13-2016 12:35 PM
Hello ,
Thanks everyone. As it turned out, some Ambari features were in maintenance mode, which meant there actually was a discrepancy between the discoverable folder structures. Turning off maintenance mode and rebooting did the trick!
Thanks
Aidan