Created 12-28-2016 03:19 PM
Sandbox 2.5 on VirtualBox 5.1.12 on a Windows 10 machine.
I am trying to load a text file using Spark in Scala, and I am not sure where to place the file so that Zeppelin can see it. Is there a good tutorial to familiarize me with file access in Zeppelin? I have an SSH window open at 127.0.0.1:4200 and can access the file system on the VirtualBox VM, but I am not sure where Zeppelin looks when it reads a file. I am not very savvy with Linux, so I am working my way through.
The error I get is:
markFIle: org.apache.spark.rdd.RDD[String] = cdrs.txt MapPartitionsRDD[37] at textFile at <console>:31
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox.hortonworks.com:8020/user/zeppelin/cdrs.txt
I have gone through some of the tutorials but have not seen anything that explains how Zeppelin uses HDFS to read files, versus me using SSH into the VirtualBox VM as root to locate files.
Created 12-28-2016 03:27 PM
In Zeppelin you can use:
%sh
id
pwd
hdfs dfs -ls /user/zeppelin

uid=503(zeppelin) gid=501(hadoop) groups=501(hadoop)
/home/zeppelin
So Zeppelin runs as the zeppelin user. You can either read the file from that user's local file system or store it in HDFS in that user's home directory: /user/zeppelin
Created 12-28-2016 03:34 PM
How do I get a file into that directory? Forgive my inexperience.
Created 12-28-2016 03:51 PM
I suggest following this tutorial. It shows how to load data and copy files:
http://hortonworks.com/hadoop-tutorial/hands-on-tour-of-apache-spark-in-5-minutes/
Created 12-28-2016 11:52 PM
In general, Zeppelin runs on the Zeppelin server machine in the cluster, so it cannot access local files on the user's host machine.
The typical approach is to upload the file into HDFS and then use the HDFS path in the %spark notebook code to read the file with Spark.
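As a rough sketch of that workflow: this assumes cdrs.txt has already been copied onto the sandbox VM and uploaded to the zeppelin user's HDFS home directory, for example with `hdfs dfs -put /tmp/cdrs.txt /user/zeppelin/` run from a %sh paragraph so that it executes as the zeppelin user. The local path /tmp/cdrs.txt and the variable name cdrs are only placeholders. The snippet is meant for a %spark paragraph, where Zeppelin provides `sc`.

```scala
// Run in a Zeppelin %spark paragraph; sc is the SparkContext Zeppelin provides.
// Assumption: cdrs.txt was already uploaded to /user/zeppelin in HDFS.
// A bare "cdrs.txt" resolves to the same HDFS home directory, as the path in
// the error message shows; the full path just makes that explicit.
val cdrs = sc.textFile("hdfs:///user/zeppelin/cdrs.txt")

// textFile is lazy, so a missing file is only reported once an action runs.
cdrs.take(5).foreach(println)
```

If the file should stay on the Zeppelin server's local disk instead, a `file:///` path such as `file:///home/zeppelin/cdrs.txt` may work on a single-node sandbox, provided the zeppelin user can read it.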