In Zeppelin, where do I put a simple text file so that a Spark textFile read can see it?

New Contributor

Sandbox 2.5 on Virtualbox 5.1.12 on a Windows 10 machine.

I am trying to load a text file using Spark in Scala, and I am not sure where to place the file so that Zeppelin can see it. Is there a good tutorial that explains how Zeppelin accesses files? I have an SSH window open at 127.0.0.1:4200 and can access the file system on the VirtualBox VM, but I am not sure where Zeppelin looks when it reads a file. I am not very savvy with Linux, so I am working my way through.

The error I get is:

markFIle: org.apache.spark.rdd.RDD[String] = cdrs.txt MapPartitionsRDD[37] at textFile at <console>:31

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://sandbox.hortonworks.com:8020/user/zeppelin/cdrs.txt

I have gone through some of the tutorials but have not seen anything that explains how Zeppelin uses HDFS to read files, versus how I locate files when logged in to the VirtualBox VM as root over SSH.

4 REPLIES

Rising Star

In Zeppelin you can use:

%sh 
id 
pwd
hdfs dfs -ls /user/zeppelin

uid=503(zeppelin) gid=501(hadoop) groups=501(hadoop)
/home/zeppelin

So you can either use a local file that this user can read, or store the file in HDFS in this user's home directory: /user/zeppelin

New Contributor

How do I get a file into that directory? Forgive my inexperience.

Rising Star

I suggest following this tutorial; it shows how to load data and copy files:

http://hortonworks.com/hadoop-tutorial/hands-on-tour-of-apache-spark-in-5-minutes/

Super Collaborator

In general, Zeppelin runs on the Zeppelin server machine in the cluster, so it cannot access local files on the user's host machine.

The typical approach is to upload the file into HDFS and then use the HDFS path in the %spark notebook code to read the file with Spark.
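
For the upload step, here is a rough sketch of what could go in a %sh paragraph (or an SSH session on the sandbox). It assumes the file has already been copied onto the sandbox VM itself; the path /tmp/cdrs.txt and the helper name put_in_hdfs are only placeholders for illustration, not anything Zeppelin or the sandbox provides.

```shell
# Hypothetical location of the file on the sandbox VM (not on the Windows host).
LOCAL_FILE=/tmp/cdrs.txt
# HDFS home directory of the zeppelin user, as shown in the error message above.
HDFS_DIR=/user/zeppelin

# Copy a local file into an HDFS directory and list the directory afterwards.
put_in_hdfs() {
  hdfs dfs -mkdir -p "$2" &&      # create the target directory if it is missing
  hdfs dfs -put -f "$1" "$2/" &&  # upload, overwriting an older copy if present
  hdfs dfs -ls "$2"               # confirm that the file is now visible in HDFS
}

if [ -f "$LOCAL_FILE" ]; then
  put_in_hdfs "$LOCAL_FILE" "$HDFS_DIR"
else
  echo "Nothing uploaded: $LOCAL_FILE not found on this machine" >&2
fi
```

Once the listing shows cdrs.txt, the %spark paragraph should be able to read it with sc.textFile("/user/zeppelin/cdrs.txt"). Judging by the error message in the question, the relative path "cdrs.txt" resolves to that same HDFS location.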