Created on 12-10-2019 07:09 PM - edited 12-10-2019 07:14 PM
I am running the following commands in Zeppelin. First I created a Hive context with the following code:
val hiveContext = new org.apache.spark.sql.SparkSession.Builder().getOrCreate()
Then I tried to load a file from HDFS with the following code:
val riskFactorDataFrame = spark.read.format("csv").option("header", "true").load("hdfs:///tmp/data/riskfactor1.csv")
But I am getting the following error message: "org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://sandbox-hdp.hortonworks.com:8020/tmp/data/riskfactor1.csv;"
I am quite new to Hadoop. Please help me figure out what I am doing wrong.
Created 12-11-2019 10:50 AM
Spark expects the riskfactor1.csv file to be in the HDFS path /tmp/data/, but it seems you have riskfactor1.csv on your local filesystem at /tmp/data. I have run the steps below on a sandbox.
Please follow the steps below to resolve the "Path does not exist" error. Log on to the CLI on your sandbox as the root user, then:
Switch user to hdfs
[root@sandbox-hdp ~]# su - hdfs
List the current HDFS root directory
[hdfs@sandbox-hdp ~]$ hdfs dfs -ls /
Found 13 items
drwxrwxrwt+ - yarn hadoop 0 2019-10-01 18:34 /app-logs
drwxr-xr-x+ - hdfs hdfs 0 2018-11-29 19:01 /apps
drwxr-xr-x+ - yarn hadoop 0 2018-11-29 17:25 /ats
drwxr-xr-x+ - hdfs hdfs 0 2018-11-29 17:26 /atsv2
drwxr-xr-x+ - hdfs hdfs 0 2018-11-29 17:26 /hdp
drwx------+ - livy hdfs 0 2018-11-29 17:55 /livy2-recovery
drwxr-xr-x+ - mapred hdfs 0 2018-11-29 17:26 /mapred
drwxrwxrwx+ - mapred hadoop 0 2018-11-29 17:26 /mr-history
drwxr-xr-x+ - hdfs hdfs 0 2018-11-29 18:54 /ranger
drwxrwxrwx+ - spark hadoop 0 2019-11-24 22:41 /spark2-history
drwxrwxrwx+ - hdfs hdfs 0 2018-11-29 19:01 /tmp
drwxr-xr-x+ - hdfs hdfs 0 2019-09-21 13:32 /user
Create the directory in HDFS. It usually goes under /user/xxxx depending on the user, but here we create /tmp/data and give it open 777 permissions so any user can run the Spark job.
Create directory in hdfs
$ hdfs dfs -mkdir -p /tmp/data/
Change permissions
$ hdfs dfs -chmod 777 /tmp/data/
Now copy riskfactor1.csv from the local filesystem to HDFS. Here I am assuming the file is in /tmp:
[hdfs@sandbox-hdp tmp]$ hdfs dfs -copyFromLocal /tmp/riskfactor1.csv /tmp/data
The above copies riskfactor1.csv from the local /tmp to the HDFS location /tmp/data. You can validate it by running the command below:
[hdfs@sandbox-hdp ]$ hdfs dfs -ls /tmp/data
Found 1 items
-rw-r--r-- 1 hdfs hdfs 0 2019-12-11 18:40 /tmp/data/riskfactor1.csv
Now you can run your Spark code in Zeppelin; it should succeed.
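For reference, here is a minimal sketch of the Zeppelin paragraph once the file is in HDFS. It assumes the SparkSession that Zeppelin's Spark interpreter provides as the built-in variable spark, so there is no need to build a separate Hive context:
// Zeppelin's Spark interpreter exposes a ready-made SparkSession as `spark`
val riskFactorDataFrame = spark.read.format("csv").option("header", "true").load("hdfs:///tmp/data/riskfactor1.csv")
// Quick sanity checks: print the inferred schema and the first few rows
riskFactorDataFrame.printSchema()
riskFactorDataFrame.show(5)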
Please revert!
Created 12-15-2019 06:55 PM
Thanks a lot for your detailed response. I have attached a screenshot of the Ambari 'Files View'; I want to know whether it shows the local filesystem or the HDFS filesystem. I ran the commands you suggested, but when I tried to run "hdfs dfs -copyFromLocal /tmp/data/riskfactor1.csv /tmp/data" I got the message "copyFromLocal: `/tmp/data/riskfactor1.csv': No such file or directory". I am not sure what I am doing wrong. Thanks again for your help. Eagerly waiting for your response.
Created on 12-16-2019 05:53 AM - edited 12-16-2019 05:55 AM
Thanks for sharing the screenshot. I can see from it that your riskfactor and riskfactor1 are directories, not files!
Can you double-click on either of them and check the contents?
I have mounted an old HDP 2.6.x sandbox for illustration. Whatever filesystem you see under the Ambari Files view is HDFS.
Here is the local filesystem
My Ambari view before the creation of /Bindal/data (the equivalent of /tmp/data)
I created a directory in HDFS
Make the directory; this is the local filesystem
Copy riskfactor1.csv from the local filesystem /tmp/data
Check the copied file in HDFS
To walk through it from the Linux CLI: as the root user, I created a directory /tmp/data and placed riskfactor1.csv in there, then created a directory /Bindal/data/ in HDFS.
I then copied the file from the local Linux box to HDFS. I hope that explains the difference between the local filesystem and HDFS.
Below is another screenshot to show the difference.
Once the file is in HDFS, your Zeppelin job should run successfully. As reiterated, in the screenshot you shared you need to double-click on riskfactor and riskfactor1, which are directories, to see how they differ from my screenshots.
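To see the difference from Spark itself, here is a small sketch; the file:/// variant is only an assumption about where your local copy lives, and a local path must exist on every node that runs an executor:
// HDFS path: resolved against the HDFS namenode (this is what the Zeppelin job needs)
val fromHdfs = spark.read.format("csv").option("header", "true").load("hdfs:///tmp/data/riskfactor1.csv")
// Local path: needs an explicit file:/// scheme, otherwise Spark resolves it against the default filesystem (HDFS)
val fromLocal = spark.read.format("csv").option("header", "true").load("file:///tmp/data/riskfactor1.csv")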
HTH
Created 12-16-2019 07:27 PM
You are awesome; you are trying your best to help me. But I am just a starter, so I am missing the minute details.
I followed the steps you provided, but I still have the same issue. Please see the screenshot. I want to gain a pinch of what you have mastered. Thanks.
Created 12-17-2019 05:48 AM
Do the following steps:
sandbox-hdp login: root
root@sandbox-hdp.hortonworks.com's password:
.....
[root@sandbox-hdp ~]# mkdir -p /tmp/data
[root@sandbox-hdp ~]# cd /tmp/data
Now you should be in /tmp/data; to validate that, run
[root@sandbox-hdp ~]# pwd
Copy your riskfactor1.csv to this directory using a tool like WinSCP or MobaXterm; see my screenshot using WinSCP.
My question is: where is the riskfactor1.csv file located? If that's not clear, you can upload it using the Ambari Files view: first navigate to /bindal/data and then select Upload. Please see the attached screenshot for uploading the file from your laptop.
After a successful upload, you can run your Zeppelin job and keep me posted.
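If the AnalysisException still shows up after the upload, here is a small sketch you can run in the same Zeppelin notebook to check whether HDFS actually sees the file; it assumes the built-in spark session:
import org.apache.hadoop.fs.{FileSystem, Path}
// Ask the cluster's default filesystem (HDFS on the sandbox) whether the path exists
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
println("Exists in HDFS: " + fs.exists(new Path("/tmp/data/riskfactor1.csv")))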