How do I get the full path to load my HDFS file?

Explorer

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/demo/dataset.csv")

This is my code.

I am writing a Scala program and I cannot load my file. The demo is the directory that is inside Hadoop. And dataset.csv is the file that contains data.

I am very new to Hortonworks, so please give a detailed answer.

Super Guru

As you can see from the output of ls /demo, there is no dataset.csv in the /demo folder. Maybe it is in /demo/data. Please check the correct path for the input CSV file.

Can you please try running these commands:

su hdfs
hdfs dfs -ls -R /demo

Master Mentor

@Aishwarya Sudhakar
Your demo directory in HDFS is empty. You will need to copy
dataset.csv to HDFS in /demo.

These are the steps:

Locate dataset.csv; in this example it is in /tmp on the local node.

As user hdfs:

$ hdfs dfs -mkdir /demo

Copy dataset.csv to HDFS:

$ hdfs dfs  -put  /tmp/dataset.csv /demo

Make sure the user running Spark has the correct permissions. Otherwise, change the owner, where xxx is the user running Spark:

$ hdfs dfs -chown   xxx:hdfs  /demo

Now run your Spark job.

Hope that helps

Explorer

@Geoffrey Shelton Okot

Maybe, but when I type this command I can see the data that is in the file:

 hadoop fs -cat demo/dataset.csv


Master Mentor

@Aishwarya Sudhakar

Yes, that validates that the file you copied has the desired data. You forgot the / before demo:

$ hdfs dfs -cat /demo/dataset.csv

Hope that helps

Explorer

@Geoffrey Shelton Okot

Yes, you are right. If I type the command you sent, I cannot view the file; it says the file does not exist.

I am a beginner with Spark. Can you please tell me in detail what I should do now?

It will be very useful for my project.

Super Guru

@Aishwarya Sudhakar,

There is a difference between /demo/dataset.csv and demo/dataset.csv. The leading slash matters. If -cat demo/dataset.csv prints the file contents, then you have to use the same path in your Spark Scala code.

Change your code as shown below:

 val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("demo/dataset.csv")

Aditya

Explorer

@Aditya Sirna

@Geoffrey Shelton Okot

Yes, OK. Now can you tell me how I should change this command?

scala> val data = MLUtils.loadLibSVMFile(sc, "demo/dataset.csv")
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/spark/bin/demo/dataset.csv
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:983)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.reduce(RDD.scala:965)
  at org.apache.spark.mllib.util.MLUtils$.computeNumFeatures(MLUtils.scala:92)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:81)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:138)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:146)
  ... 48 elided


@Aishwarya Sudhakar

You need to understand that the path you mention and the path you are trying to access are different.

// This means there is a directory named demo under the root, containing a file named dataset.csv
/demo/dataset.csv

// This means there is a directory named demo in your user (home) directory, containing a file named dataset.csv
demo/dataset.csv

Now try the following in your terminal to get your username:

whoami

Now use the output of this command to reach your dataset.csv file. You will see that

hadoop fs -cat demo/dataset.csv

is equivalent to

hadoop fs -cat /user/<your username>/demo/dataset.csv

You can verify that using the ls command on these directories:

hadoop fs -ls demo
hadoop fs -ls /user/ash/demo
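
To make the resolution rule concrete, here is a minimal sketch of it as a shell function. resolve_hdfs_path is a hypothetical helper, not a Hadoop command, and it assumes the default HDFS home directory layout of /user/<username>:

```shell
# Hypothetical helper (not part of Hadoop): mimics how an HDFS path is
# turned into an absolute one. Assumes home directories live under /user.
resolve_hdfs_path() {
  local path="$1"
  local user="${2:-$(whoami)}"   # default to the current OS user
  case "$path" in
    /*) printf '%s\n' "$path" ;;                   # already absolute
    *)  printf '/user/%s/%s\n' "$user" "$path" ;;  # relative to the home dir
  esac
}

# "ash" is the example username used in this thread.
resolve_hdfs_path demo/dataset.csv ash
resolve_hdfs_path /demo/dataset.csv ash
```

For user ash, demo/dataset.csv becomes /user/ash/demo/dataset.csv, while /demo/dataset.csv is left unchanged. That is why the two -cat commands in this thread look at different locations.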

Now to access these file(s), use the correct directory reference.

scala> val data = MLUtils.loadLibSVMFile(sc, "demo/dataset.csv")

Let me know if that helps!

Master Mentor

@Aishwarya Sudhakar

Could you clarify which username you are running Spark under?

Because Spark is distributed, you should copy dataset.csv to an HDFS user directory that is accessible to the user running the Spark job.

According to your output above, your file is at the HDFS path /demo/demo/dataset.csv, so your load should look like this:

.load("hdfs:///demo/demo/dataset.csv")
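
The stack trace earlier in the thread reports the path as file:/usr/local/spark/bin/demo/dataset.csv, which suggests the relative path was resolved against the local filesystem rather than HDFS. A fully qualified URI removes that ambiguity. As a rough sketch, the default filesystem can be read with hdfs getconf -confKey fs.defaultFS and prepended to the absolute path. Both qualify_hdfs_path and the address namenode.example.com:8020 below are made-up placeholders, not values from this cluster:

```shell
# Hypothetical helper: prepend the cluster's default filesystem URI to an
# absolute HDFS path. On a real cluster the first argument would come from:
#   hdfs getconf -confKey fs.defaultFS
qualify_hdfs_path() {
  local default_fs="${1%/}"   # drop a single trailing slash, if present
  local abs_path="$2"
  case "$abs_path" in
    /*) printf '%s%s\n' "$default_fs" "$abs_path" ;;
    *)  echo "expected an absolute path, got: $abs_path" >&2; return 1 ;;
  esac
}

# Placeholder namenode address; substitute the value your cluster reports.
qualify_hdfs_path "hdfs://namenode.example.com:8020" "/demo/demo/dataset.csv"
```

The resulting string can be passed to .load(...) in place of the bare path. Keep in mind that MLUtils.loadLibSVMFile expects LIBSVM-formatted text, so a CSV file may still fail to parse even once the path is correct.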

You said: "The demo is the directory that is inside Hadoop. And dataset.csv is the file that contains data." Did you mean in HDFS?

Does this command print anything?

$ hdfs dfs -cat  /demo/demo/dataset.csv

Please reply!