How do I get the full path to load my HDFS file?
Created on ‎04-01-2018 05:10 AM - edited ‎09-16-2022 06:03 AM
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/demo/dataset.csv")
This is my code.
I am writing a Scala program and I could not load my file. demo is the directory inside Hadoop, and dataset.csv is the file that contains the data.
I am very new to Hortonworks, so please kindly give a detailed answer.
Created ‎04-01-2018 09:51 AM
As you can see from the output of ls /demo, there is no dataset.csv in the /demo folder. Maybe it is in /demo/data. Please check the correct path for the input CSV file.
Can you please try running these commands:
su hdfs
hdfs dfs -ls -R /demo
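If it is easier to check from the spark-shell, a minimal sketch using the Hadoop FileSystem API (assuming an active SparkContext sc) does the same listing:

import org.apache.hadoop.fs.{FileSystem, Path}

// Mirrors hdfs dfs -ls -R /demo: list everything under /demo recursively
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listFiles(new Path("/demo"), true)  // true = recursive
while (files.hasNext) println(files.next().getPath)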
Created ‎04-01-2018 10:07 AM
@Aishwarya Sudhakar
Your demo directory in HDFS is empty; you will need to copy dataset.csv to /demo in HDFS.
These are the steps:
Locate dataset.csv; in this example it is in /tmp on the local node.
As the hdfs user, create the directory:
$ hdfs dfs -mkdir /demo
Copy dataset.csv to HDFS:
$ hdfs dfs -put /tmp/dataset.csv /demo
Make sure the user running Spark has the correct permissions; otherwise, change the owner, where xxx is the user running Spark:
$ hdfs dfs -chown xxx:hdfs /demo
Now run your Spark job.
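After the copy, the original load should work. If the default filesystem is ever in doubt, an explicit hdfs:// scheme pins it down; a sketch, with the same options as the question's code:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs:///demo/dataset.csv")  // or "/demo/dataset.csv" when HDFS is the default filesystem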
Hope that helps
Created ‎04-01-2018 10:32 AM
Yes, to validate that the file you copied has the desired data. You forgot the / before demo:
$ hdfs dfs -cat /demo/dataset.csv
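The same check can be run from the spark-shell; a minimal sketch, assuming the file really is at /demo/dataset.csv:

// Print the first few lines, mirroring hdfs dfs -cat
sc.textFile("hdfs:///demo/dataset.csv").take(5).foreach(println)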
Hope that helps
Created ‎04-01-2018 10:37 AM
Yes, you are right. If I type the command you sent, I am not able to view the file; it says the file does not exist.
I am new to Spark, so can you please tell me in detail what I should do now?
It will be very useful for my project.
Created ‎04-01-2018 10:45 AM
There is a difference between /demo/dataset.csv and demo/dataset.csv; the leading slash matters. If -cat demo/dataset.csv gives you the file output, then you have to use the same path in your Spark Scala code.
Change your code as below:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("demo/dataset.csv")
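To make the difference concrete, a sketch with both forms side by side (the username ash is just for illustration; substitute your own):

// Relative path: resolved under the HDFS home directory of the user running Spark
val dfRelative = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("demo/dataset.csv")    // for user ash this is /user/ash/demo/dataset.csv

// Absolute path: resolved from the HDFS root, independent of the user
val dfAbsolute = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/demo/dataset.csv")   // a different location unless the two happen to coincide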
Aditya
Created ‎04-01-2018 03:24 PM
Yes, OK. Now can you tell me how I should change this command?
scala> val data = MLUtils.loadLibSVMFile(sc, "demo/dataset.csv")
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/spark/bin/demo/dataset.csv
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  (the previous five RDD.partitions/MapPartitionsRDD frames repeat several times)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:983)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.reduce(RDD.scala:965)
  at org.apache.spark.mllib.util.MLUtils$.computeNumFeatures(MLUtils.scala:92)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:81)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:138)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:146)
  ... 48 elided
Created ‎04-01-2018 03:53 PM
You need to understand that the directory you are referencing and the one you are actually accessing are different.
// This means there is a directory named demo under the root directory that has a file named dataset.csv:
/demo/dataset.csv

// This means there is a directory named demo under the user's home directory that has a file named dataset.csv:
demo/dataset.csv
Now, try the following on your terminal to get your username:
whoami
Now use the output of this command to reach your dataset.csv file. You will realize that
hadoop fs -cat demo/dataset.csv
is the same as
hadoop fs -cat /user/<your username>/demo/dataset.csv
You can verify that by using the ls command on these directories:
hadoop fs -ls demo
hadoop fs -ls /user/ash/demo
Now to access these file(s), use the correct directory reference.
scala> val data = MLUtils.loadLibSVMFile(sc, "demo/dataset.csv")
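If you want to see exactly where that relative path resolves, a small sketch using the Hadoop Path API (assuming an active SparkContext sc in the spark-shell):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// makeQualified expands a relative path against the default filesystem URI and
// the working directory, which is your HDFS home (e.g. /user/<your username>)
val qualified = new Path("demo/dataset.csv").makeQualified(fs.getUri, fs.getWorkingDirectory)
println(s"Resolves to: $qualified, exists: ${fs.exists(qualified)}")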
Let me know if that helps!
Created ‎04-01-2018 05:15 PM
Could you clarify which username you are running Spark under?
Because of its distributed nature, you should copy dataset.csv to an HDFS directory that is accessible to the user running the Spark job.
According to your output above, your file is at /demo/demo/dataset.csv in HDFS, so your load should look like this:
.load("hdfs:///demo/demo/dataset.csv")
This is what you said: "demo is the directory inside Hadoop, and dataset.csv is the file that contains the data." Did you mean in HDFS?
Does this command print anything?
$ hdfs dfs -cat /demo/demo/dataset.csv
Please revert!
