Created on 04-01-2018 05:10 AM - edited 09-16-2022 06:03 AM
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/demo/dataset.csv")
This is my code.
I am writing a Scala program and I could not load my file. The demo directory is inside Hadoop, and dataset.csv is the file that contains the data.
I am very new to Hortonworks, so please kindly give a detailed answer.
Created 04-01-2018 09:51 AM
As you can see from the output of ls /demo, there is no dataset.csv in the /demo folder. Maybe it is in /demo/data. Please check the correct path for the input CSV file.
Can you please try running these commands:
su hdfs
hdfs dfs -ls -R /demo
Created 04-01-2018 10:07 AM
@Aishwarya Sudhakar
Your demo directory in HDFS is empty, so you will need to copy dataset.csv to /demo in HDFS.
These are the steps:
Locate dataset.csv; in this example it is in /tmp on the local node.
As the hdfs user:
$ hdfs dfs -mkdir /demo
Copy dataset.csv to HDFS:
$ hdfs dfs -put /tmp/dataset.csv /demo
Make sure the user running Spark has the correct permissions; otherwise, change the owner (where xxx is the user running Spark):
$ hdfs dfs -chown xxx:hdfs /demo
Now run your Spark job; a minimal load after the copy is sketched below.
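For example, in spark-shell (a minimal sketch, assuming the spark-csv package is on the classpath and HDFS is the default filesystem):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/demo/dataset.csv") // absolute HDFS path, matching the -put target above
df.show(5) // quick sanity check that the data loaded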
Hope that helps
Created 04-01-2018 10:32 AM
Yes, to validate that the file you copied has the desired data. You forgot the / before demo:
$ hdfs dfs -cat /demo/dataset.csv
Hope that helps
Created 04-01-2018 10:37 AM
Yes, you are right. If I type the command you sent, I am not able to view the file; it says the file does not exist.
I am new to Spark; can you please tell me in detail what I should do now?
It will be very useful for my project.
Created 04-01-2018 10:45 AM
There is a difference between /demo/dataset.csv and demo/dataset.csv; the leading slash matters. If -cat demo/dataset.csv gives you the file output, then you have to use the same path in your Spark Scala code.
Change your code as below:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("demo/dataset.csv")
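To make the distinction concrete (a sketch, assuming the default filesystem is HDFS):

// Relative path: resolves to /user/<your username>/demo/dataset.csv in HDFS
val dfRelative = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("demo/dataset.csv")
// Absolute path: resolves to /demo/dataset.csv at the HDFS root
val dfAbsolute = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/demo/dataset.csv")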
Aditya
Created 04-01-2018 03:24 PM
Yes, OK. Now can you tell me how I should change this command?
scala> val data = MLUtils.loadLibSVMFile(sc, "demo/dataset.csv")
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/spark/bin/demo/dataset.csv
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  (the preceding five RDD.partitions frames repeat several times)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:983)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.reduce(RDD.scala:965)
  at org.apache.spark.mllib.util.MLUtils$.computeNumFeatures(MLUtils.scala:92)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:81)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:138)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:146)
  ... 48 elided
Created 04-01-2018 03:53 PM
You need to understand that the path you are writing and the path you are actually accessing are different.
// This means there is a directory named demo at the root, containing a file named dataset.csv
/demo/dataset.csv
// This means there is a directory named demo in the user's home directory, containing a file named dataset.csv
demo/dataset.csv
Now, try the following on your terminal to get your username:
whoami
Now use the output of this command to reach your dataset.csv file. You will realize that
hadoop fs -cat demo/dataset.csv
is similar to
hadoop fs -cat /user/<your username>/demo/dataset.csv
You can evaluate that using the ls command on these directories.
hadoop fs -ls demo
hadoop fs -ls /user/ash/demo
Now to access these file(s), use the correct directory reference.
scala> val data = MLUtils.loadLibSVMFile(sc, "demo/dataset.csv")
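Note that the error above shows the relative path being resolved against the local filesystem (file:/usr/local/spark/bin/...). If the relative HDFS path still fails, a fully qualified URI removes the ambiguity (a sketch; replace <your username> with the output of whoami):

scala> import org.apache.spark.mllib.util.MLUtils
scala> val data = MLUtils.loadLibSVMFile(sc, "hdfs:///user/<your username>/demo/dataset.csv")

Also be aware that loadLibSVMFile expects LibSVM-formatted input (label index:value pairs); for a plain CSV, the spark-csv reader shown earlier in the thread is the better fit.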
Let me know if that helps!
Created 04-01-2018 05:15 PM
Could you clarify which username you are running Spark under?
Because Spark is distributed, you should copy dataset.csv to an HDFS directory that is accessible to the user running the Spark job.
According to your output above, your file is at the HDFS path /demo/demo/dataset.csv, so your load should look like this:
load "hdfs:////demo/demo/dataset.csv"
This is what you said: "The demo directory is inside Hadoop, and dataset.csv is the file that contains the data." Did you mean in HDFS?
Does this command print anything?
$ hdfs dfs -cat /demo/demo/dataset.csv
Please revert!