Created on 04-01-2018 05:10 AM - edited 09-16-2022 06:03 AM
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/demo/dataset.csv")
This is my code.
I am writing a Scala program and I could not load my file. The demo directory is inside Hadoop, and dataset.csv is the file that contains the data.
I am very new to Hortonworks, so please kindly give a detailed answer.
Created 04-01-2018 09:51 AM
As you can see from the output of ls /demo, there is no dataset.csv in the /demo folder. Maybe it is in /demo/data. Please check the correct path for the input CSV file.
Can you please try running these commands:
su hdfs
hdfs dfs -ls -R /demo
Created 04-01-2018 10:07 AM
@Aishwarya Sudhakar
Your demo directory in HDFS is empty, so you will need to copy dataset.csv to /demo in HDFS.
These are the steps:
Locate dataset.csv; in this example it is in /tmp on the local node.
As the hdfs user:
$ hdfs dfs -mkdir /demo
Copy dataset.csv to HDFS:
$ hdfs dfs -put /tmp/dataset.csv /demo
Make sure the user running Spark has the correct permissions; otherwise, change the owner (where xxx is the user running Spark):
$ hdfs dfs -chown xxx:hdfs /demo
Now run your Spark job; a minimal load after the copy is sketched below.
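For example, in spark-shell (a minimal sketch, assuming the spark-csv package is on the classpath and HDFS is the default filesystem):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/demo/dataset.csv") // absolute HDFS path, matching the -put target above
df.show(5) // quick sanity check that the data loaded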
Hope that helps
Created 04-01-2018 10:32 AM
Yes, to validate that the file you copied has the desired data. You forgot the / before demo:
$ hdfs dfs -cat /demo/dataset.csv
Hope that helps
Created 04-01-2018 10:37 AM
Yes, you are right. If I type the command you sent, I am not able to view the file; it says the file does not exist.
I am new to Spark; can you please tell me in detail what I should do now?
It will be very useful for my project.
Created 04-01-2018 10:45 AM
There is a difference between /demo/dataset.csv and demo/dataset.csv; the leading slash matters. If -cat demo/dataset.csv gives you the file output, then you have to use the same path in your Spark Scala code.
Change your code as below:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("demo/dataset.csv")
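To make the distinction concrete (a sketch, assuming the default filesystem is HDFS):

// Relative path: resolves to /user/<your username>/demo/dataset.csv in HDFS
val dfRelative = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("demo/dataset.csv")
// Absolute path: resolves to /demo/dataset.csv at the HDFS root
val dfAbsolute = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/demo/dataset.csv")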
Aditya
Created 04-01-2018 03:24 PM
Yes, OK. Now can you tell me how I should change this command?
scala> val data = MLUtils.loadLibSVMFile(sc, "demo/dataset.csv")
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/spark/bin/demo/dataset.csv
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  (the preceding five RDD.partitions frames repeat several times)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:983)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.reduce(RDD.scala:965)
  at org.apache.spark.mllib.util.MLUtils$.computeNumFeatures(MLUtils.scala:92)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:81)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:138)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:146)
  ... 48 elided
Created 04-01-2018 03:53 PM
You need to understand that the path you are writing and the path you are actually accessing are different.
// This means there is a directory named demo at the root, containing a file named dataset.csv
/demo/dataset.csv
// This means there is a directory named demo in the user's home directory, containing a file named dataset.csv
demo/dataset.csv
Now, try the following on your terminal to get your username:
whoami
Now use the output of this command to reach your dataset.csv file. You will realize that
hadoop fs -cat demo/dataset.csv
is similar to
hadoop fs -cat /user/<your username>/demo/dataset.csv
You can evaluate that using the ls command on these directories.
hadoop fs -ls demo
hadoop fs -ls /user/ash/demo
Now to access these file(s), use the correct directory reference.
scala> val data = MLUtils.loadLibSVMFile(sc, "demo/dataset.csv")
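Note that the error above shows the relative path being resolved against the local filesystem (file:/usr/local/spark/bin/...). If the relative HDFS path still fails, a fully qualified URI removes the ambiguity (a sketch; replace <your username> with the output of whoami):

scala> import org.apache.spark.mllib.util.MLUtils
scala> val data = MLUtils.loadLibSVMFile(sc, "hdfs:///user/<your username>/demo/dataset.csv")

Also be aware that loadLibSVMFile expects LibSVM-formatted input (label index:value pairs); for a plain CSV, the spark-csv reader shown earlier in the thread is the better fit.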
Let me know if that helps!
Created 04-01-2018 05:15 PM
Could you clarify which username you are running Spark under?
Because Spark is distributed, you should copy dataset.csv to an HDFS directory that is accessible to the user running the Spark job.
According to your output above, your file is at the HDFS path /demo/demo/dataset.csv, so your load should look like this:
load "hdfs:////demo/demo/dataset.csv"
This is what you said: "The demo directory is inside Hadoop, and dataset.csv is the file that contains the data." Did you mean in HDFS?
Does this command print anything?
$ hdfs dfs -cat /demo/demo/dataset.csv
Please revert!