Support Questions

How do I get the full path to load my HDFS file?

Explorer

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/demo/dataset.csv")

This is my code.

I am writing a Scala program and I could not load my file. demo is a directory inside Hadoop (HDFS), and dataset.csv is the file that contains the data.

I am very new to Hortonworks, so please kindly give a detailed answer for this.

19 REPLIES

Mentor

@Aishwarya Sudhakar

The load should point to the HDFS location:

load("hdfs:///demo/dataset.csv")

Hope that helps

Explorer

I tried that, but the error is still there.

Explorer

@Geoffrey Shelton Okot

I tried that, but I still get the error.

@Aishwarya Sudhakar,

Can you please paste the output of the commands below?

su hdfs
hdfs dfs -ls /demo

Also try giving load("hdfs://{namenodehost}:8020/demo/dataset.csv")
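If you are not sure of the NameNode host and port, you can read the configured default filesystem from inside spark-shell and build the path from it. A minimal sketch (assuming sc and sqlContext are available, as in your session; the /demo/dataset.csv path is only an example):

// Print the configured default filesystem, e.g. hdfs://<namenode-host>:8020
val defaultFs = sc.hadoopConfiguration.get("fs.defaultFS")
println(defaultFs)

// Build the full load path against it
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load(s"$defaultFs/demo/dataset.csv")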

Explorer

@Aditya Sirna

[root@sandbox ~]# su hdfs
[hdfs@sandbox root]$ hdfs dfs -ls /demo
Found 1 items
drwx------ - hdfs hdfs 0 2014-12-16 19:48 /demo/data

Explorer

@Aditya Sirna

What do you mean by namenodehost here?

[root@sandbox bin]# ./spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/04/01 06:19:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/01 06:20:19 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/04/01 06:20:27 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.177.129:4041
Spark context available as 'sc' (master = local[*], app id = local-1522563621246).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/
         
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 

This is how I get into the Scala shell. Is there anything wrong with the Scala installation? Because if you look above, there is a line saying "Unable to load native-hadoop library for your platform...".

The error which I get is:

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("hdfs://demo/dataset.csv")
18/04/01 06:04:59 WARN DataSource: Error while looking for metadata directory.
java.lang.IllegalArgumentException: java.net.UnknownHostException: demo
  at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
  at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
  at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
  at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
  at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
  at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:352)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
  ... 48 elided
Caused by: java.net.UnknownHostException: demo
  ... 70 more

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("hdfs:///demo/dataset.csv")
18/04/01 06:05:35 WARN DataSource: Error while looking for metadata directory.
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///demo/dataset.csv
  at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:352)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
  ... 48 elided


Explorer

@Aditya Sirna

val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("hdfs:///demo/data.csv")
18/04/01 06:45:02 WARN DataSource: Error while looking for metadata directory.
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///demo/data.csv
  at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:352)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
  ... 48 elided

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("hdfs:///demo/data")
18/04/01 06:45:22 WARN DataSource: Error while looking for metadata directory.
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///demo/data
  at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:352)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
  ... 48 elided


I tried using this command but I am still getting errors.

Explorer

@Aditya Sirna

Which one should I use? My actual data is in dataset.csv, not the other file.

[root@sandbox bin]# hadoop fs -ls /demo/
Found 1 items
drwx------   - hdfs hdfs          0 2014-12-1...

Explorer
[root@sandbox ~]# su hdfs
[hdfs@sandbox root]$ hdfs dfs -ls /demo
Found 1 items
drwx------   - hdfs hdfs          0 2014-12-16 19:48 /demo/data

As you can see from the output of ls /demo, there is no dataset.csv in the /demo folder. Maybe it is in /demo/data. Please check the correct path for the input CSV file.

Can you please try running these commands:

su hdfs
hdfs dfs -ls -R /demo
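If it is easier, the same check can also be done from inside spark-shell using the Hadoop FileSystem API. A rough sketch (assuming sc is available):

import org.apache.hadoop.fs.{FileSystem, Path}

// Recursively list everything under /demo to find where the CSV actually is
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listFiles(new Path("/demo"), true)  // true = recursive
while (files.hasNext) println(files.next().getPath)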


Mentor

@Aishwarya Sudhakar
Your /demo directory in HDFS does not have dataset.csv in it; you will need to copy
dataset.csv to HDFS under /demo.

These are the steps:

Locate the dataset.csv; in this example it is in /tmp on the local node.

As user hdfs

$ hdfs dfs -mkdir /demo

Copy the dataset.csv to HDFS:

$ hdfs dfs  -put  /tmp/dataset.csv /demo

Make sure the user running Spark has the correct permissions; otherwise,

change the owner, where xxx is the user running Spark:

$ hdfs dfs -chown   xxx:hdfs  /demo

Now run your Spark job.
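For example, once the file is in /demo, the read from spark-shell could look like this (a sketch, assuming fs.defaultFS points at your NameNode so the path is resolved against HDFS):

// Read the CSV that was put into /demo and do a quick sanity check
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/demo/dataset.csv")
df.show(5)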

Hope that helps

Explorer

@Geoffrey Shelton Okot

Maybe, but when I type this command I can see the data that is in the file:

 hadoop fs -cat demo/dataset.csv


Mentor

@Aishwarya Sudhakar

Yes, that is to validate that the file you copied has the desired data. But you forgot the / before demo:

$ hdfs dfs -cat /demo/dataset.csv

Hope that helps

Explorer

@Geoffrey Shelton Okot

Yes, you are right. If I type the command you sent, I am not able to view the file; it says the file does not exist.

I am new to Spark, so can you please tell me in detail what I should do now?

It will be very useful for my project.

@Aishwarya Sudhakar,

There is a difference between /demo/dataset.csv and demo/dataset.csv; the slash makes a difference. If -cat demo/dataset.csv gives you the file output, then you have to use the same path in your Spark Scala code.

Change your code like below

 val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("demo/dataset.csv")


Aditya

Explorer

@Aditya Sirna

@Geoffrey Shelton Okot

Yes, OK. Now can you tell me how I should change this command?

scala> val data = MLUtils.loadLibSVMFile(sc, "demo/dataset.csv")
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/usr/local/spark/bin/demo/dataset.csv
  at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
  at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:246)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
  at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:983)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
  at org.apache.spark.rdd.RDD.reduce(RDD.scala:965)
  at org.apache.spark.mllib.util.MLUtils$.computeNumFeatures(MLUtils.scala:92)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:81)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:138)
  at org.apache.spark.mllib.util.MLUtils$.loadLibSVMFile(MLUtils.scala:146)
  ... 48 elided


@Aishwarya Sudhakar

You need to understand that the directory you are mentioning and the one you are actually trying to access are different.

// This means there is a directory named demo under the root directory, containing a file named dataset.csv:

/demo/dataset.csv

// This means there is a directory named demo inside the user's HDFS home directory, containing a file named dataset.csv:

demo/dataset.csv

Now, try the following on your terminal to get your username:

whoami

Now use the output of this command to reach your dataset.csv file. You will realize that

hadoop fs -cat demo/dataset.csv

is similar to

hadoop fs -cat /user/<your username>/demo/dataset.csv

You can evaluate that using the ls command on these directories.

hadoop fs -ls demo
hadoop fs -ls /user/ash/demo
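To see the same resolution programmatically, here is a small sketch using the Hadoop FileSystem API from spark-shell (assuming sc is available):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
// The HDFS home directory that relative paths are resolved against, e.g. /user/<your username>
println(fs.getHomeDirectory)
// The fully qualified form of the relative path demo/dataset.csv
println(fs.makeQualified(new Path("demo/dataset.csv")))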

Now to access these file(s), use the correct directory reference.

scala> val data = MLUtils.loadLibSVMFile(sc, "demo/dataset.csv")

Let me know if that helps!

Mentor

@Aishwarya Sudhakar

Could you clarify which username you are running Spark under?

Because of its distributed nature, you should copy dataset.csv to an HDFS directory that is accessible to the user running the Spark job.

According to your output above, your file is in the HDFS directory /demo/demo/dataset.csv, so your load should look like this:

load("hdfs:///demo/demo/dataset.csv")

This is what you said: "demo is a directory inside Hadoop, and dataset.csv is the file that contains the data." Did you mean in HDFS?

Does this command print anything?

$ hdfs dfs -cat  /demo/demo/dataset.csv

Please revert!