Created on 04-01-2018 05:10 AM - edited 09-16-2022 06:03 AM
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("/demo/dataset.csv")
This is my code.
I am writing a Scala program and I cannot load my file. demo is a directory inside Hadoop, and dataset.csv is the file that contains the data.
I am very new to Hortonworks, so please give a detailed answer.
Created 04-01-2018 07:26 AM
The load should point to the HDFS location:
load("hdfs:///demo/dataset.csv")
Hope that helps
Created 04-01-2018 07:29 AM
I tried, but the error is still there.
Created 04-01-2018 07:28 AM
@Geoffrey Shelton Okot
I tried that, but I still get an error.
Created 04-01-2018 08:12 AM
Can you please paste the output of the commands below?

su hdfs
hdfs dfs -ls /demo

Also try giving load("hdfs://{namenodehost}:8020/demo/dataset.csv")
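For example, a sketch with a fully qualified URI; the host below is a placeholder, substitute your cluster's namenode (on the HDP sandbox this is typically sandbox.hortonworks.com, with the namenode on port 8020):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://sandbox.hortonworks.com:8020/demo/dataset.csv")  // placeholder host:port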
Created 04-01-2018 08:28 AM
[root@sandbox ~]# su hdfs
[hdfs@sandbox root]$ hdfs dfs -ls /demo
Found 1 items
drwx------ - hdfs hdfs 0 2014-12-16 19:48 /demo/data
Created 04-01-2018 03:24 PM
What is the namenode host you mean here?
[root@sandbox bin]# ./spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
18/04/01 06:19:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/04/01 06:20:19 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/04/01 06:20:27 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
Spark context Web UI available at http://192.168.177.129:4041
Spark context available as 'sc' (master = local[*], app id = local-1522563621246).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
This is how I got into the Scala shell. Is there anything wrong with the installation? If you look above, there is a line saying "Unable to load native-hadoop library for your platform".
The error which I get is:
scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("hdfs://demo/dataset.csv")
18/04/01 06:04:59 WARN DataSource: Error while looking for metadata directory.
java.lang.IllegalArgumentException: java.net.UnknownHostException: demo
  at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
  at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
  at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
  at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
  at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
  at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:352)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
  ... 48 elided
Caused by: java.net.UnknownHostException: demo
  ... 70 more

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("hdfs:///demo/dataset.csv")
18/04/01 06:05:35 WARN DataSource: Error while looking for metadata directory.
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///demo/dataset.csv
  at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:352)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
  ... 48 elided
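Reading the two stack traces together: with hdfs://demo/... Spark parses "demo" as a namenode hostname, and with hdfs:///... there is no host at all, so it falls back on fs.defaultFS, which is evidently not set to an hdfs: URI in this shell. A quick way to see which default filesystem the shell actually picked up (a sketch to run inside spark-shell):

// Prints the default filesystem URI from the Hadoop configuration.
// On a configured HDP sandbox this would be something like hdfs://sandbox.hortonworks.com:8020
// (hypothetical value); if it prints file:/// the shell is not picking up the cluster's core-site.xml.
println(sc.hadoopConfiguration.get("fs.defaultFS"))

Whatever host:port this prints is what belongs in the hdfs:// URI.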
Created 04-01-2018 03:24 PM
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("hdfs:///demo/data.csv")
18/04/01 06:45:02 WARN DataSource: Error while looking for metadata directory.
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///demo/data.csv
  at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:352)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
  ... 48 elided

scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("hdfs:///demo/data")
18/04/01 06:45:22 WARN DataSource: Error while looking for metadata directory.
java.io.IOException: Incomplete HDFS URI, no host: hdfs:///demo/data
  at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:352)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
  ... 48 elided
I tried using this command, but I am still getting errors.
Created 04-01-2018 03:24 PM
Which one should I use? My actual data is in dataset.csv, not the other.
[root@sandbox bin]# hadoop fs -ls /demo/
Found 1 items
drwx------   - hdfs hdfs          0 2014-12-1...
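Note that the listings show only a directory /demo/data and no dataset.csv, so the file may simply not exist at the path being loaded. A quick existence check from inside spark-shell (a sketch using the Hadoop FileSystem API; the path is the one from this thread):

import org.apache.hadoop.fs.{FileSystem, Path}

// Resolves against whatever filesystem fs.defaultFS points to
val fs = FileSystem.get(sc.hadoopConfiguration)
println(fs.exists(new Path("/demo/dataset.csv")))  // false would mean the CSV was never uploaded to HDFS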
Created 04-01-2018 08:26 AM
[root@sandbox ~]# su hdfs
[hdfs@sandbox root]$ hdfs dfs -ls /demo
Found 1 items
drwx------   - hdfs hdfs          0 2014-12-16 19:48 /demo/data