spark-shell directory lookup failure

Explorer

Hi,

I'm a newbie in Spark. Please help me. I'm trying to run a simple script in spark-shell:

import org.apache.spark.SparkFiles;
val inFile = sc.textFile(SparkFiles.get("test.data"));
inFile.first();
but on inFile.first() I get an exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hdp-7:8020/tmp/spark-60b9bde7-d198-4a90-8f90-02e9cf77fa04/test.data

There is no such directory in HDFS, but I do have the directory /tmp/spark-60b9bde7-d198-4a90-8f90-02e9cf77fa04 on the local filesystem, with 0 files inside.

I suspect the trouble is in the spark-shell startup - I see this line in the startup log:

15/05/15 16:08:24 INFO HttpFileServer: HTTP File server directory is /tmp/spark-d0ea3c3a-db92-43de-bc3d-6e6a6fd415f2

It seems the work directory is created locally, but when I try to access the RDD, Spark tries to read it from HDFS and fails with the same InvalidInputException as above.

Cloudera Express 5.3.2; Spark was installed as a YARN application via the Cloudera Manager console.

Full log below:
[root@hdp-16 ~]# spark-shell
15/05/15 16:08:19 INFO SecurityManager: Changing view acls to: root
15/05/15 16:08:19 INFO SecurityManager: Changing modify acls to: root
15/05/15 16:08:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/05/15 16:08:19 INFO HttpServer: Starting HTTP Server
15/05/15 16:08:19 INFO Utils: Successfully started service 'HTTP class server' on port 39187.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0-SNAPSHOT
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
15/05/15 16:08:24 INFO SecurityManager: Changing view acls to: root
15/05/15 16:08:24 INFO SecurityManager: Changing modify acls to: root
15/05/15 16:08:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/05/15 16:08:24 INFO Slf4jLogger: Slf4jLogger started
15/05/15 16:08:24 INFO Remoting: Starting remoting
15/05/15 16:08:24 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@hdp-16:51885]
15/05/15 16:08:24 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkDriver@hdp-16:51885]
15/05/15 16:08:24 INFO Utils: Successfully started service 'sparkDriver' on port 51885.
15/05/15 16:08:24 INFO SparkEnv: Registering MapOutputTracker
15/05/15 16:08:24 INFO SparkEnv: Registering BlockManagerMaster
15/05/15 16:08:24 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20150515160824-7963
15/05/15 16:08:24 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/05/15 16:08:24 INFO HttpFileServer: HTTP File server directory is /tmp/spark-d0ea3c3a-db92-43de-bc3d-6e6a6fd415f2
15/05/15 16:08:24 INFO HttpServer: Starting HTTP Server
15/05/15 16:08:24 INFO Utils: Successfully started service 'HTTP file server' on port 33870.
15/05/15 16:08:25 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/05/15 16:08:25 INFO SparkUI: Started SparkUI at http://hdp-16:4040
15/05/15 16:08:25 INFO Executor: Using REPL class URI: http://192.168.91.142:39187
15/05/15 16:08:25 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@hdp-16:51885/user/HeartbeatReceiver
15/05/15 16:08:25 INFO NettyBlockTransferService: Server created on 40784
15/05/15 16:08:25 INFO BlockManagerMaster: Trying to register BlockManager
15/05/15 16:08:25 INFO BlockManagerMasterActor: Registering block manager localhost:40784 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 40784)
15/05/15 16:08:25 INFO BlockManagerMaster: Registered BlockManager
15/05/15 16:08:26 INFO EventLoggingListener: Logging events to hdfs://hdp-7:8020/user/spark/applicationHistory/local-1431691705159
15/05/15 16:08:26 INFO SparkILoop: Created spark context..
Spark context available as sc.

scala> import org.apache.spark.SparkFiles;
import org.apache.spark.SparkFiles

scala> val inFile = sc.textFile(SparkFiles.get("test.data"));
15/05/15 16:08:33 INFO MemoryStore: ensureFreeSpace(258986) called with curMem=0, maxMem=278302556
15/05/15 16:08:33 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 252.9 KB, free 265.2 MB)
15/05/15 16:08:33 INFO MemoryStore: ensureFreeSpace(21113) called with curMem=258986, maxMem=278302556
15/05/15 16:08:33 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.6 KB, free 265.1 MB)
15/05/15 16:08:33 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40784 (size: 20.6 KB, free: 265.4 MB)
15/05/15 16:08:33 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/05/15 16:08:33 INFO SparkContext: Created broadcast 0 from textFile at <console>:13
inFile: org.apache.spark.rdd.RDD[String] = /tmp/spark-60b9bde7-d198-4a90-8f90-02e9cf77fa04/test.data MappedRDD[1] at textFile at <console>:13

scala> inFile.first();
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://hdp-7:8020/tmp/spark-60b9bde7-d198-4a90-8f90-02e9cf77fa04/test.data
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.RDD.take(RDD.scala:1060)
at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:16)
at $iwC$$iwC$$iwC.<init>(<console>:21)
at $iwC$$iwC.<init>(<console>:23)
at $iwC.<init>(<console>:25)
at <init>(<console>:27)
at .<init>(<console>:31)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1125)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:674)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:705)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:669)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:828)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:873)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:785)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:628)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:636)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:641)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:968)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:916)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:916)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1011)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Do you have any ideas?
2 REPLIES

Super Collaborator

Why are you using SparkFiles? The path you are trying to open is not defined, because SparkFiles only resolves files that were added through SparkContext.addFile(). Unless you have done that, you should use sc.textFile() and pass in the URI for the file (hdfs://... or something like it).
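For example, here is a minimal sketch of both approaches in spark-shell. The paths are just placeholders (I'm assuming you have a test.data file at /tmp on HDFS and a local copy at /tmp on the driver machine); adjust them to where your file actually lives:

// Approach 1: read the file straight from HDFS with an explicit URI
val hdfsFile = sc.textFile("hdfs://hdp-7:8020/tmp/test.data")
hdfsFile.first()

// Approach 2: distribute a driver-local file with addFile(), then
// resolve it through SparkFiles.get() - the only case SparkFiles
// supports. Note the file:// prefix: without it, the bare local path
// would again be resolved against the default filesystem (HDFS).
// (This sketch works as-is in local mode; on a cluster prefer the
// explicit HDFS URI from approach 1.)
import org.apache.spark.SparkFiles
sc.addFile("file:///tmp/test.data")
val localCopy = sc.textFile("file://" + SparkFiles.get("test.data"))
localCopy.first()

That also explains your stack trace: SparkFiles.get("test.data") simply returned a path inside the driver's temp directory, and sc.textFile() resolved that bare path against your default filesystem, hdfs://hdp-7:8020, where nothing exists.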

 

Wilfred

Explorer

Thank you for the response! It solved my troubles.