RDD.pipe() is resulting in "No such file or directory" only when running in Cloudera

Explorer

I have a Spark job that runs fine locally with Spark 1.3.0 standalone. It captures a stream and sends it to an external R script using RDD.pipe().

 

But when I deploy it on Cloudera CDH 5.4.4, it fails with:

 

java.io.IOException: Cannot run program "/tmp/spark-7b544563-a696-4300-a4f0-866afb9a7a19/userFiles-a4730bc3-7817-4f94-af5d-6ba22dc49fa1/MyScript.R": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)

 

But if I connect to the node in question, I DO see that sc.addFile() did its job; the file is present on the executor node! Furthermore, it is readable and executable by everyone:

 

hdfs@ip-10-183-0-135:/tmp/spark-7b544563-a696-4300-a4f0-866afb9a7a19/userFiles-a4730bc3-7817-4f94-af5d-6ba22dc49fa1$ ls -l
-rwxrwxr-x 1 hdfs hdfs 3080 Jul 28 21:01 MyScript.R

Also, I am submitting the Spark job as the user "hdfs".

 

What am I missing?

17 REPLIES

Re: RDD.pipe() is resulting in "No such file or directory" only when running in Cloudera

Master Collaborator

That does sound puzzling. I'd first triple-check that the file is being executed where you think it is -- on an executor, not the driver? Is that the whole command? I'm not sure if it would cause this error, but I'm also not sure you can run a script like this as if in the shell -- do you need to invoke it as an argument to R CMD?
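
For what it's worth, one way to triple-check that is sketched below -- assuming the same SparkContext sc from the job and the bare file name of the added script (here called scriptName, a placeholder): ask each executor task where SparkFiles resolves the file and whether it actually exists there.

import java.io.File
import java.net.InetAddress
import org.apache.spark.SparkFiles

// One task per partition reports the host it ran on, the path SparkFiles
// resolved, and whether that path exists and is executable there.
val report = sc.parallelize(1 to 8, 4).mapPartitions { _ =>
  val path = SparkFiles.get(scriptName)
  val f = new File(path)
  Iterator(s"${InetAddress.getLocalHost.getHostName}: $path exists=${f.exists} canExecute=${f.canExecute}")
}.collect()

report.foreach(println)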

Re: RDD.pipe() is resulting in "No such file or directory" only when running in Cloudera

Explorer

Good call, I double-checked again today, and indeed the executor node is not getting the file.  Shouldn't sc.addFile() make it available to all executor nodes?

 

As for whether this should be possible to do... I'm basically doing what you see here.

 

The R script has a hashbang at the beginning (#!/usr/bin/env Rscript), which lets you invoke it like a shell command (e.g. see here).

 

And it does work fine, executing the R script and returning the results, when submitting against --master local, just not --master yarn-client. When I use yarn-client, it tells me the file is not found, and if I go look in the /tmp/spark-* directory on the failing node, the file is not there.

Re: RDD.pipe() is resulting in "No such file or directory" only when running in Cloudera

Master Collaborator

Hm, but are you expecting the file to be put where the source file was? It isn't; it's put in the working directory.
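
To illustrate the distinction (a sketch only; MyScript.R is just a placeholder name): SparkFiles reports the per-application directory that addFile populates and resolves added files against it, not against the original source path.

import org.apache.spark.SparkFiles

// The per-application directory that addFile-distributed files land in...
println(SparkFiles.getRootDirectory())   // e.g. /tmp/spark-.../userFiles-...
// ...and where a given added file is resolved (root directory + "/" + file name).
println(SparkFiles.get("MyScript.R"))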

Re: RDD.pipe() is resulting in "No such file or directory" only when running in Cloudera

Explorer

sowen, first let me say I really appreciate you helping out with this!

 

Not sure what you mean by "working directory"... I think I am misunderstanding something fundamental, so it's time I showed some code.

 

I stripped out everything unrelated, to just get piping to R working:

 

 

import java.nio.file.Paths

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext, SparkFiles}

object SimplePipeToR {
  def main(args: Array[String]) {
    val max = args(0).toInt
    val script = args(1)
    val scriptName = Paths.get(script).getFileName.toString

    val sc = new SparkContext(new SparkConf().setAppName("pipe-to-r"))
    try {
      sc.addFile(script)
      val inputs: RDD[Int] = sc.parallelize(1 to max)
      val sparkScript: String = SparkFiles.get(scriptName)
      println(s"sparkScript to execute: $sparkScript")
      val outputs: RDD[String] = inputs.pipe(sparkScript)
      outputs.foreach(println)
    } finally {
      // sleep for 5 min so I have time to go look at the executor node...
      Thread.sleep(5 * 60 * 1000)
      sc.stop()
    }
  }
}

 

 

All this does is send numbers from 1 to N to an R script, which spits back any that are divisible by 17. So if you invoke it with 100 for N, it should spit back something like:

51
68
17
34
85

 

If I ssh to 10.183.0.135 and submit against local[4], like this:

 

 

hdfs@ip-10-183-0-135:/home/ubuntu$ spark-submit --master local[4] --class my.package.SimplePipeToR /home/ubuntu/pipe-to-r/streaming-pipe-r-assembly-0.1-SNAPSHOT.jar 100 /home/ubuntu/pipe-to-r/detectAnomalies.R

 

 

It works:

 

 

sparkScript to execute: /tmp/spark-7d904c76-7527-47d9-a6ab-0920af20f3a0/userFiles-c441fb09-fd1d-4296-9cbc-1378cbe7c5f7/detectAnomalies.R

51
68
17
34
85

 

 

But if I try it against yarn-client, like this:

 

 

hdfs@ip-10-183-0-135:/home/ubuntu$ spark-submit --master yarn-client --class my.package.SimplePipeToR /home/ubuntu/pipe-to-r/streaming-pipe-r-assembly-0.1-SNAPSHOT.jar 100 /home/ubuntu/pipe-to-r/detectAnomalies.R

 

Then I see two different types of errors, depending on where it's trying to execute. From the local node where I am launching (10.183.0.135), I get "permission denied" errors:

 

 

sparkScript to execute: /tmp/spark-358cd249-511c-459c-a6d5-434c213bdc0e/userFiles-db6444b6-6ccc-4d2f-b2d5-992bd574989c/detectAnomalies.R

Lost task 1.0 in stage 0.0 (TID 1, ip-10-183-0-135): java.io.IOException: Cannot run program "/tmp/spark-358cd249-511c-459c-a6d5-434c213bdc0e/userFiles-db6444b6-6ccc-4d2f-b2d5-992bd574989c/detectAnomalies.R": error=13, Permission denied
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
	at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=13, Permission denied
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
	at java.lang.ProcessImpl.start(ProcessImpl.java:130)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
	... 9 more

 

 

Then it retries it on a remote node (10.183.0.59) and I see another error, which is the "No such file or directory" I have been referring to:

 

Lost task 0.0 in stage 0.0 (TID 0, ip-10-183-0-59): java.io.IOException: Cannot run program "/tmp/spark-358cd249-511c-459c-a6d5-434c213bdc0e/userFiles-db6444b6-6ccc-4d2f-b2d5-992bd574989c/detectAnomalies.R": error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
	at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
	at java.lang.ProcessImpl.start(ProcessImpl.java:130)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
	... 9 more

 

If I go look in the node from where I launched spark-submit (10.183.0.135, the one that gave me the "permission denied"), I do see the file in the place it is trying to look, and it seems to be globally readable and executable:

 

hdfs@ip-10-183-0-135:/home/ubuntu$ ls -l /tmp/spark-358cd249-511c-459c-a6d5-434c213bdc0e/userFiles-db6444b6-6ccc-4d2f-b2d5-992bd574989c/
total 4
-rwxrwxr-x 1 hdfs hdfs 620 Jul 30 17:33 detectAnomalies.R

 

So first question... why the "permission denied" on that node?

 

And if I go to the remote node (10.183.0.59, the one that gave me the "no such file or directory"), I don't even see a spark-* directory under /tmp:

 

hdfs@ip-10-183-0-59:/$ ls /tmp
cmflistener-stderr---agent-1145-1438264974-wMwnMA.log  hsperfdata_hdfs                              Jetty_ip.10.183.0.59_50075_datanode____1ri0ab
cmflistener-stdout---agent-1145-1438264974-YVzNb5.log  hsperfdata_oozie                             Jetty_ip.10.183.0.59_50090_secondary____ut9y6q
hbase-hbase                                            hsperfdata_root                              Jetty_ip.10.183.0.59_8042_node____9hdz1k
hsperfdata_flume                                       hsperfdata_yarn                              libleveldbjni-64-1-3169413485540758896.8
hsperfdata_hbase                                       Jetty_0_0_0_0_60030_regionserver____.h599vl

So second question... why is sc.addFile() not causing the file to get copied to the node at 10.183.0.59?

 

Thanks again so much for helping me get my head around this...

Re: RDD.pipe() is resulting in "No such file or directory" only when running in Cloudera

Master Collaborator

I mean that if you add a file /foo/bar/bing.sh it does not necessarily appear on the executor at /foo/bar/bing.sh.

However, I believe you are accessing it the right way, and letting SparkFiles figure out where it is.

 

I think "permission denied" is because the copy is not executable?

This is why I was thinking you should execute it as "/path/to/R ... script.R" instead. I'm not sure whether that's the underlying issue though.
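
A minimal sketch of that, assuming Rscript is on the PATH of every executor node -- the interpreter becomes the executable, so the copied script only needs to be readable, not executable:

// Same names as in SimplePipeToR above; only the pipe call changes.
val sparkScript: String = SparkFiles.get(scriptName)
val outputs: RDD[String] = inputs.pipe(Seq("Rscript", sparkScript))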

 

The "not found" could be a read permission problem in an intermediate directory. What are those like? 

Are you working on a secure cluster?

Re: RDD.pipe() is resulting in "No such file or directory" only when running in Cloudera

Explorer
Not sure... I'm just a lowly developer new to the team and don't have much insight into how this was all set up. I think at this point I'm going to escalate this to Cloudera support. Will report back here with the findings. Thanks again for looking!

Re: RDD.pipe() is resulting in "No such file or directory" only when running in Cloudera

Master Collaborator

I think they'll probably ask for something similar, so I'd check, if you can, what ended up on the executor. Also, as a control, you might try running a simple bash script. I doubt R is the factor here, but it would be good to rule it in or out.
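
A control along those lines, as a sketch (the /tmp path and script name are placeholders): distribute a trivial shell script exactly the same way as the R one and pipe through it, so the mechanics are identical but R is out of the picture.

import java.nio.file.{Files, Paths}
import org.apache.spark.SparkFiles

// Write a trivial pass-through script on the driver's local disk...
val testScript = Paths.get("/tmp/echo_test.sh")
Files.write(testScript, "#!/bin/bash\ncat\n".getBytes)
testScript.toFile.setExecutable(true)

// ...distribute it like the R script, and pipe a small RDD through it.
sc.addFile(testScript.toString)
val piped = sc.parallelize(1 to 10).pipe(SparkFiles.get("echo_test.sh"))
piped.collect().foreach(println)   // expect 1..10 back if distribution and permissions are fine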

Re: RDD.pipe() is resulting in "No such file or directory" only when running in Cloudera

Cloudera Employee

Hi guys,

 

I'm not sure how the SparkFiles mechanism works, but should the script be loaded into HDFS rather than passed as a local file?


Scott

Re: RDD.pipe() is resulting in "No such file or directory" only when running in Cloudera

Master Collaborator

It can be a local file, an HDFS file, or even an HTTP URI. It's therefore useful to always specify the scheme.
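
For instance (placeholder paths, not taken from this cluster):

// Local to the host running spark-submit:
sc.addFile("file:///home/ubuntu/pipe-to-r/detectAnomalies.R")
// From HDFS:
sc.addFile("hdfs:///user/hdfs/scripts/detectAnomalies.R")
// From an HTTP server:
sc.addFile("http://repo.example.com/scripts/detectAnomalies.R")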