
About spark 1.5 release!

Re: About spark 1.5 release!

New Contributor

I figured it out, so now I have Spark 1.5.1 working with CDH 5.3.3

 

Because of the issue described in https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/DiGmGp93ZT4 I could not use a Spark build with Hadoop included; I had to download the "without Hadoop" build and link CDH's Hadoop libraries into it instead. In the end, the call to pyspark looks like this:

 

SPARK_DIST_CLASSPATH=$(/usr/bin/hadoop classpath) \
SCALA_HOME=/usr/lib/scala \
HADOOP_CONF_DIR=/usr/lib/spark-1.5.1-bin-without-hadoop/hadoop-conf \
YARN_CONF_DIR=/usr/lib/spark-1.5.1-bin-without-hadoop/yarn-conf \
HADOOP_USER_NAME=myuser \
MASTER=yarn \
/usr/lib/spark-1.5.1-bin-without-hadoop/bin/pyspark --deploy-mode client

Re: About spark 1.5 release!

Expert Contributor

I'm attempting to build and deploy Spark 1.5.1 on an existing CDH 5.4.2 cluster. Once the tarball is built, do I need to deploy it across the CDH cluster to make it available?

 

Thanks!


Re: About spark 1.5 release!

New Contributor

I am trying to run Spark 1.5.1 with CDH 5.4, using the Spark 1.5.1 build for Hadoop 2.6. I have set HADOOP_CONF_DIR to point to my Hadoop conf folder.

 

  • When I start Spark in local mode, I am able to read files from HDFS using sc.textFile and read from a Hive table using sqlContext.sql.
  • When I start Spark in yarn-client mode, I cannot read from HDFS; I get a "No such file" IOException. I can still read from the Hive table, though.

What could be causing the issue with reading files from HDFS in yarn-client mode?

 

Thank you,

Sunil

Re: About spark 1.5 release!

Explorer

You shouldn't touch the existing Spark installation. Just extract Spark 1.5 in a new location and use the configuration from your existing Spark installation. In yarn-client mode all jars are shipped to the cluster by the application, so Spark 1.5 should run in parallel with the Spark that came with your CDH version.
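
For example (the paths below are only placeholders for wherever your CDH client configuration lives and where you unpack the tarball), the whole thing can look roughly like this:

# Unpack the 1.5.1 tarball in a new location, leaving the CDH-managed Spark alone
tar -xzf spark-1.5.1-bin-hadoop2.6.tgz -C /opt

# Reuse the existing cluster configuration (assumed paths for a typical CDH node)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf

# Launch the new build against YARN in client mode; the bundled CDH Spark stays untouched
/opt/spark-1.5.1-bin-hadoop2.6/bin/spark-shell --master yarn --deploy-mode client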

Re: About spark 1.5 release!

New Contributor

@DeenarT, I have a separate install for 1.5.1

Re: About spark 1.5 release!

Explorer

Have you copied spark-env.sh, spark-defaults.conf and conf/yarn-conf/* from your existing CDH install to the new Spark install?
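
For reference, the copy is just something along these lines (the source and destination paths are assumptions for a typical package-based CDH install and a 1.5.1 tarball unpacked under /opt):

# Assumed paths: CDH Spark config under /etc/spark/conf, new install under /opt/spark-1.5.1-bin-hadoop2.6
cp /etc/spark/conf/spark-env.sh        /opt/spark-1.5.1-bin-hadoop2.6/conf/
cp /etc/spark/conf/spark-defaults.conf /opt/spark-1.5.1-bin-hadoop2.6/conf/
cp -r /etc/spark/conf/yarn-conf        /opt/spark-1.5.1-bin-hadoop2.6/conf/yarn-conf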

Re: About spark 1.5 release!

New Contributor

@DeenarT,

 

Yes, I copied spark-env.sh, spark-defaults.conf and yarn-conf from the CDH install.

 

I had to remove the local: prefix from the spark.yarn.jar property in spark-defaults.conf and point it at the assembly jar in the 1.5.1 install, as noted elsewhere on this forum, to get pyspark to come up.
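
Concretely, the spark-defaults.conf entry ended up looking roughly like this (both jar paths are illustrative assumptions, based on a typical CDH layout and on where the 1.5.1 tarball was unpacked):

# Before: the CDH default, a node-local path to the 1.3 assembly (illustrative)
# spark.yarn.jar   local:/usr/lib/spark/lib/spark-assembly.jar

# After: point at the 1.5.1 assembly, without the local: prefix
spark.yarn.jar     /opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar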

Re: About spark 1.5 release!

Explorer

What error message are you getting? Can you paste the stack trace?

Re: About spark 1.5 release!

New Contributor

I am running the following code inside pyspark:

d = sc.textFile('path_to_my_text_file_on_hdfs')
d.count()

With the default Spark 1.3.0 that comes with CDH 5.4, I get the count back.

With Spark 1.5.1 running in local mode, I also get the count back.

With Spark 1.5.1 running in yarn-client mode, I get the following exception:

 

15/10/29 12:22:06 INFO DAGScheduler: ResultStage 0 (count at <stdin>:1) failed in 1.347 s
15/10/29 12:22:06 INFO DAGScheduler: Job 0 failed: count at <stdin>:1, took 1.591735 s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/me/apps/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 1006, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/home/me/apps/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 997, in sum
    return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
  File "/home/me/apps/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 871, in fold
    vals = self.mapPartitions(func).collect()
  File "/home/me/apps/spark-1.5.1-bin-hadoop2.6/python/pyspark/rdd.py", line 773, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "/home/me/apps/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/home/me/apps/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql/utils.py", line 36, in deco
    return f(*a, **kw)
  File "/home/me/apps/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, server): java.io.IOException: Cannot run program "python2.7": error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 14 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
	at scala.Option.foreach(Option.scala:236)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Cannot run program "python2.7": error=2, No such file or directory
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:88)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
Caused by: java.io.IOException: error=2, No such file or directory
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:248)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 14 more

 

 

Re: About spark 1.5 release!

Explorer

The error says python2.7 couldn't be found. Why don't you try this in the Scala shell (spark-shell) and see if it works?
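
If spark-shell works, the HDFS read path is fine and the problem is only the Python interpreter on the worker nodes. In that case something along these lines usually gets pyspark going (the interpreter and install paths are assumptions; use whatever Python actually exists on your nodes):

# Point the driver at a Python interpreter that exists everywhere (assumed path)
export PYSPARK_PYTHON=/usr/bin/python

# Propagate the same interpreter to the YARN executors, which spawn their own Python workers
/opt/spark-1.5.1-bin-hadoop2.6/bin/pyspark \
  --master yarn --deploy-mode client \
  --conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python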