Member since: 05-15-2015
Posts: 27
Kudos Received: 0
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 4011 | 08-11-2015 09:38 AM
 | 2500 | 07-24-2015 07:18 PM
03-16-2017
12:27 PM
Hello: My apologies, there is an error in the subject; it should have read CDH 5.7, not CDH 7.5. I'm trying to fix it. Also, if a newer version of CDH has support (or if it is on the roadmap), I'd be interested in hearing about it. With best regards: Bill M.
03-13-2017
03:32 PM
Hello:
We have some CDH 5.7 clusters, and we want to deploy servers in containers to the nodes (probably via YARN).
Informally, we have some workflows that contact remote servers to do a fairly complex computation, and we are interested in using cluster nodes to host these servers (to promote more elastic use of our hardware). One approach we are entertaining is to package these servers in containers (Docker is the current candidate), and we believe that some work has been done to launch Docker containers via YARN (e.g. https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/DockerContainerExecutor.html). My questions are:
Is it possible to launch user-configured containers (e.g. Docker) via YARN in order to deploy servers?
If so, would Docker be a suitable container technology for this use case? If not Docker, what would be better?
If we were to use Docker containers for this use case, would all jobs in the system need to use Docker containers? How does that affect the build process and (Oozie-based) deployment?
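For concreteness, the NodeManager configuration we are experimenting with, taken as best we can tell from the DockerContainerExecutor page linked above (the docker path is a placeholder for our environment, and none of this is verified on CDH 5.7), is roughly:

  <!-- yarn-site.xml on each NodeManager; a sketch based on the linked doc, not a tested setup -->
  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
  </property>
  <property>
    <name>yarn.nodemanager.docker-container-executor.exec-name</name>
    <value>/usr/bin/docker</value>
  </property>

That page also appears to let individual jobs select their image via a yarn.nodemanager.docker-container-executor.image-name entry in the container environment, but we have not verified any of this on CDH, hence question 1.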
Thanks:
Bill
Labels:
- Apache Oozie
- Apache YARN
09-07-2016
05:02 PM
I have an existing job which has run successfully many times in the past, but a recent launch in YARN cluster mode, with the same configuration as an earlier attempt (which failed for other reasons), generated many instances of the following:
6308860 [Stage 1:==============================================> (12508 + 712) / 13220]java.util.concurrent.RejectedExecutionException: Task scala.concurrent.impl.CallbackRunnable@6d3616e6 rejected from java.util.concurrent.ThreadPoolExecutor@4eef216[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1170]
6308861 at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
6308862 at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
6308863 at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
6308864 at scala.concurrent.impl.ExecutionContextImpl.execute(ExecutionContextImpl.scala:122)
6308865 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
6308866 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
6308867 at scala.concurrent.Promise$class.complete(Promise.scala:55)
6308868 at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
6308869 at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
6308870 at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324)
6308871 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
6308872 at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
6308873 at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
6308874 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
6308875 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
6308876 at scala.concurrent.Promise$class.complete(Promise.scala:55)
6308877 at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
6308878 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
6308879 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
6308880 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
6308881 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643)
6308882 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658)
6308883 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
6308884 at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635)
6308885 at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
6308886 at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634)
6308887 at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
6308888 at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685)
6308889 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
6308890 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
6308891 at scala.concurrent.Promise$class.trySuccess(Promise.scala:94)
6308892 at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:153)
6308893 at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onSuccess$1(NettyRpcEnv.scala:215)
6308894 at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$2.apply(NettyRpcEnv.scala:223)
6308895 at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$2.apply(NettyRpcEnv.scala:222)
6308896 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
6308897 at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
6308898 at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133)
6308899 at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
6308900 at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
6308901 at scala.concurrent.Promise$class.complete(Promise.scala:55)
6308902 at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153)
6308903 at scala.concurrent.Promise$class.success(Promise.scala:86)
6308904 at scala.concurrent.impl.Promise$DefaultPromise.success(Promise.scala:153)
6308905 at org.apache.spark.rpc.netty.LocalNettyRpcCallContext.send(NettyRpcCallContext.scala:50)
6308906 at org.apache.spark.rpc.netty.NettyRpcCallContext.reply(NettyRpcCallContext.scala:32)
6308907 at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$processCompletedContainers$1$$anonfun$apply$7$$anonfun$apply$8.apply(YarnAllocator.scala:517)
6308908 at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$processCompletedContainers$1$$anonfun$apply$7$$anonfun$apply$8.apply(YarnAllocator.scala:517)
6308909 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
6308910 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
6308911 at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$processCompletedContainers$1$$anonfun$apply$7.apply(YarnAllocator.scala:517)
6308912 at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$processCompletedContainers$1$$anonfun$apply$7.apply(YarnAllocator.scala:512)
6308913 at scala.Option.foreach(Option.scala:236)
6308914 at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$processCompletedContainers$1.apply(YarnAllocator.scala:512)
6308915 at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$processCompletedContainers$1.apply(YarnAllocator.scala:442)
[Deleted items from the stack, various places in spark and its libraries, for brevity]
6308919 at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
6308920 at org.apache.spark.deploy.yarn.YarnAllocator.processCompletedContainers(YarnAllocator.scala:442)
6308921 at org.apache.spark.deploy.yarn.YarnAllocator.allocateResources(YarnAllocator.scala:242)
6308922 at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:368)
This is a new symptom to me. Any idea what causes it, and whether it is responsible for this job's failure? Is there some reason why these messages are not logged with timestamps?
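For what it is worth, our guess is that these particular lines are written to the console directly (alongside the stage progress bar) rather than through log4j, since the stock Spark log4j console configuration does timestamp whatever goes through it. For reference, that default is essentially:

  # conf/log4j.properties (Spark's default console appender, shown for reference)
  log4j.rootCategory=INFO, console
  log4j.appender.console=org.apache.log4j.ConsoleAppender
  log4j.appender.console.target=System.err
  log4j.appender.console.layout=org.apache.log4j.PatternLayout
  log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n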
Labels:
- Apache Spark
07-06-2016
12:08 PM
We have Spark jobs that we need in production that fail after running for a sufficiently long time. The probability of failure seems to increase with the number of stages, even when we greatly reduce the scale (say 100-fold). In this particular case the Spark job failed while saving the data after a long computation. The same job with the same data on the same machine succeeds when retried, so it is likely something other than simply program + data. We get the following stack trace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 230 in stage 3.0 failed 4 times, most recent failure: Lost task 230.3 in stage 3.0 (TID 8681, anode.myCluster.myCompany.com): java.io.IOException: Failed to connect to anode.myCluster.myCompany.com/aaa.bbb.ccc.ddd:eeeee
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: anode.myCluster.myCompany.com/aaa.bbb.ccc.ddd:fffff
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
[Stuff deleted]
at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1432)
at com.myCompany.myClass(MyCode.scala:120)
Consider what is happening at MyCode.scala:120. This is the main workflow for the Spark job; it is the body of myMethod in the MyCode class, and it uses the MyCode companion object:
100 def myMethod(sc : SparkContext, curryingParameters : CurryingParameters, filterParameter : Int, numberOfTasksToUse : Int, desiredNumberOfPartitions : Int) : Unit = {
101 val bcCurryingParameters = sc.broadcast(curryingParameters);
102 val lines = sc.textFile(inputPath).repartition(numberOfTasksToUse);
103 val resultsAndMetrics = lines.mapPartitions(MyCode.processLines(bcCurryingParameters.value)).
104 filter(aPair => aPair._2 != MyCode.filteredValue).
105 mapPartitions( MyCode.automatedTypeDependentClosureConstruction(bcCurryingParameters.value)). // this is an extremely expensive mapping-only function, we don't want to reevaluate it.
106 persist(StorageLevel.DISK_ONLY); // so we persist the results
107
108 // TODO: This statement was introduced to force materialization of resultsAndMetrics
109 val numResultsAndMetrics = resultsAndMetrics.count;
110
111 logger.error(s"""There were ${numResultsAndMetrics} results and metrics.""");
112
113 val results = resultsAndMetrics.
114 map(resultAndMetric => resultAndMetric._1).
115 map(result => Json.toJson(result)). // Json serialize via play framework
116 map(jsValue => Json.stringify(jsValue));
117
118 results.
119 coalesce(desiredNumberOfPartitions, true). // turned on shuffling due to anecdotal recommendations within my team
120 saveAsTextFile(routedTripsPath, classOf[GzipCodec]);
121 }
Please note that the output of line 111 was found in the logs, so the count was run:
2016:06:29:12:35:26.424 [Driver] ERROR com.mycompany.mypackage.MyCode.myMethod:111 - There were xxxxx results and metrics.
From the stack trace and the barrier nature of the count, I strongly suspect that the death occurs in the evaluation of lines 118-120, when the results of this 1:45 of computation are saved. But why did it fail? Interestingly, at 12:12:34 there is a broadcast-cleaning failure, even though the code has no explicit destroys (removed due to previous issues we experienced) and the broadcast data is small (assuredly less than 10 KB, more likely less than 1 KB):
2016:06:29:12:12:34.021 [Spark Context Cleaner] ERROR org.apache.spark.ContextCleaner.logError:96 - Error cleaning broadcast 7
org.apache.spark.rpc.RpcTimeoutException: Timed out. This timeout is controlled by spark.rpc.askTimeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214) ~[main.jar:na]
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229) ~[main.jar:na]
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225) ~[main.jar:na]
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) ~[main.jar:na]
at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185) ~[main.jar:na]
at scala.util.Try$.apply(Try.scala:161) ~[main.jar:na]
at scala.util.Failure.recover(Try.scala:185) ~[main.jar:na]
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) ~[main.jar:na]
at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:324) ~[main.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) ~[main.jar:na]
at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) ~[main.jar:na]
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) ~[main.jar:na]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) ~[main.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) ~[main.jar:na]
at scala.concurrent.Promise$class.complete(Promise.scala:55) ~[main.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) ~[main.jar:na]
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) ~[main.jar:na]
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) ~[main.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) ~[main.jar:na]
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.processBatch$1(Future.scala:643) ~[main.jar:na]
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply$mcV$sp(Future.scala:658) ~[main.jar:na]
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635) ~[main.jar:na]
at scala.concurrent.Future$InternalCallbackExecutor$Batch$$anonfun$run$1.apply(Future.scala:635) ~[main.jar:na]
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72) ~[main.jar:na]
at scala.concurrent.Future$InternalCallbackExecutor$Batch.run(Future.scala:634) ~[main.jar:na]
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) ~[main.jar:na]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:685) ~[main.jar:na]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) ~[main.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) ~[main.jar:na]
at scala.concurrent.Promise$class.complete(Promise.scala:55) ~[main.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:153) ~[main.jar:na]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249) ~[main.jar:na]
at scala.concurrent.Future$$anonfun$flatMap$1.apply(Future.scala:249) ~[main.jar:na]
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) ~[main.jar:na]
at org.spark-project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293) ~[main.jar:na]
at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:133) ~[main.jar:na]
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40) ~[main.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248) ~[main.jar:na]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334) ~[main.jar:na]
at akka.actor.Scheduler$$anon$11.run(Scheduler.scala:118) ~[main.jar:na]
at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) ~[main.jar:na]
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) ~[main.jar:na]
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:455) ~[main.jar:na]
at akka.actor.LightArrayRevolverScheduler$$anon$12.executeBucket$1(Scheduler.scala:407) ~[main.jar:na]
at akka.actor.LightArrayRevolverScheduler$$anon$12.nextTick(Scheduler.scala:411) ~[main.jar:na]
at akka.actor.LightArrayRevolverScheduler$$anon$12.run(Scheduler.scala:363) ~[main.jar:na]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_45]
Caused by: akka.pattern.AskTimeoutException: Timed out
... 9 common frames omitted
Closer to the time of death we see that an executor on the same node as the driver is complaining and failing, with a very long stack trace:
Container: container_e949_1462478996024_31856_01_000083 on anode.myCluster.myCompany.com_gggg
====================================================================================================
LogType:stderr
Log Upload Time:Wed Jun 29 13:53:45 -0700 2016
LogLength:525
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/3/yarn/nm/usercache/myUserName/filecache/2032/main.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p1108.867/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
LogType:stdout
Log Upload Time:Wed Jun 29 13:53:45 -0700 2016
LogLength:57199
Log Contents:
2016:06:29:12:36:23.843 [Executor task launch worker-3] ERROR o.a.s.n.shuffle.RetryingBlockFetcher.fetchAllOutstanding:142 - Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to anode.myCluster.myCompany.com/aaa.bbb.ccc.fff:hhhhh
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) ~[main.jar:na]
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) ~[main.jar:na]
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) ~[main.jar:na]
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) [main.jar:na]
at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120) [main.jar:na]
at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:97) [main.jar:na]
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:89) [main.jar:na]
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:605) [main.jar:na]
at org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:603) [main.jar:na]
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [main.jar:na]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [main.jar:na]
at org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:603) [main.jar:na]
[A bunch of lines from this pretty long stack trace deleted]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) ~[main.jar:na]
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ~[main.jar:na]
... 1 common frames omitted
2016:06:29:12:39:34.720 [SIGTERM handler] ERROR o.a.s.e.CoarseGrainedExecutorBackend.handle:57 - RECEIVED SIGNAL 15: SIGTERM
Why do we get these errors, and how do we fix this problem so that this job runs in a stable and predictable manner?
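For completeness, one avenue we are considering (we have not yet verified that it helps) is raising the RPC ask timeout named in the broadcast-cleaning error above, along with the general network timeout, at submit time; roughly:

  spark-submit \
    --master yarn --deploy-mode cluster \
    --conf spark.rpc.askTimeout=600s \
    --conf spark.network.timeout=600s \
    ...  # rest of our usual arguments; the 600s values are guesses, not tested settings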
Labels:
05-05-2016
06:38 PM
We have a Spark job that reads a large number of gzipped text files (40K or so) under production-level loads and fails (but works fine for smaller inputs). We were advised to reduce our partition count, so we implemented a persisted RDD that read in the inputs, coalesced them to an acceptable number of partitions (say around 4K), and then persisted them using StorageLevel.DISK_ONLY. The local file systems and memory seem very well provisioned for our workload, yet in spite of lengthy efforts this solution would consistently fail. In desperation we now consume precious HDFS space (which has severe contention in our environment) to write back the coalesced partitions and load them into a fresh RDD (a rough sketch of this workaround appears after the stack trace below). We would prefer to persist the coalesced data a different way, or to have Spark handle a larger number of partitions. Why can we save this data to HDFS while persist fails, even with DISK_ONLY? Is it actually possible to use local disk storage instead of HDFS for this? The (somewhat sanitized) stack trace of the failing coalesce-and-persist follows.
akka.event.Logging$Error$NoCause$: null
2016:05:05:15:32:55.829 [task-result-getter-1] ERROR o.a.spark.scheduler.TaskSetManager.logError:75 - Task 595 in stage 1.0 failed 4 times; aborting job
2016:05:05:15:32:55.854 [Driver] INFO c.a.g.myproject.common.MyProjectLogger.info:23 - Job completed
2016:05:05:15:32:55.856 [Driver] INFO c.a.g.myproject.common.MyProjectLogger.info:23 - postAction : begin
2016:05:05:15:32:57.070 [Driver] ERROR c.a.g.myproject.common.MyProjectLogger.error:31 - action [GalacticaPostProcessing] processing failed. stack trace : org.apache.spark.SparkException: Job aborted due to stage failure: Task 595 in stage 1.0 failed 4 times, most recent failure: Lost task 595.3 in stage 1.0 (TID 5009, a-cluster-node.mycompany.com): java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:522) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:312) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:58) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:115) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87) at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:162) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1124) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1065) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:989) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:965) at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply$mcV$sp(PairRDDFunctions.scala:951) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply(PairRDDFunctions.scala:951) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply(PairRDDFunctions.scala:951) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:950) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:909) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply(PairRDDFunctions.scala:907) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply(PairRDDFunctions.scala:907) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:907) at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply$mcV$sp(RDD.scala:1444) at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply(RDD.scala:1432) at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply(RDD.scala:1432) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1432) at com.mycompany.myproject.mysubproject.postprocessing.MyClass.runAction(MyClass.scala:236) at com.mycompany.myproject.batch.action.BaseSparkAction.doRun(BaseSparkAction.scala:120) at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1$$anonfun$2.apply$mcZ$sp(BaseSparkAction.scala:153) at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1$$anonfun$2.apply(BaseSparkAction.scala:153) at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1$$anonfun$2.apply(BaseSparkAction.scala:153) at scala.util.Try$.apply(Try.scala:161) at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1.apply$mcV$sp(BaseSparkAction.scala:152) at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1.apply(BaseSparkAction.scala:147) at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1.apply(BaseSparkAction.scala:147) at scala.util.Try$.apply(Try.scala:161) at com.mycompany.myproject.batch.action.BaseSparkAction.run(BaseSparkAction.scala:147) at com.mycompany.myproject.batch.action.BaseSparkAction$.main(BaseSparkAction.scala:215) at com.mycompany.myproject.mysubproject.postprocessing.MyClass$.main(MyClass.scala:387) at com.mycompany.myproject.mysubproject.postprocessing.MyClass.main(MyClass.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525) Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836) at 
org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:522) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:312) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:58) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:115) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at 
java.lang.Thread.run(Thread.java:745) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:162) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) 2016:05:05:15:32:57.073 [Driver] ERROR o.a.s.deploy.yarn.ApplicationMaster.logError:96 - User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 595 in stage 1.0 failed 4 times, most recent failure: Lost task 595.3 in stage 1.0 (TID 5009, a-cluster-node.mycompany.com): java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:522) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:312) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:58) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:115) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:162) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 595 in stage 1.0 failed 4 times, most recent failure: Lost task 595.3 in stage 1.0 (TID 5009, a-cluster-node.mycompany.com): java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:522) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:312) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:58) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:115) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:162) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294) ~[main.jar:na] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282) ~[main.jar:na] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281) ~[main.jar:na] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[main.jar:na] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) ~[main.jar:na] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281) ~[main.jar:na] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) ~[main.jar:na] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) ~[main.jar:na] at scala.Option.foreach(Option.scala:236) ~[main.jar:na] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) ~[main.jar:na] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507) ~[main.jar:na] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469) ~[main.jar:na] at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458) ~[main.jar:na] at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) ~[main.jar:na] at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) ~[main.jar:na] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824) ~[main.jar:na] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837) ~[main.jar:na] at org.apache.spark.SparkContext.runJob(SparkContext.scala:1914) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1124) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1065) ~[main.jar:na] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) ~[main.jar:na] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) ~[main.jar:na] at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1065) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:989) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:965) ~[main.jar:na] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) ~[main.jar:na] at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) ~[main.jar:na] at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:965) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply$mcV$sp(PairRDDFunctions.scala:951) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply(PairRDDFunctions.scala:951) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$3.apply(PairRDDFunctions.scala:951) ~[main.jar:na] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) ~[main.jar:na] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) ~[main.jar:na] at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:950) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:909) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply(PairRDDFunctions.scala:907) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$2.apply(PairRDDFunctions.scala:907) ~[main.jar:na] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) ~[main.jar:na] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) ~[main.jar:na] at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) ~[main.jar:na] at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:907) ~[main.jar:na] at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply$mcV$sp(RDD.scala:1444) ~[main.jar:na] at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply(RDD.scala:1432) ~[main.jar:na] at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$2.apply(RDD.scala:1432) ~[main.jar:na] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) ~[main.jar:na] at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) ~[main.jar:na] at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) ~[main.jar:na] at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1432) ~[main.jar:na] at com.mycompany.myproject.mysubproject.postprocessing.MyClass.runAction(MyClass.scala:236) ~[main.jar:na] at com.mycompany.myproject.batch.action.BaseSparkAction.doRun(BaseSparkAction.scala:120) ~[main.jar:na] at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1$$anonfun$2.apply$mcZ$sp(BaseSparkAction.scala:153) ~[main.jar:na] at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1$$anonfun$2.apply(BaseSparkAction.scala:153) ~[main.jar:na] at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1$$anonfun$2.apply(BaseSparkAction.scala:153) ~[main.jar:na] at scala.util.Try$.apply(Try.scala:161) ~[main.jar:na] at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1.apply$mcV$sp(BaseSparkAction.scala:152) ~[main.jar:na] at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1.apply(BaseSparkAction.scala:147) ~[main.jar:na] at com.mycompany.myproject.batch.action.BaseSparkAction$$anonfun$1.apply(BaseSparkAction.scala:147) ~[main.jar:na] at scala.util.Try$.apply(Try.scala:161) ~[main.jar:na] at com.mycompany.myproject.batch.action.BaseSparkAction.run(BaseSparkAction.scala:147) ~[main.jar:na] at 
com.mycompany.myproject.batch.action.BaseSparkAction$.main(BaseSparkAction.scala:215) ~[main.jar:na] at com.mycompany.myproject.mysubproject.postprocessing.MyClass$.main(MyClass.scala:387) ~[main.jar:na] at com.mycompany.myproject.mysubproject.postprocessing.MyClass.main(MyClass.scala) ~[main.jar:na] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_45] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_45] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_45] at java.lang.reflect.Method.invoke(Method.java:497) ~[na:1.8.0_45] at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525) ~[main.jar:na] Caused by: java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:836) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:125) at org.apache.spark.storage.DiskStore$$anonfun$getBytes$2.apply(DiskStore.scala:113) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1206) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:127) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:134) at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:522) at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:312) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58) at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:58) at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:115) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:87) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:101) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:162) ~[main.jar:na] at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:103) ~[main.jar:na] at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) ~[main.jar:na] at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) ~[main.jar:na] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) ~[main.jar:na] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) ~[main.jar:na] at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) ~[main.jar:na] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) ~[main.jar:na] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) ~[main.jar:na] at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[main.jar:na] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) ~[main.jar:na] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) ~[main.jar:na] at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) ~[main.jar:na] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) ~[main.jar:na] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) ~[main.jar:na] at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) ~[main.jar:na] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) ~[main.jar:na] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) ~[main.jar:na] at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) ~[main.jar:na] at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) ~[main.jar:na] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) ~[main.jar:na] at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ~[main.jar:na] at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_45]
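For context, the root cause at the bottom of the trace (java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE, thrown from FileChannelImpl.map via DiskStore.getBytes) looks like the known limitation where a single Spark block larger than roughly 2 GB cannot be memory-mapped. If that is what is happening here, the workaround I am considering is to spread the data over more partitions before the final save so that no single block grows that large; this is only a rough sketch, and the RDD name, partition count, and output path below are hypothetical placeholders, not values from the real job:
// Sketch only: increase the partition count ahead of the save so that no single
// partition/block approaches the ~2 GB mapping limit. "resultRdd", the multiplier,
// and the output path are placeholders, not the real names/values from this job.
val morePartitions = 4 * resultRdd.partitions.length
resultRdd
  .repartition(morePartitions)   // smaller blocks per partition
  .saveAsTextFile("hdfs:///some/hypothetical/output/path")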
... View more
03-09-2016
06:04 PM
I have a Spark job with a long-running stage (24+ hours) that runs in batch mode using yarn-cluster mode. When a node hosting an executor fails during the job, the executor fails to restart and the stage is restarted, losing valuable partial results; the job then misses its deadlines and the entire workflow can fail after very lengthy retries. The long stage persists its output (tuples), and the remaining stages run quite fast: the first and second stages materialize via countByKey with very small amounts of data to shuffle (fewer than 100 KB), and a later stage saves a projection of the tuples' data to HDFS (it runs longer but has no shuffle). During the last few runs the job could not complete due to various node-level issues (e.g. in one case a disk failed on a node hosting an executor; in another, an executor's NodeManager was restarted). The job runs reliably for shorter runs, the container memory limit is generous relative to the memory actually used, and so far (for the longer runs that fail) all errors appear to be precisely correlated with, and triggered by, cluster node failures that cause one or more executors to fail. I thought Spark was resilient and would recover the executor and retry just the failed tasks in the stage, but in practice the executor fails to restart. I have the following questions:
1) Does Spark migrate a failed executor to another container (on a different cluster node) before a stage fails? If not, is there something that can be done to encourage executor migration to a container on a different cluster node?
2) Under what conditions does Spark trigger a stage failure (is it because the same executor dies 4 times)?
3) I'm not seeing adequate diagnostics in my YARN application logs to determine why the executor fails, and it is a very laborious process to look at all the logs on the cluster nodes to determine the operational failure. Is there a smart way to learn why an executor fails and to identify all failed executors from an already failed job?
4) Is there some setting I need to configure to make Spark resilient against single node failures? (The settings I have been considering are sketched after this list.)
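Regarding question 4, here is a rough sketch of the failure-tolerance settings I have been looking at, assuming Spark 1.x on YARN; the names and defaults would need to be verified against our version, and the values are hypothetical starting points rather than recommendations:
import org.apache.spark.SparkConf
// Sketch only: candidate settings for tolerating task and executor loss.
val sparkConf = new SparkConf()
  // Number of failures of any particular task tolerated before the job is aborted (default 4).
  .set("spark.task.maxFailures", "8")
  // Number of executor failures tolerated before the YARN application itself is failed.
  .set("spark.yarn.max.executor.failures", "32")
  // Serve shuffle files from the external shuffle service so a lost executor's map output
  // need not all be recomputed (requires the shuffle service to be configured as a YARN
  // auxiliary service on the NodeManagers).
  .set("spark.shuffle.service.enabled", "true")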
... View more
10-28-2015
12:33 PM
I have a job where the executors crash at the end of ingestion and distinct of a modest-sized data set, with what seems to be a generous resource allocation, due to a timeout on the MapOutputTracker in spite of tuning efforts. What causes this and how can I fix it? The data set is of reasonable size (about 1.4 TB on a cluster of roughly 250 nodes). It is loaded from HDFS, parsed into Scala objects, and mapped to project a few fields/columns/data members into some nested tuples: field1 is a modest-sized String (less than, say, 15 characters) and the rest (field2, field3, and field4) are Ints. I then take the distinct of those columns; the result is small (say 2 MB or so, as confirmed by the shuffle output size in the Spark UI). The code that fails looks something like:
val desiredNumberOfPartitions = 2000;
val cookedMyData = rawMyData.map(line => new MyData(line));
val uniqueGroups = cookedMyData.map(md => ((md.field1, md.field2), (md.field3, md.field4))).distinct().repartition(desiredNumberOfPartitions);
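One variant I have been considering (again, just a sketch; I have not yet confirmed whether it changes the failure mode) is to pass the target partition count directly to distinct, so the de-duplicating shuffle writes straight into the desired number of partitions instead of being followed by a separate repartition shuffle:
// Sketch: distinct(numPartitions) shuffles directly into the target partition count,
// avoiding the extra shuffle that the trailing repartition() introduces.
val uniqueGroups = cookedMyData
  .map(md => ((md.field1, md.field2), (md.field3, md.field4)))
  .distinct(desiredNumberOfPartitions);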
The job runs (as seen in the Spark UI) with a driver with 40 GB and processor cores, and 1000 executors with 1 processor core and 4 GB each (i.e. pretty well resourced), until it hits the distinct; then it hangs on a straggler (the very last executor pair) and the job dies in a fiery cataclysm of doom (see the stack trace at the end of this post for details). I have tried some tuning and set the following sparkConf and Akka parameters:
// Here begins my custom settings
sparkConf.set("spark.akka.frameSize", "1000"); // Frame size in MB, workers should be able to send bigger messages, based on some recommended numbers
sparkConf.set("spark.akka.threads", "16"); // Try quadrupling the default number of threads for improved response time
sparkConf.set("spark.akka.askTimeout", "120"); // Increase from default 30 seconds in case it is too short
sparkConf.set("spark.kryoserializer.buffer.max.mb", "256"); // In case it helps
The logs for the executors look like:
Container: container_e36_1445047866026_2759_01_000100 on someNode.myCluster.myCompany.com_8041
==================================================================================================
LogType:stderr
Log Upload Time:Tue Oct 27 17:42:12 -0700 2015
LogLength:519
Log Contents:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/10/yarn/nm/usercache/william_maniatty/filecache/83/main.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [ch.qos.logback.classic.util.ContextSelectorStaticBinder]
LogType:stdout
Log Upload Time:Tue Oct 27 17:42:12 -0700 2015
LogLength:8012
Log Contents:
17:16:35,434 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]
17:16:35,434 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]
17:16:35,435 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [jar:file:/data/10/yarn/nm/usercache/william_maniatty/filecache/83/main.jar!/logback.xml]
17:16:35,448 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@5606c0b - URL [jar:file:/data/10/yarn/nm/usercache/william_maniatty/filecache/83/main.jar!/logback.xml] is not of type file
17:16:35,602 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
17:16:35,605 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]
17:16:35,675 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [com.myCompany.myTeam] to DEBUG
17:16:35,675 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to ERROR
17:16:35,675 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]
17:16:35,676 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
17:16:35,678 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6646153 - Registering current configuration as safe fallback point
2015:10:27:17:41:12.763 [Executor task launch worker-1] ERROR o.a.spark.MapOutputTrackerWorker.logError:96 - Error communicating with MapOutputTracker
org.apache.spark.SparkException: Error sending message [message = GetMapOutputStatuses(3)]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209) ~[main.jar:na]
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:113) [main.jar:na]
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:164) [main.jar:na]
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42) [main.jar:na]
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40) [main.jar:na]
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) [main.jar:na]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) [main.jar:na]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) [main.jar:na]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) [main.jar:na]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) [main.jar:na]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) [main.jar:na]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) [main.jar:na]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) [main.jar:na]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) [main.jar:na]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) [main.jar:na]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) [main.jar:na]
at org.apache.spark.scheduler.Task.run(Task.scala:64) [main.jar:na]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) [main.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) ~[main.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) ~[main.jar:na]
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) ~[main.jar:na]
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) ~[main.jar:na]
at scala.concurrent.Await$.result(package.scala:107) ~[main.jar:na]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195) ~[main.jar:na]
... 20 common frames omitted
2015:10:27:17:41:12.768 [Executor task launch worker-1] ERROR org.apache.spark.executor.Executor.logError:96 - Exception in task 162.0 in stage 4.0 (TID 29281)
org.apache.spark.SparkException: Error communicating with MapOutputTracker
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:117) ~[main.jar:na]
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:164) ~[main.jar:na]
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42) ~[main.jar:na]
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40) ~[main.jar:na]
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) ~[main.jar:na]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[main.jar:na]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[main.jar:na]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[main.jar:na]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[main.jar:na]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[main.jar:na]
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[main.jar:na]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) ~[main.jar:na]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) ~[main.jar:na]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) ~[main.jar:na]
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) ~[main.jar:na]
at org.apache.spark.scheduler.Task.run(Task.scala:64) ~[main.jar:na]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) ~[main.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_45]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_45]
at java.lang.Thread.run(Thread.java:745) [na:1.8.0_45]
Caused by: org.apache.spark.SparkException: Error sending message [message = GetMapOutputStatuses(3)]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:209) ~[main.jar:na]
at org.apache.spark.MapOutputTracker.askTracker(MapOutputTracker.scala:113) ~[main.jar:na]
... 19 common frames omitted
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) ~[main.jar:na]
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) ~[main.jar:na]
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) ~[main.jar:na]
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) ~[main.jar:na]
at scala.concurrent.Await$.result(package.scala:107) ~[main.jar:na]
at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:195) ~[main.jar:na]
... 20 common frames omitted
2015:10:27:17:42:05.646 [sparkExecutor-akka.actor.default-dispatcher-3] ERROR akka.remote.EndpointWriter.apply$mcV$sp:65 - AssociationError [akka.tcp://sparkExecutor@someNode.myCluster.myCompany.com:58865] -> [akka.tcp://sparkDriver@anotherNode.myCluster.myCompany.com:46814]: Error [Shut down address: akka.tcp://sparkDriver@anotherNode.myCluster.myCompany.com:46814] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://sparkDriver@anotherNode.myCluster.myCompany.com:46814
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
2015:10:27:17:42:05.693 [sparkExecutor-akka.actor.default-dispatcher-5] ERROR o.a.s.e.CoarseGrainedExecutorBackend.logError:75 - Driver Disassociated [akka.tcp://sparkExecutor@someNode.myCluster.myCompany.com:58865] -> [akka.tcp://sparkDriver@anotherNode.myCluster.myCompany.com:46814] disassociated! Shutting down.
... View more
Labels:
- Labels:
-
Apache Spark
08-11-2015
09:38 AM
The Cloudera Manager Python Client API does not currently support Kerberos authentication.
... View more
08-05-2015
12:25 PM
Our team has implemented an Oozie custom action and we want to deploy it using our internal (rpm-based) frameworks. We have implemented deployment scripts using the Cloudera Manager Python Client API (see http://cloudera.github.io/cm_api/epydoc/5.4.0/index.html), which seems to require knowing the authentication credentials (i.e. the plaintext user name and password of the Cloudera Manager server) at install time in order to construct the ApiResource, which has only one constructor. In particular, the documentation reads as follows:
__init__(self, server_host, server_port=None, username='admin', password='admin', use_tls=False, version=10) (Constructor)
Creates a Resource object that provides API endpoints.
Parameters:
server_host - The hostname of the Cloudera Manager server.
server_port - The port of the server. Defaults to 7180 (http) or 7183 (https).
username - Login name.
password - Login password.
use_tls - Whether to use tls (https).
version - API version.
Returns: Resource object referring to the root.
Overrides: object.__init__
Our local security team has suggested using Kerberos keytabs for authentication (and I'm not an expert at Kerberos yet). Can that approach work with the Python client interface, and if so, how?
... View more
Labels:
- Labels:
-
Apache Oozie
-
Cloudera Manager
-
Kerberos
-
Security
07-24-2015
07:18 PM
I'm not sure when I need to access the configuration from a role (or a group) as opposed to using the service directly, but based on a separate communication I used the role. The snippet of code below is based on a working version (run against a quickstart VM) and looks like:
import sys
from cm_api.api_client import ApiResource
# performs the post-install oozie service management for the red orbit module
def main():
    cmhost = 'quickstart.cloudera'
    theUserName = "admin"
    thePassword = "admin"
    apiResource = ApiResource(cmhost, username=theUserName, password=thePassword)
    # A kludgey way of specifying the cluster, but there is only one here
    clusters = apiResource.get_all_clusters()
    # TODO: This is a kludge to resolve the cluster
    if (len(clusters) != 1):
        print "There should be one cluster, but there are " + repr(len(clusters)) + " clusters"
        sys.exit(1)
    cluster = clusters[0]
    # TODO: These parameters are appropriate for the quickstart VM, are they right for our deployment?
    hueServiceName = "hue"
    oozieServiceName = "oozie"
    oozieServiceRoleName = 'oozie-OOZIE_SERVER'
    oozieService = cluster.get_service(oozieServiceName)
    hueService = cluster.get_service(hueServiceName)
    oozieServiceRole = oozieService.get_role(oozieServiceRoleName)
    # TODO: Is this always going to be the oozie service role's name? We may need to use the Role Type to be safe
    originalOozieConfig = oozieServiceRole.get_config(view='full')
    oozieConfigUpdates = { "oozie_executor_extension_classes" : "org.apache.oozie.action.hadoop.MyCustomActionExecutor",
                           "oozie_workflow_extension_schemas" : "my-custom-action-0.1.xsd" }
    for configUpdateKey in oozieConfigUpdates:
        if (configUpdateKey in originalOozieConfig):
            print repr(configUpdateKey) + " before update has oozie configured value = " + str(originalOozieConfig[configUpdateKey])
        else:
            print repr(configUpdateKey) + " not previously configured in oozie"
    # stop the oozie service after stopping any services depending on oozie (i.e. hue)
    for service in [hueService, oozieService]:
        print "stopping service " + repr(service.name)
        service.stop().wait()  # synchronous stop
        print "service " + repr(service.name) + " stopped"
    # update the configuration while the servers are quiescent
    updatedOozieConfig = oozieServiceRole.update_config(oozieConfigUpdates)
    print "updatedOozieConfig = " + repr(updatedOozieConfig)
    for configUpdateKey in oozieConfigUpdates:
        print 'Config after update for key = ' + repr(configUpdateKey) + " has value = " + repr(updatedOozieConfig[configUpdateKey])
    # restart the oozie service before restarting any services depending on oozie (i.e. hue)
    for service in [oozieService, hueService]:
        print "restarting service " + repr(service.name)
        service.restart().wait()  # synchronous restart
        print "service " + repr(service.name) + " restarted"
    # Done!
    return

# invoke the deployment step when run as a script
if __name__ == "__main__":
    main()
... View more