09-28-2015 04:40 PM
OS: CentOS 6.5
CDH-5.4.5:
hbase(1.0.0+cdh5.4.5+175), hdfs(2.6.0+cdh5.4.5+626),
kafka(0.8.2.0+kafka1.3.1+85), spark(1.3.0+cdh5.4.5+47),
yarn(2.6.0+cdh5.4.5+626), zookeeper(3.4.5+cdh5.4.5+92)
Oryx-2: beta 1 and beta 2
Config File: als-test.conf
# Copyright (c) 2014, Cloudera, Inc. All Rights Reserved.
#
# Cloudera, Inc. licenses this file to you under the Apache License,
# Version 2.0 (the "License"). You may not use this file except in
# compliance with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# This software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the License for
# the specific language governing permissions and limitations under the
# License.

# A very basic example config file configuring only the essential elements to
# run an ALS application.

# It's possible to specify reusable values:
kafka-brokers = "hadoop05:9092,hadoop06:9092,hadoop07:9092"
zk-servers = "hadoop05:2181,hadoop06:2181,hadoop07:2181"
hdfs-base = "hdfs://hadoop05:8020/user/bodao/Test"

oryx {
  id = "ALS-Test"
  input-topic {
    broker = ${kafka-brokers}
    lock = {
      master = ${zk-servers}
    }
    message = {
      # Input topic
      topic = "Test-ALSIn"
      # Key/message classes that the framework receives, respectively
      key-class = "java.lang.String"
      message-class = "java.lang.String"
      # Decoder classes used to read/write key/message classes
      key-decoder-class = "kafka.serializer.StringDecoder"
      message-decoder-class = "kafka.serializer.StringDecoder"
    }
  }
  update-topic {
    broker = ${kafka-brokers}
    lock = {
      master = ${zk-servers}
    }
    message = {
      # Update topic
      topic = "Test-ALSUp"
      # Decoder/encoder classes used to read/write key/message classes
      decoder-class = "kafka.serializer.StringDecoder"
      encoder-class = "kafka.serializer.StringEncoder"
      # Max size in bytes of message to write to the update topic. Don't change this unless you
      # know what you're doing. PMML models larger than this will be passed a location on HDFS,
      # for example. This should match the max.message.bytes configured for the update topic.
      # This value can't be larger than about 64MB in any event.
      max-size = 16777216
    }
  }
  batch {
    streaming {
      generation-interval-sec = 20
      num-executors = 3
      executor-cores = 4
      executor-memory = "1g"
      #driver-memory = "384m"
      dynamic-allocation = true
    }
    update-class = "com.cloudera.oryx.app.batch.mllib.als.ALSUpdate"
    storage {
      data-dir = ${hdfs-base}"/data/"
      model-dir = ${hdfs-base}"/model/"
    }
    ui {
      port = 22220
    }
  }
  speed {
    streaming {
      generation-interval-sec = 20
      num-executors = 3
      executor-cores = 2
      executor-memory = "512m"
      #driver-memory = "384m"
      dynamic-allocation = true
    }
    model-manager-class = "com.cloudera.oryx.app.speed.als.ALSSpeedModelManager"
    ui {
      port = 22221
    }
  }
  serving {
    memory = "384m"
    yarn = {
      instances = 1
      core = "2"
    }
    model-manager-class = "com.cloudera.oryx.app.serving.als.model.ALSServingModelManager"
    application-resources = "com.cloudera.oryx.app.serving,com.cloudera.oryx.app.serving.als"
    api {
      port = 22222
    }
  }
}
Description: This has been a severe problem troubling me for a long time, and I am still stuck because nobody seems to have encountered it before. So far, the batch layer crashes frequently, the speed layer sometimes exits after running for a long while, and the serving layer always runs stably.
Following the guide on oryx.io, I successfully uploaded the movie-rating data to the serving layer with wget, and the batch and speed layers worked perfectly and returned the computed results. So I ran the three layers in the background, like:
./oryx-run.sh batch --conf als-test.conf &
After several hours (about 10 the first time), the batch layer crashed with this error (ERR-1):
Job aborted due to stage failure: Task 5 in stage 41329.0 failed 4 times, most recent failure: Lost task 5.3 in stage 41329.0 (TID 311412, hadoop05): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_9567_piece0 of broadcast_9567 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1155) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87) at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:107) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104) at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) at org.apache.spark.rdd.RDD.iterator(RDD.scala:242) at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Failed to get broadcast_9567_piece0 of broadcast_9567 at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136) at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119) at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1152) ... 38 more
I searched for useful material and found something surprising: the setting "spark.cleaner.ttl" appears to be strongly correlated with ERR-1. The Spark shipped with CDH stays at 1.3.0, which I understand to be an unstable release, so this turned into a headache I could not solve.
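One workaround I am experimenting with is raising the cleaner TTL far above the batch generation interval so broadcast pieces are not cleaned up while stages still reference them. This is only a sketch: it assumes oryx-run.sh submits the batch/speed jobs through spark-submit so that spark-defaults.conf on the gateway host applies, and that nothing in the job itself overrides the property.

# 86400 s (one day) is an arbitrary large value; the file path is my CDH default
echo "spark.cleaner.ttl 86400" >> /etc/spark/conf/spark-defaults.conf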
I tried restarting Oryx 2 every 3 hours, but another error (ERR-2) appeared and the batch layer crashed:
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/home/yarn/nm/usercache/bodao/appcache/application_1442445186451_0017/blockmgr-aeccd2a2-6225-4c02-9c33-bf4e531f056d/3d/shuffle_307_1022_0.data, offset=9403, length=4234} at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:61) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=/home/yarn/nm/usercache/bodao/appcache/application_1442445186451_0017/blockmgr-aeccd2a2-6225-4c02-9c33-bf4e531f056d/3d/shuffle_307_1022_0.data, offset=9403, length=4234} at org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:113) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$3.apply(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$3.apply(ShuffleBlockFetcherIterator.scala:300) at scala.util.Try$.apply(Try.scala:161) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) ... 22 more Caused by: java.io.FileNotFoundException: /home/yarn/nm/usercache/bodao/appcache/application_1442445186451_0017/blockmgr-aeccd2a2-6225-4c02-9c33-bf4e531f056d/3d/shuffle_307_1022_0.data (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.<init>(FileInputStream.java:146) at org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:98) ... 27 more
and "spark.cleaner.ttl" seems to result in this. Worse Still, if I restart oryx-2 right after this two crash, batch layer will exit with similar error like ERR-1:
Job aborted due to stage failure: Task 6 in stage 497.0 failed 4 times, most recent failure: Lost task 6.3 in stage 497.0 (TID 25924, hadoop05): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_42_piece0 of broadcast_42 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1155) at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164) ...
Afterwards, Kafka caused me some problems because my servers halted due to an accidental power cut (ERR-3):
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 5211.0 failed 4 times, most recent failure: Lost task 5.3 in stage 5211.0 (TID 204446, hadoop06): com.esotericsoftware.kryo.KryoException: com.ning.compress.lzf.LZFException: Corrupt input data, block did not start with 2 byte signature ('ZV') followed by type byte, 2-byte length) at com.esotericsoftware.kryo.io.Input.fill(Input.java:142) at com.esotericsoftware.kryo.io.Input.require(Input.java:155) at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337) at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60) at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: com.ning.compress.lzf.LZFException: Corrupt input data, block did not start with 
2 byte signature ('ZV') followed by type byte, 2-byte length) at com.ning.compress.lzf.ChunkDecoder._reportCorruptHeader(ChunkDecoder.java:267) at com.ning.compress.lzf.impl.UnsafeChunkDecoder.decodeChunk(UnsafeChunkDecoder.java:55) at com.ning.compress.lzf.LZFInputStream.readyBuffer(LZFInputStream.java:339) at com.ning.compress.lzf.LZFInputStream.read(LZFInputStream.java:169) at com.esotericsoftware.kryo.io.Input.fill(Input.java:140) ... 43 more
I couldn't recover the corrupted topics because Kafka came up with different broker ids after the servers rebooted, so I had to create some new topics.
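For reference, this is roughly how I recreated them (a sketch: the partition and replication counts are just what I picked, the topic names are the ones from my als-test.conf, and on my install the tool is called kafka-topics; elsewhere it may be kafka-topics.sh):

kafka-topics --create --zookeeper hadoop05:2181,hadoop06:2181,hadoop07:2181 --topic Test-ALSIn --partitions 3 --replication-factor 2
kafka-topics --create --zookeeper hadoop05:2181,hadoop06:2181,hadoop07:2181 --topic Test-ALSUp --partitions 1 --replication-factor 2 --config max.message.bytes=16777216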
At first, I wondered whether it mattered that my program uploads data to Kafka directly. But the batch layer still crashed with ERR-1 after some time (about 6 hours) even when a cron job used wget to post different movie-rating data every 5 minutes. It drove me crazy, and I concluded that "spark.cleaner.ttl" on version 1.3.0 was the root cause of my problems.
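The cron entry is essentially the following (a sketch: the data path is a placeholder, hadoop05:22222 is the serving-layer host/port from my config, and /ingest is the serving-layer upload endpoint as shown in the oryx.io ALS example):

*/5 * * * * wget --quiet --post-file /path/to/ratings-chunk.csv --output-document - http://hadoop05:22222/ingest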
And now I am testing Oryx 2 on CDH with a Spark 1.4.1 build I compiled. Since Oryx 2 is not yet stable enough and offers few algorithms in Java, I am considering combining PredictionIO with it, which may be a better practice.
09-28-2015 08:16 PM
10-07-2015 04:43 AM
All of the error snippets above are from the temporary logs on the Spark application web pages.
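Once an application has finished, I can also pull its full aggregated container logs in one go. A sketch, assuming YARN log aggregation is enabled on the cluster, using the batch application id that appears in the executor snippets further down:

yarn logs -applicationId application_1444137930733_0009 > oryx-batch-executors.log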
ERR-2: What do you mean by "Are you running the shuffle service?"
As far as I know, shuffle is a normal step in Spark job processing, but I have no idea about this error either. It may be associated with ERR-1.
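If this refers to Spark's external shuffle service on YARN, my (unverified) understanding is that dynamic allocation needs it so that shuffle files survive when executors are removed, which would fit the FileNotFoundException in ERR-2. A rough sketch of how I would enable it, using the standard Spark-on-YARN property names (the spark-defaults.conf path is just my CDH default):

# On every NodeManager: add spark_shuffle to yarn.nodemanager.aux-services in yarn-site.xml,
# set yarn.nodemanager.aux-services.spark_shuffle.class to org.apache.spark.network.yarn.YarnShuffleService,
# put the Spark yarn-shuffle jar on the NodeManager classpath, and restart YARN. Then on the gateway host:
cat >> /etc/spark/conf/spark-defaults.conf <<'EOF'
spark.shuffle.service.enabled   true
spark.dynamicAllocation.enabled true
EOF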
Actually, I heard that the ttl behavior changed in later Spark versions, so I tested Oryx several times on my new cluster (CDH 5.4.5 and Spark 1.4.1).
However, ERR-1 still showed up along with other errors. The snippets that follow are from the executor logs, which are really large:
-rw-r--r-- 1 hadoop wheel 225M Oct 7 08:56 uhadoop-43nqrv-core1_57289
-rw-r--r-- 1 hadoop wheel 346M Oct 7 08:56 uhadoop-43nqrv-task1_56246
-rw-r--r-- 1 hadoop wheel 565M Oct 7 08:56 uhadoop-43nqrv-task2_55836
-rw-r--r-- 1 hadoop wheel  92M Oct 7 08:56 uhadoop-43nqrv-task3_52174
You see, there were four executors -- core1, task1, task2, task3.
task1:
15/10/07 02:17:12 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 772878 15/10/07 02:17:12 INFO executor.Executor: Running task 62.0 in stage 70948.0 (TID 772878) 15/10/07 02:17:12 INFO storage.BlockManager: Found block rdd_36849_62 locally 15/10/07 02:17:12 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1 15/10/07 02:17:12 INFO compress.CodecPool: Got brand-new compressor [.gz] 15/10/07 02:17:12 INFO output.FileOutputCommitter: Saved output of task 'attempt_201510070217_70948_m_000062_772878' to hdfs://Ucluster/user/hadoop/item/model/.temporary/1444155243829/0/Y/_temporary/0/task_201510070217_70948_m_000062 15/10/07 02:17:12 INFO mapred.SparkHadoopMapRedUtil: attempt_201510070217_70948_m_000062_772878: Committed 15/10/07 02:17:12 INFO executor.Executor: Finished task 62.0 in stage 70948.0 (TID 772878). 2054 bytes result sent to driver 15/10/07 02:17:12 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 772885 15/10/07 02:17:12 INFO executor.Executor: Running task 68.0 in stage 70948.0 (TID 772885) 15/10/07 02:17:12 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM 15/10/07 02:17:12 INFO storage.DiskBlockManager: Shutdown hook called 15/10/07 02:17:12 INFO storage.BlockManager: Found block rdd_36849_68 locally 15/10/07 02:17:12 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1 15/10/07 02:17:12 INFO compress.CodecPool: Got brand-new compressor [.gz] 15/10/07 02:17:12 INFO output.FileOutputCommitter: Saved output of task 'attempt_201510070217_70948_m_000068_772885' to hdfs://Ucluster/user/hadoop/item/model/.temporary/1444155243829/0/Y/_temporary/0/task_201510070217_70948_m_000068 15/10/07 02:17:12 INFO mapred.SparkHadoopMapRedUtil: attempt_201510070217_70948_m_000068_772885: Committed 15/10/07 02:17:12 INFO executor.Executor: Finished task 68.0 in stage 70948.0 (TID 772885). 2054 bytes result sent to driver
task2:
15/10/07 02:17:50 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1 15/10/07 02:17:50 INFO compress.CodecPool: Got brand-new compressor [.gz] 15/10/07 02:17:50 INFO output.FileOutputCommitter: Saved output of task 'attempt_201510070217_70948_m_000140_773827' to hdfs://Ucluster/user/hadoop/item/model/.temporary/1444155243829/0/Y/_temporary/0/task_201510070217_70948_m_000140 15/10/07 02:17:50 INFO mapred.SparkHadoopMapRedUtil: attempt_201510070217_70948_m_000140_773827: Committed 15/10/07 02:17:50 INFO executor.Executor: Finished task 10.0 in stage 70948.1 (TID 773827). 2905 bytes result sent to driver 15/10/07 02:17:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 773833 15/10/07 02:17:51 INFO executor.Executor: Running task 0.0 in stage 70978.0 (TID 773833) 15/10/07 02:17:51 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 12989 15/10/07 02:17:51 INFO storage.MemoryStore: ensureFreeSpace(19557) called with curMem=15783408, maxMem=556038881 15/10/07 02:17:51 INFO storage.MemoryStore: Block broadcast_12989_piece0 stored as bytes in memory (estimated size 19.1 KB, free 515.2 MB) 15/10/07 02:17:51 INFO broadcast.TorrentBroadcast: Reading broadcast variable 12989 took 6 ms 15/10/07 02:17:51 INFO storage.MemoryStore: ensureFreeSpace(43280) called with curMem=15802965, maxMem=556038881 15/10/07 02:17:51 INFO storage.MemoryStore: Block broadcast_12989 stored as values in memory (estimated size 42.3 KB, free 515.2 MB) 15/10/07 02:17:51 INFO spark.CacheManager: Partition rdd_36848_0 not found, computing it 15/10/07 02:17:51 INFO storage.BlockManager: Found block rdd_36596_0 locally 15/10/07 02:17:51 INFO storage.BlockManager: Found block rdd_36596_0 locally 15/10/07 02:17:51 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 4425, fetching them 15/10/07 02:17:51 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645]) 15/10/07 02:17:51 INFO spark.MapOutputTrackerWorker: Got the output locations 15/10/07 02:17:51 INFO storage.ShuffleBlockFetcherIterator: Getting 95 non-empty blocks out of 162 blocks 15/10/07 02:17:51 INFO client.TransportClientFactory: Found inactive connection to uhadoop-43nqrv-task1/10.10.185.149:47049, creating a new one. 15/10/07 02:17:51 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 18 outstanding blocks 15/10/07 02:17:51 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 18 outstanding blocks after 5000 ms 15/10/07 02:17:51 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 13 ms 15/10/07 02:17:56 INFO client.TransportClientFactory: Found inactive connection to uhadoop-43nqrv-task1/10.10.185.149:47049, creating a new one. 15/10/07 02:17:56 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 18 outstanding blocks (after 1 retries) java.io.IOException: Failed to connect to uhadoop-43nqrv-task1/10.10.185.149:47049 ...
task3:
5/10/07 02:36:32 INFO util.Utils: Fetching http://10.10.103.41:59384/jars/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar to /data/yarn/local/usercache/hadoop/appcache/application_1444137930733_0009/fetchFileTemp2650740863846490596.tmp 15/10/07 02:36:32 INFO util.Utils: Copying /data/yarn/local/usercache/hadoop/appcache/application_1444137930733_0009/-1880841261444139405591_cache to /data/yarn/local/usercache/hadoop/appcache/application_1444137930733_0009/container_1444137930733_0009_01_000013/./spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar 15/10/07 02:36:32 INFO executor.Executor: Adding file:/data/yarn/local/usercache/hadoop/appcache/application_1444137930733_0009/container_1444137930733_0009_01_000013/./spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar to class loader 15/10/07 02:36:32 INFO spark.MapOutputTrackerWorker: Updating epoch to 4767 and clearing cache 15/10/07 02:36:32 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 13622 15/10/07 02:36:32 INFO storage.MemoryStore: ensureFreeSpace(5809) called with curMem=0, maxMem=556038881 15/10/07 02:36:32 INFO storage.MemoryStore: Block broadcast_13622_piece0 stored as bytes in memory (estimated size 5.7 KB, free 530.3 MB) 15/10/07 02:36:32 INFO broadcast.TorrentBroadcast: Reading broadcast variable 13622 took 116 ms 15/10/07 02:36:33 INFO storage.MemoryStore: ensureFreeSpace(11720) called with curMem=5809, maxMem=556038881 15/10/07 02:36:33 INFO storage.MemoryStore: Block broadcast_13622 stored as values in memory (estimated size 11.4 KB, free 530.3 MB) 15/10/07 02:36:33 INFO spark.CacheManager: Partition rdd_38540_33 not found, computing it 15/10/07 02:36:33 INFO spark.CacheManager: Partition rdd_38524_33 not found, computing it 15/10/07 02:36:33 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 4620, fetching them 15/10/07 02:36:33 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645]) 15/10/07 02:36:33 INFO spark.MapOutputTrackerWorker: Got the output locations 15/10/07 02:36:33 INFO storage.ShuffleBlockFetcherIterator: Getting 79 non-empty blocks out of 328 blocks 15/10/07 02:36:33 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 10 outstanding blocks java.io.IOException: Failed to connect to uhadoop-43nqrv-task2/10.10.158.33:58676 ... 15/10/07 02:36:33 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 10 outstanding blocks after 5000 ms 15/10/07 02:36:33 INFO storage.ShuffleBlockFetcherIterator: Started 6 remote fetches in 77 ms 15/10/07 02:36:38 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 10 outstanding blocks (after 1 retries) java.io.IOException: Failed to connect to uhadoop-43nqrv-task2/10.10.158.33:58676 ... ...
15/10/07 02:36:48 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from uhadoop-43nqrv-task2:58676 java.io.IOException: Failed to connect to uhadoop-43nqrv-task2/10.10.158.33:58676 ... 15/10/07 02:36:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 840756 15/10/07 02:36:51 INFO executor.Executor: Running task 3.0 in stage 73994.1 (TID 840756) 15/10/07 02:36:51 INFO spark.MapOutputTrackerWorker: Updating epoch to 4778 and clearing cache 15/10/07 02:36:51 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 13624 15/10/07 02:36:51 INFO storage.MemoryStore: ensureFreeSpace(5809) called with curMem=17529, maxMem=556038881 15/10/07 02:36:51 INFO storage.MemoryStore: Block broadcast_13624_piece0 stored as bytes in memory (estimated size 5.7 KB, free 530.3 MB) 15/10/07 02:36:51 INFO broadcast.TorrentBroadcast: Reading broadcast variable 13624 took 10 ms 15/10/07 02:36:51 INFO storage.MemoryStore: ensureFreeSpace(11720) called with curMem=23338, maxMem=556038881 15/10/07 02:36:51 INFO storage.MemoryStore: Block broadcast_13624 stored as values in memory (estimated size 11.4 KB, free 530.2 MB) 15/10/07 02:36:51 INFO spark.CacheManager: Partition rdd_38540_23 not found, computing it 15/10/07 02:36:51 INFO spark.CacheManager: Partition rdd_38524_23 not found, computing it 15/10/07 02:36:51 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 4620, fetching them 15/10/07 02:36:51 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645]) 15/10/07 02:36:51 INFO spark.MapOutputTrackerWorker: Got the output locations 15/10/07 02:36:51 INFO storage.ShuffleBlockFetcherIterator: Getting 80 non-empty blocks out of 328 blocks 15/10/07 02:36:51 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 19 ms 15/10/07 02:36:51 INFO storage.MemoryStore: ensureFreeSpace(17024) called with curMem=35058, maxMem=556038881 15/10/07 02:36:51 INFO storage.MemoryStore: Block rdd_38524_23 stored as values in memory (estimated size 16.6 KB, free 530.2 MB) 15/10/07 02:36:51 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 15/10/07 02:36:51 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS 15/10/07 02:36:51 INFO storage.MemoryStore: ensureFreeSpace(12984) called with curMem=52082, maxMem=556038881 15/10/07 02:36:51 INFO storage.MemoryStore: Block rdd_38540_23 stored as values in memory (estimated size 12.7 KB, free 530.2 MB) 15/10/07 02:36:51 INFO executor.Executor: Finished task 3.0 in stage 73994.1 (TID 840756). 2158 bytes result sent to driver 15/10/07 02:36:51 INFO storage.BlockManager: Removing RDD 38541 15/10/07 02:36:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 840783 ...
core1:
15/10/07 06:42:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1581651 15/10/07 06:42:49 INFO executor.Executor: Running task 7.0 in stage 106533.0 (TID 1581651) 15/10/07 06:42:49 INFO spark.CacheManager: Partition rdd_56722_7 not found, computing it 15/10/07 06:42:49 INFO storage.BlockManager: Found block rdd_56464_7 locally 15/10/07 06:42:49 INFO storage.BlockManager: Found block rdd_56464_7 locally 15/10/07 06:42:49 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 6625, fetching them 15/10/07 06:42:49 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645]) 15/10/07 06:42:49 INFO spark.MapOutputTrackerWorker: Got the output locations 15/10/07 06:42:49 INFO storage.ShuffleBlockFetcherIterator: Getting 121 non-empty blocks out of 183 blocks 15/10/07 06:42:49 INFO client.TransportClientFactory: Found inactive connection to uhadoop-43nqrv-task3/10.10.103.41:56230, creating a new one. 15/10/07 06:42:49 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 24 outstanding blocks java.io.IOException: Failed to connect to uhadoop-43nqrv-task3/10.10.103.41:56230 ... ... 15/10/07 06:43:04 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from uhadoop-43nqrv-task3:56230 java.io.IOException: Failed to connect to uhadoop-43nqrv-task3/10.10.103.41:56230 ... ... 15/10/07 06:43:04 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1581660 15/10/07 06:43:04 INFO executor.Executor: Running task 1.0 in stage 106532.0 (TID 1581660) 15/10/07 06:43:04 INFO spark.MapOutputTrackerWorker: Updating epoch to 6927 and clearing cache 15/10/07 06:43:04 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19536 15/10/07 06:43:04 INFO storage.MemoryStore: ensureFreeSpace(18798) called with curMem=62163615, maxMem=556038881 15/10/07 06:43:04 INFO storage.MemoryStore: Block broadcast_19536_piece0 stored as bytes in memory (estimated size 18.4 KB, free 471.0 MB) 15/10/07 06:43:04 INFO broadcast.TorrentBroadcast: Reading broadcast variable 19536 took 5 ms 15/10/07 06:43:04 INFO storage.MemoryStore: ensureFreeSpace(41400) called with curMem=62182413, maxMem=556038881 15/10/07 06:43:04 INFO storage.MemoryStore: Block broadcast_19536 stored as values in memory (estimated size 40.4 KB, free 470.9 MB) 15/10/07 06:43:04 INFO storage.BlockManager: Found block rdd_56479_6 locally 15/10/07 06:43:04 INFO spark.CacheManager: Partition rdd_56698_6 not found, computing it 15/10/07 06:43:04 INFO storage.BlockManager: Found block rdd_56478_6 locally 15/10/07 06:43:04 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 6624, fetching them 15/10/07 06:43:04 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645]) 15/10/07 06:43:04 INFO spark.MapOutputTrackerWorker: Got the output locations 15/10/07 06:43:04 INFO storage.ShuffleBlockFetcherIterator: Getting 183 non-empty blocks out of 183 blocks 15/10/07 06:43:04 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 4 ms 15/10/07 06:43:04 INFO storage.MemoryStore: ensureFreeSpace(144) called with curMem=62223813, maxMem=556038881 15/10/07 06:43:04 INFO storage.MemoryStore: Block rdd_56698_6 stored as values in memory (estimated size 144.0 B, free 470.9 MB) 15/10/07 06:43:04 INFO executor.Executor: Finished task 1.0 in stage 106532.0 
(TID 1581660). 2927 bytes result sent to driver 15/10/07 06:43:04 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1581667 15/10/07 06:43:04 INFO executor.Executor: Running task 11.0 in stage 106532.0 (TID 1581667) 15/10/07 06:43:04 INFO storage.BlockManager: Found block rdd_56479_51 locally 15/10/07 06:43:04 INFO spark.CacheManager: Partition rdd_56698_51 not found, computing it 15/10/07 06:43:04 INFO storage.BlockManager: Found block rdd_56478_51 locally 15/10/07 06:43:04 INFO storage.ShuffleBlockFetcherIterator: Getting 183 non-empty blocks out of 183 blocks 15/10/07 06:43:04 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 5 ms 15/10/07 06:43:04 INFO storage.MemoryStore: ensureFreeSpace(16) called with curMem=62223957, maxMem=556038881 15/10/07 06:43:04 INFO storage.MemoryStore: Block rdd_56698_51 stored as values in memory (estimated size 16.0 B, free 470.9 MB) 15/10/07 06:43:04 INFO executor.Executor: Finished task 11.0 in stage 106532.0 (TID 1581667). 2927 bytes result sent to driver 15/10/07 06:43:04 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1581668 ... ...
15/10/07 06:43:44 WARN common.AppInfo$: Can't read Kafka version from MANIFEST.MF. Possible cause: java.lang.NullPointerException 15/10/07 06:43:44 INFO producer.Producer: Shutting down producer 15/10/07 06:43:44 INFO async.ProducerSendThread: Begin shutting down ProducerSendThread 15/10/07 06:43:44 INFO client.ClientUtils$: Fetching metadata from broker id:0,host:uhadoop-43nqrv-master1,port:9092 with correlation id 0 for 1 topic(s) Set(item_up) 15/10/07 06:43:44 INFO producer.SyncProducer: Connected to uhadoop-43nqrv-master1:9092 for producing 15/10/07 06:43:44 INFO producer.SyncProducer: Disconnecting from uhadoop-43nqrv-master1:9092 15/10/07 06:43:44 INFO producer.SyncProducer: Connected to uhadoop-43nqrv-master2:9092 for producing 15/10/07 06:43:44 INFO async.ProducerSendThread: Shutdown ProducerSendThread complete 15/10/07 06:43:44 INFO producer.ProducerPool: Closing all sync producers 15/10/07 06:43:44 INFO producer.SyncProducer: Disconnecting from uhadoop-43nqrv-master2:9092 15/10/07 06:43:44 INFO producer.Producer: Producer shutdown completed in 4 ms 15/10/07 06:43:44 INFO executor.Executor: Finished task 180.0 in stage 106587.0 (TID 1583910). 1761 bytes result sent to driver 15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583912 15/10/07 06:43:44 INFO executor.Executor: Running task 182.0 in stage 106587.0 (TID 1583912) 15/10/07 06:43:44 INFO rdd.HadoopRDD: Input split: hdfs://Ucluster/user/hadoop/item/model/1444171422997/Y/part-00182.gz:0+20 15/10/07 06:43:44 INFO executor.Executor: Finished task 182.0 in stage 106587.0 (TID 1583912). 1761 bytes result sent to driver 15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583913 15/10/07 06:43:44 INFO executor.Executor: Running task 1.0 in stage 106588.0 (TID 1583913) 15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19567 15/10/07 06:43:44 INFO storage.MemoryStore: ensureFreeSpace(3744) called with curMem=12250491, maxMem=556038881 15/10/07 06:43:44 INFO storage.MemoryStore: Block broadcast_19567_piece0 stored as bytes in memory (estimated size 3.7 KB, free 518.6 MB) 15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Reading broadcast variable 19567 took 3 ms 15/10/07 06:43:44 INFO storage.MemoryStore: ensureFreeSpace(7072) called with curMem=12254235, maxMem=556038881 15/10/07 06:43:44 INFO storage.MemoryStore: Block broadcast_19567 stored as values in memory (estimated size 6.9 KB, free 518.6 MB) 15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Reading broadcast variable 19567 took 3 ms 15/10/07 06:43:44 INFO storage.MemoryStore: ensureFreeSpace(7072) called with curMem=12254235, maxMem=556038881 15/10/07 06:43:44 INFO storage.MemoryStore: Block broadcast_19567 stored as values in memory (estimated size 6.9 KB, free 518.6 MB) 15/10/07 06:43:44 INFO storage.BlockManager: Found block rdd_56435_1 locally 15/10/07 06:43:44 INFO executor.Executor: Finished task 1.0 in stage 106588.0 (TID 1583913). 
2171 bytes result sent to driver 15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583916 15/10/07 06:43:44 INFO executor.Executor: Running task 5.0 in stage 106588.0 (TID 1583916) 15/10/07 06:43:44 INFO spark.CacheManager: Partition rdd_56436_1 not found, computing it 15/10/07 06:43:44 INFO rdd.NewHadoopRDD: Input split: hdfs://Ucluster/user/hadoop/item/data/oryx-1444139505000.data/part-r-00001:0+43446 15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19445 15/10/07 06:43:44 ERROR executor.Executor: Exception in task 5.0 in stage 106588.0 (TID 1583916) java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_19445_piece0 of broadcast_19445 ...
15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583917 15/10/07 06:43:44 INFO executor.Executor: Running task 5.1 in stage 106588.0 (TID 1583917) 15/10/07 06:43:44 INFO spark.CacheManager: Partition rdd_56436_1 not found, computing it 15/10/07 06:43:44 INFO rdd.NewHadoopRDD: Input split: hdfs://Ucluster/user/hadoop/item/data/oryx-1444139505000.data/part-r-00001:0+43446 15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19445 15/10/07 06:43:44 ERROR executor.Executor: Exception in task 5.1 in stage 106588.0 (TID 1583917) java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_19445_piece0 of broadcast_19445 ... 15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583918 15/10/07 06:43:44 INFO executor.Executor: Running task 6.0 in stage 106588.0 (TID 1583918) 15/10/07 06:43:44 INFO spark.CacheManager: Partition rdd_56436_2 not found, computing it 15/10/07 06:43:44 INFO rdd.NewHadoopRDD: Input split: hdfs://Ucluster/user/hadoop/item/data/oryx-1444139505000.data/part-r-00002:0+41145 15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19445 15/10/07 06:43:44 ERROR executor.Executor: Exception in task 6.0 in stage 106588.0 (TID 1583918) java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_19445_piece0 of broadcast_19445 ... ... 15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583927 15/10/07 06:43:44 INFO executor.Executor: Running task 0.0 in stage 106591.0 (TID 1583927) 15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19570 15/10/07 06:43:45 INFO storage.MemoryStore: ensureFreeSpace(2298) called with curMem=12261307, maxMem=556038881 15/10/07 06:43:45 INFO storage.MemoryStore: Block broadcast_19570_piece0 stored as bytes in memory (estimated size 2.2 KB, free 518.6 MB) 15/10/07 06:43:45 INFO broadcast.TorrentBroadcast: Reading broadcast variable 19570 took 3 ms 15/10/07 06:43:45 INFO storage.MemoryStore: ensureFreeSpace(4064) called with curMem=12263605, maxMem=556038881 15/10/07 06:43:45 INFO storage.MemoryStore: Block broadcast_19570 stored as values in memory (estimated size 4.0 KB, free 518.6 MB) 15/10/07 06:43:45 INFO kafka.KafkaRDD: Beginning offset 113207 is the same as ending offset skipping item_in 0 15/10/07 06:43:45 INFO executor.Executor: Finished task 0.0 in stage 106591.0 (TID 1583927). 610 bytes result sent to driver 15/10/07 06:43:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583928 ... ... 
15/10/07 07:19:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1705138 15/10/07 07:19:49 INFO executor.Executor: Running task 3.0 in stage 111407.0 (TID 1705138) 15/10/07 07:19:49 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 20433 15/10/07 07:19:49 INFO storage.MemoryStore: ensureFreeSpace(19811) called with curMem=37484559, maxMem=556038881 15/10/07 07:19:49 INFO storage.MemoryStore: Block broadcast_20433_piece0 stored as bytes in memory (estimated size 19.3 KB, free 494.5 MB) 15/10/07 07:19:49 INFO broadcast.TorrentBroadcast: Reading broadcast variable 20433 took 5 ms 15/10/07 07:19:49 INFO storage.MemoryStore: ensureFreeSpace(43584) called with curMem=37504370, maxMem=556038881 15/10/07 07:19:49 INFO storage.MemoryStore: Block broadcast_20433 stored as values in memory (estimated size 42.6 KB, free 494.5 MB) 15/10/07 07:19:49 INFO storage.BlockManager: Found block rdd_59081_3 locally 15/10/07 07:19:49 INFO executor.Executor: Finished task 3.0 in stage 111407.0 (TID 1705138). 2988 bytes result sent to driver 15/10/07 07:19:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1705141 15/10/07 07:19:49 INFO executor.Executor: Running task 9.0 in stage 111407.0 (TID 1705141) 15/10/07 07:19:49 INFO storage.BlockManager: Found block rdd_59081_9 locally 15/10/07 07:19:49 INFO executor.Executor: Finished task 9.0 in stage 111407.0 (TID 1705141). 3002 bytes result sent to driver 15/10/07 07:19:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1705143 15/10/07 07:19:49 INFO executor.Executor: Running task 14.0 in stage 111407.0 (TID 1705143) 15/10/07 07:19:49 INFO spark.CacheManager: Partition rdd_59081_14 not found, computing it 15/10/07 07:19:49 INFO storage.BlockManager: Found block rdd_58862_14 locally 15/10/07 07:19:49 INFO storage.BlockManager: Found block rdd_58862_14 locally 15/10/07 07:19:49 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 6930, fetching them 15/10/07 07:19:49 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645]) 15/10/07 07:19:49 INFO spark.MapOutputTrackerWorker: Got the output locations 15/10/07 07:19:49 INFO storage.ShuffleBlockFetcherIterator: Getting 119 non-empty blocks out of 187 blocks 15/10/07 07:19:49 INFO client.TransportClientFactory: Found inactive connection to uhadoop-43nqrv-task2/10.10.158.33:56100, creating a new one. 15/10/07 07:19:49 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 22 outstanding blocks java.io.IOException: Failed to connect to uhadoop-43nqrv-task2/10.10.158.33:56100
My guess is that the executor on task1 terminated at 02:17, so task2's executor lost its connection to it and threw connection exceptions.
But the executors on task3 and core1 look a bit strange; their logs are much smaller than task1's and task2's.
Task3's threw connection exceptions at 02:36. Core1's threw connection exceptions at first, and then ERR-1 at 06:43.
Now I am even more puzzled...
Another question: taking this ALS application as an example, are the three Oryx layers actually three applications on YARN? I found that the batch layer exited when the application OryxBatchLayer-item ended, and I only see OryxBatchLayer-item and OryxSpeedLayer-item on the YARN scheduler web page, but no OryxServingLayer-item. Why is that?
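To see for myself what YARN is actually running, I list the applications like this:

yarn application -list -appStates RUNNING | grep -i oryx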
About logs: I redirected the output of each layer to files, like "./oryx-run.sh batch --conf $Conf 2>&1 >> "$LogDir/$Date-batch".log &", and the files are extremely large!
-rw-r--r-- 1 hadoop wheel 340M Oct 1 15:32 2015-10-01-batch.log
-rw-r--r-- 1 hadoop wheel 430K Oct 1 15:30 2015-10-01-serving.log
-rw-r--r-- 1 hadoop wheel 2.7G Oct 5 11:13 2015-10-01-speed.log
-rw-r--r-- 1 hadoop wheel 718M Oct 6 04:37 2015-10-05-batch.log
-rw-r--r-- 1 hadoop wheel 288K Oct 6 19:25 2015-10-05-serving.log
-rw-r--r-- 1 hadoop wheel 825M Oct 6 19:33 2015-10-05-speed.log
-rw-r--r-- 1 hadoop wheel 662M Oct 7 07:43 2015-10-06-batch.log
-rw-r--r-- 1 hadoop wheel 215K Oct 7 08:52 2015-10-06-serving.log
-rw-r--r-- 1 hadoop wheel 120M Oct 7 14:38 2015-10-06-speed.log
Theoretically, the serving layer is more idle than the other two. Why not send the output of the batch and speed layers to the serving layer, and have the serving layer write everything to files? I suggest adding a simple logging module to handle this output instead of just printing it, so that all three layers can run as daemons.
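In the meantime, as a stopgap so the redirected files don't fill the disk, I am thinking of rotating them with logrotate (a sketch: copytruncate because the layers keep the log file open, and the path and sizes are just my setup):

cat > /etc/logrotate.d/oryx <<'EOF'
/home/hadoop/oryx/logs/*.log {
    size 200M
    rotate 5
    compress
    copytruncate
    missingok
}
EOF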
Thanks!