New Contributor
Posts: 5
Registered: 08-20-2015

Oryx-2 crashed after running for several hours

OS: CentOS 6.5

 

CDH-5.4.5:

hbase(1.0.0+cdh5.4.5+175), hdfs(2.6.0+cdh5.4.5+626),

kafka(0.8.2.0+kafka1.3.1+85), spark(1.3.0+cdh5.4.5+47),

yarn(2.6.0+cdh5.4.5+626), zookeeper(3.4.5+cdh5.4.5+92)

 

Oryx-2: beta 1 and beta 2

Config File: als-test.conf

# Copyright (c) 2014, Cloudera, Inc. All Rights Reserved.
#
# Cloudera, Inc. licenses this file to you under the Apache License,
# Version 2.0 (the "License"). You may not use this file except in
# compliance with the License. You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# This software is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
# CONDITIONS OF ANY KIND, either express or implied. See the License for
# the specific language governing permissions and limitations under the
# License.

# A very basic example config file configuring only the essential elements to
# run an ALS application.

# It's possible to specify reusable values:

kafka-brokers = "hadoop05:9092,hadoop06:9092,hadoop07:9092"
zk-servers = "hadoop05:2181,hadoop06:2181,hadoop07:2181"
hdfs-base = "hdfs://hadoop05:8020/user/bodao/Test"

oryx {
    id = "ALS-Test"

    input-topic {
        broker = ${kafka-brokers}

        lock = {
            master = ${zk-servers}
        }

        message = {
            # Input topic
            topic = "Test-ALSIn"

            # Key/message classes that the framework receives, respectively
            key-class = "java.lang.String"
            message-class = "java.lang.String"

            # Decoder classes used to read/write key/message classes
            key-decoder-class = "kafka.serializer.StringDecoder"
            message-decoder-class = "kafka.serializer.StringDecoder"
        }
    }

    update-topic {
        broker = ${kafka-brokers}

        lock = {
            master = ${zk-servers}
        }

        message = {
            # Update topic
            topic = "Test-ALSUp"

            # Decoder/encoder classes used to read/write key/message classes
            decoder-class = "kafka.serializer.StringDecoder"
            encoder-class = "kafka.serializer.StringEncoder"

            # Max size in bytes of message to write to the update topic. Don't change this unless you
            # know what you're doing. PMML models larger than this will be passed as a location on HDFS,
            # for example. This should match the max.message.bytes configured for the update topic.
            # This value can't be larger than about 64MB in any event.
            max-size = 16777216
        }
    }

    batch {
        streaming {
            generation-interval-sec = 20
            num-executors = 3
            executor-cores = 4
            executor-memory = "1g"
            #driver-memory = "384m"
            dynamic-allocation = true
        }

        update-class = "com.cloudera.oryx.app.batch.mllib.als.ALSUpdate"

        storage {
            data-dir = ${hdfs-base}"/data/"
            model-dir = ${hdfs-base}"/model/"
        }

        ui {
            port = 22220
        }
    }

    speed {
        streaming {
            generation-interval-sec = 20
            num-executors = 3
            executor-cores = 2
            executor-memory = "512m"
            #driver-memory = "384m"
            dynamic-allocation = true
        }

        model-manager-class = "com.cloudera.oryx.app.speed.als.ALSSpeedModelManager"

        ui {
            port = 22221
        }
    }

    serving {
        memory = "384m"

        yarn = {
            instances = 1
            core = "2"
        }

        model-manager-class = "com.cloudera.oryx.app.serving.als.model.ALSServingModelManager"
        application-resources = "com.cloudera.oryx.app.serving,com.cloudera.oryx.app.serving.als"

        api {
            port = 22222
        }
    }
}

 

Description: This has been a severe problem troubling me for a long time, and unfortunately I am still stuck in it because nobody seems to have encountered this nightmare before. Up to now, the batch layer crashes frequently, the speed layer sometimes exits after a long run, and the serving layer always works stably.

Following the guide on oryx.io, I successfully uploaded the movie-ratings data to the serving layer with wget, and the batch and speed layers worked perfectly and returned the computed results. So I ran the three layers in the background, like:

./oryx-run.sh batch --conf als-test.conf &
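
Concretely, all three layers are started the same way (a rough sketch; nohup and the log file names are just my own convention, not anything Oryx-specific):

# Sketch: start each layer detached, with stdout and stderr going to its own log file.
nohup ./oryx-run.sh batch   --conf als-test.conf > batch.log   2>&1 &
nohup ./oryx-run.sh speed   --conf als-test.conf > speed.log   2>&1 &
nohup ./oryx-run.sh serving --conf als-test.conf > serving.log 2>&1 &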

 

After several hours (about 10 the first time), the batch layer crashed with this error (ERR-1):

Job aborted due to stage failure: Task 5 in stage 41329.0 failed 4 times, most recent failure: Lost task 5.3 in stage 41329.0 (TID 311412, hadoop05): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_9567_piece0 of broadcast_9567
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1155)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_9567_piece0 of broadcast_9567
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1152)
... 38 more

 

I searched for useful material and found something surprising: the setting "spark.cleaner.ttl" seems to have a strong correlation with ERR-1. The Spark version in CDH stays at 1.3.0, which seemed to me an unstable release, so it turned into an unsolvable headache.

I tried restarting Oryx-2 every 3 hours, but another error (ERR-2) came up and the batch layer crashed:

org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/home/yarn/nm/usercache/bodao/appcache/application_1442445186451_0017/blockmgr-aeccd2a2-6225-4c02-9c33-bf4e531f056d/3d/shuffle_307_1022_0.data, offset=9403, length=4234}
	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
	at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
	at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:61)
	at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer{file=/home/yarn/nm/usercache/bodao/appcache/application_1442445186451_0017/blockmgr-aeccd2a2-6225-4c02-9c33-bf4e531f056d/3d/shuffle_307_1022_0.data, offset=9403, length=4234}
	at org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:113)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$3.apply(ShuffleBlockFetcherIterator.scala:300)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anonfun$3.apply(ShuffleBlockFetcherIterator.scala:300)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53)
	... 22 more
Caused by: java.io.FileNotFoundException: /home/yarn/nm/usercache/bodao/appcache/application_1442445186451_0017/blockmgr-aeccd2a2-6225-4c02-9c33-bf4e531f056d/3d/shuffle_307_1022_0.data (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:146)
	at org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:98)
	... 27 more

 

and "spark.cleaner.ttl" seems to result in this. Worse Still, if I restart oryx-2 right after this two crash, batch layer will exit with similar error like ERR-1:

Job aborted due to stage failure: Task 6 in stage 497.0 failed 4 times, most recent failure: Lost task 6.3 in stage 497.0 (TID 25924, hadoop05): java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_42_piece0 of broadcast_42
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1155)
 at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)

...

 

Afterwards, Kafka caused me some problems because my servers halted after an accidental power cut (ERR-3):

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 5211.0 failed 4 times, most recent failure: Lost task 5.3 in stage 5211.0 (TID 204446, hadoop06): com.esotericsoftware.kryo.KryoException: com.ning.compress.lzf.LZFException: Corrupt input data, block did not start with 2 byte signature ('ZV') followed by type byte, 2-byte length)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
at com.esotericsoftware.kryo.io.Input.require(Input.java:155)
at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138)
at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:60)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.ning.compress.lzf.LZFException: Corrupt input data, block did not start with 2 byte signature ('ZV') followed by type byte, 2-byte length)
at com.ning.compress.lzf.ChunkDecoder._reportCorruptHeader(ChunkDecoder.java:267)
at com.ning.compress.lzf.impl.UnsafeChunkDecoder.decodeChunk(UnsafeChunkDecoder.java:55)
at com.ning.compress.lzf.LZFInputStream.readyBuffer(LZFInputStream.java:339)
at com.ning.compress.lzf.LZFInputStream.read(LZFInputStream.java:169)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:140)
... 43 more

 

I couldn't recover the corrupted topics because Kafka came up with different broker ids after the servers rebooted, so I had to create some new topics.
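
For reference, recreating the topics looked roughly like this (a sketch using kafka-topics.sh from the Kafka 0.8.2 distribution's bin directory; the partition and replication counts are illustrative):

# Sketch: recreate the input and update topics (partition/replication counts are illustrative).
kafka-topics.sh --create --zookeeper hadoop05:2181 --topic Test-ALSIn --partitions 4 --replication-factor 2
# The update topic's max.message.bytes should match oryx.update-topic.message.max-size (16777216).
kafka-topics.sh --create --zookeeper hadoop05:2181 --topic Test-ALSUp --partitions 1 --replication-factor 2 --config max.message.bytes=16777216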

 

At first, I wondered whether uploading data to Kafka directly from my own program was the cause. But the batch layer still crashed with ERR-1 after some time (about 6 hours) even when wget posted different movie-rating data every 5 minutes from crontab. It did drive me crazy, and I concluded that "spark.cleaner.ttl" on version 1.3.0 was the root cause of my problems.
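
The crontab entry was along these lines (a sketch; the ratings file path is illustrative, and /ingest is the upload endpoint from the oryx.io ALS example, served here on port 22222 per the config above):

# Sketch: every 5 minutes, POST a movie-rating file to the serving layer's /ingest endpoint.
*/5 * * * * wget --quiet --post-file /data/ratings/next.csv --output-document - http://hadoop05:22222/ingest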

 

Now, I am testing Oryx-2 on CDH with a compiled Spark 1.4.1. Since Oryx-2 is not stable enough for me and offers few Java algorithms, I am considering combining PredictionIO with it, which may be a better practice.

Cloudera Employee
Posts: 366
Registered: 07-29-2013

Re: Oryx-2 crashed after running for several hours

It sounds like you have a number of problems here. The first error
might indeed be caused by a too-short TTL, but it's already set by
Oryx to 20x the batch interval, which ought to be long enough by far.

It could also be a Spark problem of some kind. 1.3.0 is not an
unstable release, but it's entirely possible a bug was fixed along the
way since then. You can use 1.5.1 in CDH 5.5.

It could be something else. Ideally we would have a look at your
executor logs to see if anything else happened beforehand; this bit
sounds like a symptom rather than a cause.

The second error -- that too I honestly don't know. It means Spark
thinks its shuffle files disappeared for some reason. Are you running
the shuffle service? Did you shut it down? Any other errors? I have
not seen this problem in the Spark project in general.
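
(For context: your config sets dynamic-allocation = true, and dynamic allocation on YARN normally requires the external shuffle service on every NodeManager. A rough way to check, assuming default CDH config locations:)

# Rough check for the Spark external shuffle service on a NodeManager (paths assume plain CDH;
# Cloudera Manager deployments keep these under the process directories instead).
grep -A3 'yarn.nodemanager.aux-services' /etc/hadoop/conf/yarn-site.xml
#   should list spark_shuffle, backed by org.apache.spark.network.yarn.YarnShuffleService
grep 'spark.shuffle.service.enabled' /etc/spark/conf/spark-defaults.conf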

The Kafka error is likely unrelated, but it again looks like data got
corrupted somewhere in Spark as well. It's a wild guess, but if you had
a power failure or other hardware failure, is it possible some data
got corrupted on disk, or that the disk is having trouble?

You should use whatever software works best for you, and prediction.io
is a nice project. However, it also uses Spark, etc. I don't believe
this is an Oryx problem, and therefore I am not convinced you would not
just have the same sorts of issues in your environment. Without more
detail, I am not sure I can help much more.

If you have a CDH support contract you might contact support about the
Spark errors themselves for help in troubleshooting.

New Contributor
Posts: 5
Registered: 08-20-2015

Re: Oryx-2 crashed after running for several hours

All the error snippets above are from the temporary logs on the Spark application web pages.

ERR-2: What do you mean by "Are you running the shuffle service?"
As far as I know, shuffling is a common step in Spark jobs, but I have no idea about this
error either. It may be associated with ERR-1.

Actually, I heard of some changes to the TTL in later Spark versions. I tested Oryx several times on my new cluster (CDH 5.4.5 and Spark 1.4.1).
However, ERR-1 still showed up along with other errors. The snippets that follow are from the executor logs, which are really large:

-rw-r--r-- 1 hadoop wheel 225M Oct  7 08:56 uhadoop-43nqrv-core1_57289
-rw-r--r-- 1 hadoop wheel 346M Oct  7 08:56 uhadoop-43nqrv-task1_56246
-rw-r--r-- 1 hadoop wheel 565M Oct  7 08:56 uhadoop-43nqrv-task2_55836
-rw-r--r-- 1 hadoop wheel  92M Oct  7 08:56 uhadoop-43nqrv-task3_52174

You see, there were four executors -- core1, task1, task2, task3.

task1:

15/10/07 02:17:12 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 772878
15/10/07 02:17:12 INFO executor.Executor: Running task 62.0 in stage 70948.0 (TID 772878)
15/10/07 02:17:12 INFO storage.BlockManager: Found block rdd_36849_62 locally
15/10/07 02:17:12 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/10/07 02:17:12 INFO compress.CodecPool: Got brand-new compressor [.gz]
15/10/07 02:17:12 INFO output.FileOutputCommitter: Saved output of task 'attempt_201510070217_70948_m_000062_772878' to hdfs://Ucluster/user/hadoop/item/model/.temporary/1444155243829/0/Y/_temporary/0/task_201510070217_70948_m_000062
15/10/07 02:17:12 INFO mapred.SparkHadoopMapRedUtil: attempt_201510070217_70948_m_000062_772878: Committed
15/10/07 02:17:12 INFO executor.Executor: Finished task 62.0 in stage 70948.0 (TID 772878). 2054 bytes result sent to driver
15/10/07 02:17:12 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 772885
15/10/07 02:17:12 INFO executor.Executor: Running task 68.0 in stage 70948.0 (TID 772885)
15/10/07 02:17:12 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
15/10/07 02:17:12 INFO storage.DiskBlockManager: Shutdown hook called
15/10/07 02:17:12 INFO storage.BlockManager: Found block rdd_36849_68 locally
15/10/07 02:17:12 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/10/07 02:17:12 INFO compress.CodecPool: Got brand-new compressor [.gz]
15/10/07 02:17:12 INFO output.FileOutputCommitter: Saved output of task 'attempt_201510070217_70948_m_000068_772885' to hdfs://Ucluster/user/hadoop/item/model/.temporary/1444155243829/0/Y/_temporary/0/task_201510070217_70948_m_000068
15/10/07 02:17:12 INFO mapred.SparkHadoopMapRedUtil: attempt_201510070217_70948_m_000068_772885: Committed
15/10/07 02:17:12 INFO executor.Executor: Finished task 68.0 in stage 70948.0 (TID 772885). 2054 bytes result sent to driver

task2:

15/10/07 02:17:50 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/10/07 02:17:50 INFO compress.CodecPool: Got brand-new compressor [.gz]
15/10/07 02:17:50 INFO output.FileOutputCommitter: Saved output of task 'attempt_201510070217_70948_m_000140_773827' to hdfs://Ucluster/user/hadoop/item/model/.temporary/1444155243829/0/Y/_temporary/0/task_201510070217_70948_m_000140
15/10/07 02:17:50 INFO mapred.SparkHadoopMapRedUtil: attempt_201510070217_70948_m_000140_773827: Committed
15/10/07 02:17:50 INFO executor.Executor: Finished task 10.0 in stage 70948.1 (TID 773827). 2905 bytes result sent to driver
15/10/07 02:17:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 773833
15/10/07 02:17:51 INFO executor.Executor: Running task 0.0 in stage 70978.0 (TID 773833)
15/10/07 02:17:51 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 12989
15/10/07 02:17:51 INFO storage.MemoryStore: ensureFreeSpace(19557) called with curMem=15783408, maxMem=556038881
15/10/07 02:17:51 INFO storage.MemoryStore: Block broadcast_12989_piece0 stored as bytes in memory (estimated size 19.1 KB, free 515.2 MB)
15/10/07 02:17:51 INFO broadcast.TorrentBroadcast: Reading broadcast variable 12989 took 6 ms
15/10/07 02:17:51 INFO storage.MemoryStore: ensureFreeSpace(43280) called with curMem=15802965, maxMem=556038881
15/10/07 02:17:51 INFO storage.MemoryStore: Block broadcast_12989 stored as values in memory (estimated size 42.3 KB, free 515.2 MB)
15/10/07 02:17:51 INFO spark.CacheManager: Partition rdd_36848_0 not found, computing it
15/10/07 02:17:51 INFO storage.BlockManager: Found block rdd_36596_0 locally
15/10/07 02:17:51 INFO storage.BlockManager: Found block rdd_36596_0 locally
15/10/07 02:17:51 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 4425, fetching them
15/10/07 02:17:51 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645])
15/10/07 02:17:51 INFO spark.MapOutputTrackerWorker: Got the output locations
15/10/07 02:17:51 INFO storage.ShuffleBlockFetcherIterator: Getting 95 non-empty blocks out of 162 blocks
15/10/07 02:17:51 INFO client.TransportClientFactory: Found inactive connection to uhadoop-43nqrv-task1/10.10.185.149:47049, creating a new one.
15/10/07 02:17:51 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 18 outstanding blocks
15/10/07 02:17:51 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 18 outstanding blocks after 5000 ms
15/10/07 02:17:51 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 13 ms
15/10/07 02:17:56 INFO client.TransportClientFactory: Found inactive connection to uhadoop-43nqrv-task1/10.10.185.149:47049, creating a new one.
15/10/07 02:17:56 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 18 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to uhadoop-43nqrv-task1/10.10.185.149:47049
...

task3:

15/10/07 02:36:32 INFO util.Utils: Fetching http://10.10.103.41:59384/jars/spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar to /data/yarn/local/usercache/hadoop/appcache/application_1444137930733_0009/fetchFileTemp2650740863846490596.tmp
15/10/07 02:36:32 INFO util.Utils: Copying /data/yarn/local/usercache/hadoop/appcache/application_1444137930733_0009/-1880841261444139405591_cache to /data/yarn/local/usercache/hadoop/appcache/application_1444137930733_0009/container_1444137930733_0009_01_000013/./spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar
15/10/07 02:36:32 INFO executor.Executor: Adding file:/data/yarn/local/usercache/hadoop/appcache/application_1444137930733_0009/container_1444137930733_0009_01_000013/./spark-examples-1.4.1-hadoop2.6.0-cdh5.4.4.jar to class loader
15/10/07 02:36:32 INFO spark.MapOutputTrackerWorker: Updating epoch to 4767 and clearing cache
15/10/07 02:36:32 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 13622
15/10/07 02:36:32 INFO storage.MemoryStore: ensureFreeSpace(5809) called with curMem=0, maxMem=556038881
15/10/07 02:36:32 INFO storage.MemoryStore: Block broadcast_13622_piece0 stored as bytes in memory (estimated size 5.7 KB, free 530.3 MB)
15/10/07 02:36:32 INFO broadcast.TorrentBroadcast: Reading broadcast variable 13622 took 116 ms
15/10/07 02:36:33 INFO storage.MemoryStore: ensureFreeSpace(11720) called with curMem=5809, maxMem=556038881
15/10/07 02:36:33 INFO storage.MemoryStore: Block broadcast_13622 stored as values in memory (estimated size 11.4 KB, free 530.3 MB)
15/10/07 02:36:33 INFO spark.CacheManager: Partition rdd_38540_33 not found, computing it
15/10/07 02:36:33 INFO spark.CacheManager: Partition rdd_38524_33 not found, computing it
15/10/07 02:36:33 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 4620, fetching them
15/10/07 02:36:33 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645])
15/10/07 02:36:33 INFO spark.MapOutputTrackerWorker: Got the output locations
15/10/07 02:36:33 INFO storage.ShuffleBlockFetcherIterator: Getting 79 non-empty blocks out of 328 blocks
15/10/07 02:36:33 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 10 outstanding blocks
java.io.IOException: Failed to connect to uhadoop-43nqrv-task2/10.10.158.33:58676
...

15/10/07 02:36:33 INFO shuffle.RetryingBlockFetcher: Retrying fetch (1/3) for 10 outstanding blocks after 5000 ms
15/10/07 02:36:33 INFO storage.ShuffleBlockFetcherIterator: Started 6 remote fetches in 77 ms
15/10/07 02:36:38 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 10 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to uhadoop-43nqrv-task2/10.10.158.33:58676
...
...
15/10/07 02:36:48 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from uhadoop-43nqrv-task2:58676
java.io.IOException: Failed to connect to uhadoop-43nqrv-task2/10.10.158.33:58676
...

15/10/07 02:36:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 840756
15/10/07 02:36:51 INFO executor.Executor: Running task 3.0 in stage 73994.1 (TID 840756)
15/10/07 02:36:51 INFO spark.MapOutputTrackerWorker: Updating epoch to 4778 and clearing cache
15/10/07 02:36:51 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 13624
15/10/07 02:36:51 INFO storage.MemoryStore: ensureFreeSpace(5809) called with curMem=17529, maxMem=556038881
15/10/07 02:36:51 INFO storage.MemoryStore: Block broadcast_13624_piece0 stored as bytes in memory (estimated size 5.7 KB, free 530.3 MB)
15/10/07 02:36:51 INFO broadcast.TorrentBroadcast: Reading broadcast variable 13624 took 10 ms
15/10/07 02:36:51 INFO storage.MemoryStore: ensureFreeSpace(11720) called with curMem=23338, maxMem=556038881
15/10/07 02:36:51 INFO storage.MemoryStore: Block broadcast_13624 stored as values in memory (estimated size 11.4 KB, free 530.2 MB)
15/10/07 02:36:51 INFO spark.CacheManager: Partition rdd_38540_23 not found, computing it
15/10/07 02:36:51 INFO spark.CacheManager: Partition rdd_38524_23 not found, computing it
15/10/07 02:36:51 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 4620, fetching them
15/10/07 02:36:51 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645])
15/10/07 02:36:51 INFO spark.MapOutputTrackerWorker: Got the output locations
15/10/07 02:36:51 INFO storage.ShuffleBlockFetcherIterator: Getting 80 non-empty blocks out of 328 blocks
15/10/07 02:36:51 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 19 ms
15/10/07 02:36:51 INFO storage.MemoryStore: ensureFreeSpace(17024) called with curMem=35058, maxMem=556038881
15/10/07 02:36:51 INFO storage.MemoryStore: Block rdd_38524_23 stored as values in memory (estimated size 16.6 KB, free 530.2 MB)
15/10/07 02:36:51 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
15/10/07 02:36:51 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
15/10/07 02:36:51 INFO storage.MemoryStore: ensureFreeSpace(12984) called with curMem=52082, maxMem=556038881
15/10/07 02:36:51 INFO storage.MemoryStore: Block rdd_38540_23 stored as values in memory (estimated size 12.7 KB, free 530.2 MB)
15/10/07 02:36:51 INFO executor.Executor: Finished task 3.0 in stage 73994.1 (TID 840756). 2158 bytes result sent to driver
15/10/07 02:36:51 INFO storage.BlockManager: Removing RDD 38541
15/10/07 02:36:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 840783
...

core1:

15/10/07 06:42:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1581651
15/10/07 06:42:49 INFO executor.Executor: Running task 7.0 in stage 106533.0 (TID 1581651)
15/10/07 06:42:49 INFO spark.CacheManager: Partition rdd_56722_7 not found, computing it
15/10/07 06:42:49 INFO storage.BlockManager: Found block rdd_56464_7 locally
15/10/07 06:42:49 INFO storage.BlockManager: Found block rdd_56464_7 locally
15/10/07 06:42:49 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 6625, fetching them
15/10/07 06:42:49 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645])
15/10/07 06:42:49 INFO spark.MapOutputTrackerWorker: Got the output locations
15/10/07 06:42:49 INFO storage.ShuffleBlockFetcherIterator: Getting 121 non-empty blocks out of 183 blocks
15/10/07 06:42:49 INFO client.TransportClientFactory: Found inactive connection to uhadoop-43nqrv-task3/10.10.103.41:56230, creating a new one.
15/10/07 06:42:49 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 24 outstanding blocks
java.io.IOException: Failed to connect to uhadoop-43nqrv-task3/10.10.103.41:56230
...
...

15/10/07 06:43:04 ERROR storage.ShuffleBlockFetcherIterator: Failed to get block(s) from uhadoop-43nqrv-task3:56230
java.io.IOException: Failed to connect to uhadoop-43nqrv-task3/10.10.103.41:56230
...
...

15/10/07 06:43:04 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1581660
15/10/07 06:43:04 INFO executor.Executor: Running task 1.0 in stage 106532.0 (TID 1581660)
15/10/07 06:43:04 INFO spark.MapOutputTrackerWorker: Updating epoch to 6927 and clearing cache
15/10/07 06:43:04 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19536
15/10/07 06:43:04 INFO storage.MemoryStore: ensureFreeSpace(18798) called with curMem=62163615, maxMem=556038881
15/10/07 06:43:04 INFO storage.MemoryStore: Block broadcast_19536_piece0 stored as bytes in memory (estimated size 18.4 KB, free 471.0 MB)
15/10/07 06:43:04 INFO broadcast.TorrentBroadcast: Reading broadcast variable 19536 took 5 ms
15/10/07 06:43:04 INFO storage.MemoryStore: ensureFreeSpace(41400) called with curMem=62182413, maxMem=556038881
15/10/07 06:43:04 INFO storage.MemoryStore: Block broadcast_19536 stored as values in memory (estimated size 40.4 KB, free 470.9 MB)
15/10/07 06:43:04 INFO storage.BlockManager: Found block rdd_56479_6 locally
15/10/07 06:43:04 INFO spark.CacheManager: Partition rdd_56698_6 not found, computing it
15/10/07 06:43:04 INFO storage.BlockManager: Found block rdd_56478_6 locally
15/10/07 06:43:04 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 6624, fetching them
15/10/07 06:43:04 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645])
15/10/07 06:43:04 INFO spark.MapOutputTrackerWorker: Got the output locations
15/10/07 06:43:04 INFO storage.ShuffleBlockFetcherIterator: Getting 183 non-empty blocks out of 183 blocks
15/10/07 06:43:04 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 4 ms
15/10/07 06:43:04 INFO storage.MemoryStore: ensureFreeSpace(144) called with curMem=62223813, maxMem=556038881
15/10/07 06:43:04 INFO storage.MemoryStore: Block rdd_56698_6 stored as values in memory (estimated size 144.0 B, free 470.9 MB)
15/10/07 06:43:04 INFO executor.Executor: Finished task 1.0 in stage 106532.0 (TID 1581660). 2927 bytes result sent to driver
15/10/07 06:43:04 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1581667
15/10/07 06:43:04 INFO executor.Executor: Running task 11.0 in stage 106532.0 (TID 1581667)
15/10/07 06:43:04 INFO storage.BlockManager: Found block rdd_56479_51 locally
15/10/07 06:43:04 INFO spark.CacheManager: Partition rdd_56698_51 not found, computing it
15/10/07 06:43:04 INFO storage.BlockManager: Found block rdd_56478_51 locally
15/10/07 06:43:04 INFO storage.ShuffleBlockFetcherIterator: Getting 183 non-empty blocks out of 183 blocks
15/10/07 06:43:04 INFO storage.ShuffleBlockFetcherIterator: Started 5 remote fetches in 5 ms
15/10/07 06:43:04 INFO storage.MemoryStore: ensureFreeSpace(16) called with curMem=62223957, maxMem=556038881
15/10/07 06:43:04 INFO storage.MemoryStore: Block rdd_56698_51 stored as values in memory (estimated size 16.0 B, free 470.9 MB)
15/10/07 06:43:04 INFO executor.Executor: Finished task 11.0 in stage 106532.0 (TID 1581667). 2927 bytes result sent to driver
15/10/07 06:43:04 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1581668
...
...
15/10/07 06:43:44 WARN common.AppInfo$: Can't read Kafka version from MANIFEST.MF. Possible cause: java.lang.NullPointerException
15/10/07 06:43:44 INFO producer.Producer: Shutting down producer
15/10/07 06:43:44 INFO async.ProducerSendThread: Begin shutting down ProducerSendThread
15/10/07 06:43:44 INFO client.ClientUtils$: Fetching metadata from broker id:0,host:uhadoop-43nqrv-master1,port:9092 with correlation id 0 for 1 topic(s) Set(item_up)
15/10/07 06:43:44 INFO producer.SyncProducer: Connected to uhadoop-43nqrv-master1:9092 for producing
15/10/07 06:43:44 INFO producer.SyncProducer: Disconnecting from uhadoop-43nqrv-master1:9092
15/10/07 06:43:44 INFO producer.SyncProducer: Connected to uhadoop-43nqrv-master2:9092 for producing
15/10/07 06:43:44 INFO async.ProducerSendThread: Shutdown ProducerSendThread complete
15/10/07 06:43:44 INFO producer.ProducerPool: Closing all sync producers
15/10/07 06:43:44 INFO producer.SyncProducer: Disconnecting from uhadoop-43nqrv-master2:9092
15/10/07 06:43:44 INFO producer.Producer: Producer shutdown completed in 4 ms
15/10/07 06:43:44 INFO executor.Executor: Finished task 180.0 in stage 106587.0 (TID 1583910). 1761 bytes result sent to driver
15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583912
15/10/07 06:43:44 INFO executor.Executor: Running task 182.0 in stage 106587.0 (TID 1583912)
15/10/07 06:43:44 INFO rdd.HadoopRDD: Input split: hdfs://Ucluster/user/hadoop/item/model/1444171422997/Y/part-00182.gz:0+20
15/10/07 06:43:44 INFO executor.Executor: Finished task 182.0 in stage 106587.0 (TID 1583912). 1761 bytes result sent to driver
15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583913
15/10/07 06:43:44 INFO executor.Executor: Running task 1.0 in stage 106588.0 (TID 1583913)
15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19567
15/10/07 06:43:44 INFO storage.MemoryStore: ensureFreeSpace(3744) called with curMem=12250491, maxMem=556038881
15/10/07 06:43:44 INFO storage.MemoryStore: Block broadcast_19567_piece0 stored as bytes in memory (estimated size 3.7 KB, free 518.6 MB)
15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Reading broadcast variable 19567 took 3 ms
15/10/07 06:43:44 INFO storage.MemoryStore: ensureFreeSpace(7072) called with curMem=12254235, maxMem=556038881
15/10/07 06:43:44 INFO storage.MemoryStore: Block broadcast_19567 stored as values in memory (estimated size 6.9 KB, free 518.6 MB)
15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Reading broadcast variable 19567 took 3 ms
15/10/07 06:43:44 INFO storage.MemoryStore: ensureFreeSpace(7072) called with curMem=12254235, maxMem=556038881
15/10/07 06:43:44 INFO storage.MemoryStore: Block broadcast_19567 stored as values in memory (estimated size 6.9 KB, free 518.6 MB)
15/10/07 06:43:44 INFO storage.BlockManager: Found block rdd_56435_1 locally
15/10/07 06:43:44 INFO executor.Executor: Finished task 1.0 in stage 106588.0 (TID 1583913). 2171 bytes result sent to driver
15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583916
15/10/07 06:43:44 INFO executor.Executor: Running task 5.0 in stage 106588.0 (TID 1583916)
15/10/07 06:43:44 INFO spark.CacheManager: Partition rdd_56436_1 not found, computing it
15/10/07 06:43:44 INFO rdd.NewHadoopRDD: Input split: hdfs://Ucluster/user/hadoop/item/data/oryx-1444139505000.data/part-r-00001:0+43446
15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19445
15/10/07 06:43:44 ERROR executor.Executor: Exception in task 5.0 in stage 106588.0 (TID 1583916)
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_19445_piece0 of broadcast_19445
...
15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583917
15/10/07 06:43:44 INFO executor.Executor: Running task 5.1 in stage 106588.0 (TID 1583917)
15/10/07 06:43:44 INFO spark.CacheManager: Partition rdd_56436_1 not found, computing it
15/10/07 06:43:44 INFO rdd.NewHadoopRDD: Input split: hdfs://Ucluster/user/hadoop/item/data/oryx-1444139505000.data/part-r-00001:0+43446
15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19445
15/10/07 06:43:44 ERROR executor.Executor: Exception in task 5.1 in stage 106588.0 (TID 1583917)
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_19445_piece0 of broadcast_19445
...

15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583918
15/10/07 06:43:44 INFO executor.Executor: Running task 6.0 in stage 106588.0 (TID 1583918)
15/10/07 06:43:44 INFO spark.CacheManager: Partition rdd_56436_2 not found, computing it
15/10/07 06:43:44 INFO rdd.NewHadoopRDD: Input split: hdfs://Ucluster/user/hadoop/item/data/oryx-1444139505000.data/part-r-00002:0+41145
15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19445
15/10/07 06:43:44 ERROR executor.Executor: Exception in task 6.0 in stage 106588.0 (TID 1583918)
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_19445_piece0 of broadcast_19445
...
...

15/10/07 06:43:44 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583927
15/10/07 06:43:44 INFO executor.Executor: Running task 0.0 in stage 106591.0 (TID 1583927)
15/10/07 06:43:44 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 19570
15/10/07 06:43:45 INFO storage.MemoryStore: ensureFreeSpace(2298) called with curMem=12261307, maxMem=556038881
15/10/07 06:43:45 INFO storage.MemoryStore: Block broadcast_19570_piece0 stored as bytes in memory (estimated size 2.2 KB, free 518.6 MB)
15/10/07 06:43:45 INFO broadcast.TorrentBroadcast: Reading broadcast variable 19570 took 3 ms
15/10/07 06:43:45 INFO storage.MemoryStore: ensureFreeSpace(4064) called with curMem=12263605, maxMem=556038881
15/10/07 06:43:45 INFO storage.MemoryStore: Block broadcast_19570 stored as values in memory (estimated size 4.0 KB, free 518.6 MB)
15/10/07 06:43:45 INFO kafka.KafkaRDD: Beginning offset 113207 is the same as ending offset skipping item_in 0
15/10/07 06:43:45 INFO executor.Executor: Finished task 0.0 in stage 106591.0 (TID 1583927). 610 bytes result sent to driver
15/10/07 06:43:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1583928
...
...

15/10/07 07:19:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1705138
15/10/07 07:19:49 INFO executor.Executor: Running task 3.0 in stage 111407.0 (TID 1705138)
15/10/07 07:19:49 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 20433
15/10/07 07:19:49 INFO storage.MemoryStore: ensureFreeSpace(19811) called with curMem=37484559, maxMem=556038881
15/10/07 07:19:49 INFO storage.MemoryStore: Block broadcast_20433_piece0 stored as bytes in memory (estimated size 19.3 KB, free 494.5 MB)
15/10/07 07:19:49 INFO broadcast.TorrentBroadcast: Reading broadcast variable 20433 took 5 ms
15/10/07 07:19:49 INFO storage.MemoryStore: ensureFreeSpace(43584) called with curMem=37504370, maxMem=556038881
15/10/07 07:19:49 INFO storage.MemoryStore: Block broadcast_20433 stored as values in memory (estimated size 42.6 KB, free 494.5 MB)
15/10/07 07:19:49 INFO storage.BlockManager: Found block rdd_59081_3 locally
15/10/07 07:19:49 INFO executor.Executor: Finished task 3.0 in stage 111407.0 (TID 1705138). 2988 bytes result sent to driver
15/10/07 07:19:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1705141
15/10/07 07:19:49 INFO executor.Executor: Running task 9.0 in stage 111407.0 (TID 1705141)
15/10/07 07:19:49 INFO storage.BlockManager: Found block rdd_59081_9 locally
15/10/07 07:19:49 INFO executor.Executor: Finished task 9.0 in stage 111407.0 (TID 1705141). 3002 bytes result sent to driver
15/10/07 07:19:49 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1705143
15/10/07 07:19:49 INFO executor.Executor: Running task 14.0 in stage 111407.0 (TID 1705143)
15/10/07 07:19:49 INFO spark.CacheManager: Partition rdd_59081_14 not found, computing it
15/10/07 07:19:49 INFO storage.BlockManager: Found block rdd_58862_14 locally
15/10/07 07:19:49 INFO storage.BlockManager: Found block rdd_58862_14 locally
15/10/07 07:19:49 INFO spark.MapOutputTrackerWorker: Don't have map outputs for shuffle 6930, fetching them
15/10/07 07:19:49 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker endpoint = AkkaRpcEndpointRef(Actor[akka.tcp://sparkDriver@10.10.103.41:50672/user/MapOutputTracker#-1427047645])
15/10/07 07:19:49 INFO spark.MapOutputTrackerWorker: Got the output locations
15/10/07 07:19:49 INFO storage.ShuffleBlockFetcherIterator: Getting 119 non-empty blocks out of 187 blocks
15/10/07 07:19:49 INFO client.TransportClientFactory: Found inactive connection to uhadoop-43nqrv-task2/10.10.158.33:56100, creating a new one.
15/10/07 07:19:49 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 22 outstanding blocks
java.io.IOException: Failed to connect to uhadoop-43nqrv-task2/10.10.158.33:56100

I guess the executor on task1 terminated at 02:17, so task2's executor got disconnected from it and threw connection exceptions.
But the executors on task3 and core1 look a bit strange; their logs are much smaller than task1's and task2's.
Task3's threw connection exceptions at 02:36. Core1's threw connection exceptions first, and then ERR-1 at 06:43.
Now I am even more puzzled...

Another question: taking this ALS application as an example, are the three Oryx layers actually three applications on YARN? I found that the batch layer exited when the application OryxBatchLayer-item ended, and I only see OryxBatchLayer-item and OryxSpeedLayer-item on the YARN scheduler web page, but no OryxServingLayer-item?
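
(The command-line equivalent of what I see on the scheduler page would be something like:)

# Sketch: list YARN applications and look for the Oryx layers.
yarn application -list | grep -i oryx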

About logs, I redirected the stdout of each layer to a file, like "./oryx-run.sh batch --conf $Conf 2>&1 >> "$LogDir/$Date-batch".log &", and the files are extremely large!
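
(A side note on that command: with 2>&1 written before the >>, stderr still goes to the terminal; to capture both streams in the file, the redirect order has to be reversed, roughly:)

# Redirect stdout to the file first, then point stderr at it, so both end up in the log.
./oryx-run.sh batch --conf "$Conf" >> "$LogDir/$Date-batch.log" 2>&1 &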

 

-rw-r--r-- 1 hadoop wheel 340M Oct  1 15:32 2015-10-01-batch.log
-rw-r--r-- 1 hadoop wheel 430K Oct  1 15:30 2015-10-01-serving.log
-rw-r--r-- 1 hadoop wheel 2.7G Oct  5 11:13 2015-10-01-speed.log
-rw-r--r-- 1 hadoop wheel 718M Oct  6 04:37 2015-10-05-batch.log
-rw-r--r-- 1 hadoop wheel 288K Oct  6 19:25 2015-10-05-serving.log
-rw-r--r-- 1 hadoop wheel 825M Oct  6 19:33 2015-10-05-speed.log
-rw-r--r-- 1 hadoop wheel 662M Oct  7 07:43 2015-10-06-batch.log
-rw-r--r-- 1 hadoop wheel 215K Oct  7 08:52 2015-10-06-serving.log
-rw-r--r-- 1 hadoop wheel 120M Oct  7 14:38 2015-10-06-speed.log

In theory, the serving layer is more idle than the other two. Why not send the output of the batch and speed layers to the serving layer, and have the serving layer write all output to files? I suggest creating a simple logging module to handle this output instead of just printing it, so that all three layers can run as daemons.

Thanks!

 

 
