01-01-2018 02:52 AM
Hi everyone,

We have a cluster for real-time ETL with 60 data nodes (each node has 128 GB RAM and 16 cores). We use a Kafka direct stream to read 1,000 JSON messages per second (each message is about 20 KB on average) from Kafka, and after parsing and validation inside the Spark executors (100 executors) we write them to Kudu tables using the Kudu Java API. The batch interval is 20 seconds.

When the application starts, we create a static Kudu client inside each executor and reuse it for the lifetime of the application; for each mini-batch we create a separate Kudu session.

Although we assigned enough resources to the executors (YARN containers with at least 32 GB of memory and 2 vCores), after a while (about 15 batch intervals) the Spark application stops working with the exceptions below (shortened). I am wondering if anyone has had such a problem or can give some hints.

By the way, we are using Cloudera 5.13, the hosts run CentOS, and we are on JDK 1.8.0_131 with G1GC enabled.
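For reference, here is a simplified sketch of the write path described above (broker and Kudu master addresses, topic, table, and column names are placeholders, it assumes the spark-streaming-kafka-0-10 direct stream API, and the real job parses and validates the JSON before applying the operations):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kudu.client.{KuduClient, SessionConfiguration}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KuduEtl {

  // One KuduClient per executor JVM: the object is never serialized, so this
  // lazy val is initialized once per JVM and reused for the application lifetime.
  lazy val kuduClient: KuduClient =
    new KuduClient.KuduClientBuilder("kudu-master-1:7051,kudu-master-2:7051").build()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("realtime-etl")
    val ssc  = new StreamingContext(conf, Seconds(20)) // 20 s batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "etl-group",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // A fresh session per partition of each mini-batch, backed by the shared client.
        val table   = kuduClient.openTable("events_table")
        val session = kuduClient.newSession()
        session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH)

        records.foreach { rec =>
          // In the real job the JSON payload is parsed and validated here;
          // this sketch just writes the raw key/value as placeholder columns.
          val insert = table.newInsert()
          val row    = insert.getRow
          row.addString("id", rec.key())
          row.addString("payload", rec.value())
          session.apply(insert)
        }
        session.flush()
        session.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}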
[Stage 1:=======================================================>(99 + 1) / 100]
17/12/25 13:39:56 ERROR cluster.YarnClusterScheduler: Lost executor 8 on dnd14.: Container marked as failed: container_1514184901847_0003_02_000009 on host: dnd14. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
17/12/25 12:11:22 WARN netty.NettyRpcEnv: Ignored failure: java.util.concurrent.TimeoutException: Cannot receive any reply from 172.18.77.155:39054 in 10 seconds
17/12/25 12:11:04 WARN client.ConnectToCluster: Error receiving response from 172.18.77.203:7051
org.apache.kudu.client.RecoverableException: connection disconnected
at org.apache.kudu.client.Connection.channelDisconnected(Connection.java:244)
[Stage 0:> (0 + 7) / 100]
[Stage 0:> (0 + 8) / 100]
17/12/25 15:24:12 ERROR server.TransportRequestHandler: Error sending result RpcResponse{requestId=8269938358736103401, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to /172.18.77.25:40844; closing connection
io.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Java heap space
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:658)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:213)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:141)
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:281)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:158)
17/12/25 15:54:33 WARN client.ConnectToCluster: Error receiving response from 172.18.77.14:7051
org.apache.kudu.client.RecoverableException: Service unavailable: ConnectToMaster request on kudu.master.MasterService from 172.18.77.167:54742 dropped due to backpressure. The service queue is full; it has 100 items.
at org.apache.kudu.client.Connection.messageReceived(Connection.java:371)
at ...

I increased the service queue value to 200, but it seems that 200 is also not enough. Is there any drawback to increasing this parameter to higher values?

RpcResponse{requestId=6127623072593482695, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /172.18.77.25:36854; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
17/12/26 10:43:46 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /172.18.77.25:58328 is closed
17/12/26 10:43:46 ERROR spark.ContextCleaner: Error cleaning broadcast 35
org.apache.spark.SparkException: Exception thrown in awaitResult:
17/12/26 10:44:32 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 24 in stage 3.0 failed 4 times, most recent failure: Lost task 24.3 in stage 3.0 (TID 404, dnd49, executor 92): ExecutorLostFailure (executor 92 exited caused by one of the running tasks) Reason: Container marked as failed: container_1514269398853_0001_02_000096 on host: dnd49 Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
Container id: container_1514276028056_0001_01_000159
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)