01-01-2018 02:52 AM
Hi everyone,

We have a cluster for real-time ETL with 60 data nodes (each node has 128 GB RAM and 16 cores). We use a Kafka direct stream to read 1,000 JSON messages per second (each message is about 20 KB on average) from Kafka, and after parsing and validation inside the Spark executors (100 executors) we write them to Kudu tables using the Kudu Java API. The batch interval is 20 seconds.

When the application starts, we create a static Kudu client inside each executor and reuse it for the lifetime of the application; for each mini-batch we create a separate Kudu session.

Although we assigned enough resources to the executors (YARN containers with at least 32 GB of memory and 2 vCores), after a while (about 15 batch intervals) the Spark application stops working with the exceptions below (shortened). I am wondering if anyone has had such a problem or can give some hints.

By the way, we are using Cloudera 5.13, the hosts run CentOS, and we are on JDK 1.8.0_131 with G1GC enabled.
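For reference, here is a simplified sketch of the write path described above (broker and Kudu master addresses, topic, table, and column names are placeholders, it assumes the spark-streaming-kafka-0-10 direct stream API, and the real job parses and validates the JSON before applying the operations):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.kudu.client.{KuduClient, SessionConfiguration}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KuduEtl {

  // One KuduClient per executor JVM: the object is never serialized, so this
  // lazy val is initialized once per JVM and reused for the application lifetime.
  lazy val kuduClient: KuduClient =
    new KuduClient.KuduClientBuilder("kudu-master-1:7051,kudu-master-2:7051").build()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("realtime-etl")
    val ssc  = new StreamingContext(conf, Seconds(20)) // 20 s batch interval

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "etl-group",
      "auto.offset.reset"  -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams)
    )

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // A fresh session per partition of each mini-batch, backed by the shared client.
        val table   = kuduClient.openTable("events_table")
        val session = kuduClient.newSession()
        session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH)

        records.foreach { rec =>
          // In the real job the JSON payload is parsed and validated here;
          // this sketch just writes the raw key/value as placeholder columns.
          val insert = table.newInsert()
          val row    = insert.getRow
          row.addString("id", rec.key())
          row.addString("payload", rec.value())
          session.apply(insert)
        }
        session.flush()
        session.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}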
[Stage 1:=======================================================>(99 + 1) / 100]
17/12/25 13:39:56 ERROR cluster.YarnClusterScheduler: Lost executor 8 on dnd14.: Container marked as failed: container_1514184901847_0003_02_000009 on host: dnd14. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
17/12/25 12:11:22 WARN netty.NettyRpcEnv: Ignored failure: java.util.concurrent.TimeoutException: Cannot receive any reply from 172.18.77.155:39054 in 10 seconds
17/12/25 12:11:04 WARN client.ConnectToCluster: Error receiving response from 172.18.77.203:7051
org.apache.kudu.client.RecoverableException: connection disconnected
at org.apache.kudu.client.Connection.channelDisconnected(Connection.java:244)
[Stage 0:> (0 + 7) / 100]
[Stage 0:> (0 + 8) / 100]
17/12/25 15:24:12 ERROR server.TransportRequestHandler: Error sending result RpcResponse{requestId=8269938358736103401, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to /172.18.77.25:40844; closing connection
io.netty.handler.codec.EncoderException: java.lang.OutOfMemoryError: Java heap space
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:106)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:658)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:237)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:213)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:141)
at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:281)
at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:158)
17/12/25 15:54:33 WARN client.ConnectToCluster: Error receiving response from 172.18.77.14:7051
org.apache.kudu.client.RecoverableException: Service unavailable: ConnectToMaster request on kudu.master.MasterService from 172.18.77.167:54742 dropped due to backpressure. The service queue is full; it has 100 items.
at org.apache.kudu.client.Connection.messageReceived(Connection.java:371)
at ...

I increased the service queue value to 200, but it seems that 200 is also not enough. Is there any drawback to increasing this parameter to higher values?

RpcResponse{requestId=6127623072593482695, body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=81 cap=156]}} to /172.18.77.25:36854; closing connection
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
17/12/26 10:43:46 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /172.18.77.25:58328 is closed
17/12/26 10:43:46 ERROR spark.ContextCleaner: Error cleaning broadcast 35
org.apache.spark.SparkException: Exception thrown in awaitResult:
17/12/26 10:44:32 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 24 in stage 3.0 failed 4 times, most recent failure: Lost task 24.3 in stage 3.0 (TID 404, dnd49, executor 92): ExecutorLostFailure (executor 92 exited caused by one of the running tasks) Reason: Container marked as failed: container_1514269398853_0001_02_000096 on host: dnd49 Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
Container id: container_1514276028056_0001_01_000159
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)