Why YarnClientSchedulerBackend: Yarn application has already exited with state FAILED!

Contributor

I am using HDP-3.1 (3.1.0.0-78) and have enabled the Spark Thrift Server to serve queries over data stored on HDFS. After this error occurs, the only way I can reconnect to the Thrift Server is by restarting the service. I still don't know the root cause; the error details are below.

23/05/15 14:37:38 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[hidden.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]], original=[DatanodeInfoWithStorage[10.210.11.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
23/05/15 14:38:38 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V8
23/05/15 14:38:38 INFO SessionState: Created local directory: /tmp/0b042f0f-e760-449c-8fb8-a6cd3391cb01_resources
23/05/15 14:38:38 INFO SessionState: Created HDFS directory: /tmp/spark/spark/0b042f0f-e760-449c-8fb8-a6cd3391cb01
23/05/15 14:38:38 INFO SessionState: Created local directory: /tmp/spark/0b042f0f-e760-449c-8fb8-a6cd3391cb01
23/05/15 14:38:38 INFO SessionState: Created HDFS directory: /tmp/spark/spark/0b042f0f-e760-449c-8fb8-a6cd3391cb01/_tmp_space.db
23/05/15 14:38:38 INFO HiveSessionImpl: Operation log session directory is created: /tmp/spark/operation_logs/0b042f0f-e760-449c-8fb8-a6cd3391cb01
23/05/15 14:38:38 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[hidden.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]], original=[DatanodeInfoWithStorage[10.210.11.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
23/05/15 14:38:38 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[hidden.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]], original=[DatanodeInfoWithStorage[10.210.11.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
23/05/15 14:39:04 INFO Client: Deleted staging directory hdfs://fpro-pti-hadoop-01:8020/user/spark/.sparkStaging/application_1681374888849_0135
23/05/15 14:39:04 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FAILED!
23/05/15 14:39:04 INFO HiveServer2: Shutting down HiveServer2
23/05/15 14:39:04 INFO ThriftCLIService: Thrift server has stopped
23/05/15 14:39:04 INFO AbstractService: Service:ThriftBinaryCLIService is stopped.
23/05/15 14:39:04 INFO AbstractService: Service:OperationManager is stopped.
23/05/15 14:39:04 INFO AbstractService: Service:SessionManager is stopped.
23/05/15 14:39:04 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[hidden.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]], original=[DatanodeInfoWithStorage[10.210.11.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
23/05/15 14:39:04 INFO AbstractConnector: Stopped Spark@5a6d5a8f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
23/05/15 14:39:04 INFO SparkUI: Stopped Spark web UI at http://fpro-pti-hadoop-05:4040
23/05/15 14:39:05 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 193.
23/05/15 14:39:05 INFO DAGScheduler: Executor lost: 193 (epoch 83)
23/05/15 14:39:05 INFO BlockManagerMasterEndpoint: Trying to remove executor 193 from BlockManagerMaster.
23/05/15 14:39:05 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(193, fpro-pti-hadoop-06, 35562, None)
23/05/15 14:39:05 INFO BlockManagerMaster: Removed 193 successfully in removeExecutor
23/05/15 14:39:05 ERROR TransportClient: Failed to send RPC 8099143512573728922 to /hidden.16:36858: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
23/05/15 14:39:05 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to get executor loss reason for executor id 193 at RPC address hidden.15:57056, but got no response. Marking as slave lost.
java.io.IOException: Failed to send RPC 8099143512573728922 to /hidden.16:36858: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
23/05/15 14:39:05 ERROR YarnScheduler: Lost executor 193 on fpro-pti-hadoop-06: Slave lost
23/05/15 14:39:14 INFO AbstractService: Service:CLIService is stopped.
23/05/15 14:39:14 INFO AbstractService: Service:HiveServer2 is stopped.
23/05/15 14:39:14 ERROR Utils: Uncaught exception in thread Yarn application state monitor
java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1072)
at java.io.BufferedWriter.close(BufferedWriter.java:266)
at java.io.PrintWriter.close(PrintWriter.java:339)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$stop$1.apply(EventLoggingListener.scala:242)
at org.apache.spark.scheduler.EventLoggingListener$$anonfun$stop$1.apply(EventLoggingListener.scala:242)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:242)
at org.apache.spark.SparkContext$$anonfun$stop$7$$anonfun$apply$mcV$sp$5.apply(SparkContext.scala:1927)
at org.apache.spark.SparkContext$$anonfun$stop$7$$anonfun$apply$mcV$sp$5.apply(SparkContext.scala:1927)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1927)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1361)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1926)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:112)
Caused by: java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[hidden.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]], original=[DatanodeInfoWithStorage[hidden.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
23/05/15 14:39:15 ERROR TransportClient: Failed to send RPC 4995455936492514826 to /hidden.16:36858: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
23/05/15 14:39:15 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC 4995455936492514826 to /hidden.16:36858: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
23/05/15 14:39:15 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
services=List(),
started=false)
23/05/15 14:39:15 ERROR Utils: Uncaught exception in thread Yarn application state monitor
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:565)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:95)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:155)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:508)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1804)
at org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1361)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1930)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:112)
Caused by: java.io.IOException: Failed to send RPC 4995455936492514826 to /hidden.16:36858: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient.lambda$sendRpc$2(TransportClient.java:237)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
23/05/15 14:39:15 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/05/15 14:39:15 INFO MemoryStore: MemoryStore cleared
23/05/15 14:39:15 INFO BlockManager: BlockManager stopped
23/05/15 14:39:15 INFO BlockManagerMaster: BlockManagerMaster stopped
23/05/15 14:39:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/05/15 14:39:16 INFO SparkContext: Successfully stopped SparkContext

 

 


Community Manager

@sonnh Welcome to our community! To help you get the best possible answer, I have tagged our Spark/HDFS experts @RangaReddy @smdas @Asok, who may be able to assist you further.

Please feel free to provide any additional information or details about your query, and we hope that you will find a satisfactory solution to your question.



Regards,

Vidya Sargur,
Community Manager



Contributor

Thanks Vidya. My team is using the Spark Thrift Server here, and a lot of queries run through it, so I am not sure which specific query leads to this situation. After a period of use we encountered the issue again and had to restart the service. :((

Master Collaborator

Hi @sonnh 

 

Based on the exception below, the failure looks like it is being caused by DataNode issues. One thing you can try is disabling the event log and submitting the Spark application again. If you still see the exception below after that, it is better to create a Cloudera case against the HDFS component and we will look into the issue.

 

spark-submit --conf spark.eventLog.enabled=false
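
For the Thrift Server specifically, a minimal sketch of how that might be applied, assuming the usual HDP layout where the service is started with start-thriftserver.sh (the path below is an assumption; in an Ambari-managed cluster the equivalent is setting the property in the Thrift Server's spark-defaults configuration):

# Sketch (assumed HDP path): restart the Thrift Server without event logging
/usr/hdp/current/spark2-client/sbin/start-thriftserver.sh \
  --master yarn \
  --conf spark.eventLog.enabled=false

# Alternative: make it permanent in spark-defaults.conf for the Thrift Server
spark.eventLog.enabled false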

 

 

23/05/15 14:39:04 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[hidden.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]], original=[DatanodeInfoWithStorage[10.210.11.11:50010,DS-3f249b69-b437-47ca-8433-b8305db6ea7f,DISK], DatanodeInfoWithStorage[hidden.10:50010,DS-db35488d-9ea2-45c5-938c-2f78b0b9ad5a,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1304)
at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1372)
at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1598)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1499)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1481)
at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1256)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:667)
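
The exception itself names the client-side HDFS setting involved. As a hedged workaround only (it does not address why DataNodes are dropping out of the write pipeline), the pipeline-recovery policy can be relaxed for the Thrift Server's HDFS client, for example by passing the Hadoop properties through Spark's spark.hadoop.* prefix, appended to the start-thriftserver.sh / spark-submit invocation above (a sketch; the same values could equally go into the client hdfs-site.xml):

# Sketch: relax HDFS client pipeline recovery for the event-log writer
# (workaround only; most relevant on small clusters where no replacement DataNode exists)
--conf spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.policy=NEVER \
--conf spark.hadoop.dfs.client.block.write.replace-datanode-on-failure.best-effort=true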

 

  

Contributor

Thanks RangaReddy. It is not only my team; many other companies also run into this issue with the Spark Thrift Server. Do I need to provide any additional information when creating the Cloudera case, or is the description above enough?

Master Collaborator

Hi @sonnh 

You can go ahead and raise the HDFS case, uploading all the required logs, such as the NodeManager (NM) and DataNode (DN) logs.
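
For reference, a small sketch of commands commonly used to collect that evidence (the application ID is taken from the log above; output file names are placeholders):

# Aggregated YARN container logs for the failed Thrift Server application
yarn logs -applicationId application_1681374888849_0135 > sts_application.log

# DataNode health overview: look for dead nodes, failed volumes, or low remaining space
hdfs dfsadmin -report

The NodeManager and DataNode daemon logs themselves live under the usual Hadoop log directories on the affected hosts (for example /var/log/hadoop-yarn and /var/log/hadoop/hdfs on HDP; exact paths depend on the installation).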