Member since 10-20-2016 | 7 Posts | 0 Kudos Received | 0 Solutions
04-03-2017
06:28 AM
@Namit Maheshwari - Yes, in the default case it wouldn't require a restart. But I want to restart it manually: I have two instances of the History Server and manage them with an external monitoring service, so I need to restart at least the History Server when the node running it goes down. I have two DataNodes in the cluster, and yes, one of them runs on the host I want to shut down.
03-31-2017
08:10 AM
@Divakar Annapureddy It isn't working. I think the issue is "NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn2', 'node2:50070')], unknown_namenodes = [(u'nn1', 'node1:50070')]". When I manually shut down node1, the NameNode states don't change, as I saw in the start logs.
03-31-2017
07:07 AM
@Namit Maheshwari As you asked: I shut down the machine manually to test the high availability of the cluster. The ZKFC on that machine shuts down automatically when the machine is shut down. I am trying to restart the NameNode or the History Server from the Oozie web UI, and I am using HDP-2.5. I think you are asking for the NameNode start logs:
2017-03-30 20:50:20,863 - Waiting for the NameNode to broadcast whether it is Active or Standby...
2017-03-30 20:50:20,866 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://node1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpZwMQaz 2>/tmp/tmp8u5jXA''] {'quiet': False}
2017-03-30 20:52:28,369 - call returned (7, '')
2017-03-30 20:52:28,370 - Getting jmx metrics from NN failed. URL: http://node1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/jmx.py", line 38, in get_value_from_jmx
_, data, _ = get_user_call_output(cmd, user=run_user, quiet=False)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/functions/get_user_call_output.py", line 61, in get_user_call_output
raise Fail(err_msg)
Fail: Execution of 'curl -s 'http://node1:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem' 1>/tmp/tmpZwMQaz 2>/tmp/tmp8u5jXA' returned 7.
2017-03-30 20:52:28,371 - call['hdfs haadmin -ns NameNodeURI -getServiceState nn1'] {'logoutput': True, 'user': 'hdfs'}
17/03/30 20:52:50 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 0 time(s); maxRetries=45
17/03/30 20:53:10 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 1 time(s); maxRetries=45
17/03/30 20:53:30 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 2 time(s); maxRetries=45
.
.
17/03/30 21:07:11 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 43 time(s); maxRetries=45
17/03/30 21:07:31 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 44 time(s); maxRetries=45
Operation failed: Call From node2/10.10.2.82 to node1:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=node1/10.10.2.81:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
2017-03-30 21:07:51,283 - call returned (255, '17/03/30 20:52:50 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 0 time(s); maxRetries=45\n17/03/30 20:53:10 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 1 time(s); maxRetries=45\n17/03/30 20:53:30 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 2 time(s); maxRetries=45\n17/03/30 20:53:50 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 3 time(s); maxRetries=45\n17/03/30 20:54:10 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 4 time(s); maxRetries=45\n17/03/30
.
.
.
21:07:11 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 43 time(s); maxRetries=45\n17/03/30 21:07:31 INFO ipc.Client: Retrying connect to server: node1/10.10.2.81:8020. Already tried 44 time(s); maxRetries=45\nOperation failed: Call From node2/10.10.2.82 to node1:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=node1/10.10.2.81:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout')
2017-03-30 21:07:51,284 - call['ambari-sudo.sh su hdfs -l -s /bin/bash -c 'curl -s '"'"'http://node2:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'"'"' 1>/tmp/tmpYw7oMN 2>/tmp/tmpk0tfnh''] {'quiet': False}
2017-03-30 21:07:51,544 - call returned (0, '')
2017-03-30 21:07:51,547 - NameNode HA states: active_namenodes = [], standby_namenodes = [(u'nn2', 'node2:50070')], unknown_namenodes = [(u'nn1', 'node1:50070')]
2017-03-30 21:07:51,547 - Will retry 4 time(s), caught exception: No active NameNode was found.. Sleeping for 5 sec(s)
This sequence keeps repeating until the start script times out.
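To make the last two log lines easier to read, here is a minimal Python sketch of how the three lists could be derived. It is not Ambari's actual code: the function names are hypothetical, and it assumes the FSNamesystem bean returned by the `/jmx` query carries a `tag.HAState` attribute with the value `active` or `standby`. A NameNode whose query returns nothing (as with curl exit code 7 for node1) ends up in the unknown list.

```python
import json


def ha_state_from_jmx(body):
    """Return 'active' or 'standby' from a JMX response body, else None."""
    if not body:  # the query failed and produced no output
        return None
    try:
        doc = json.loads(body)
    except ValueError:
        return None
    beans = doc.get("beans", []) if isinstance(doc, dict) else []
    for bean in beans:
        # Assumption: the HA state is exposed as the "tag.HAState" attribute.
        state = bean.get("tag.HAState") if isinstance(bean, dict) else None
        if state in ("active", "standby"):
            return state
    return None


def classify_namenodes(responses):
    """Group (nn_id, address) pairs by the state their JMX body reports."""
    groups = {"active": [], "standby": [], "unknown": []}
    for namenode, body in responses.items():
        groups[ha_state_from_jmx(body) or "unknown"].append(namenode)
    return groups


# Hypothetical responses mirroring the log: node1 is down, node2 is standby.
responses = {
    ("nn1", "node1:50070"): None,
    ("nn2", "node2:50070"): '{"beans": [{"tag.HAState": "standby"}]}',
}
states = classify_namenodes(responses)
print(states)
```

With node1 unreachable and node2 still in standby, the active list stays empty, which matches the "No active NameNode was found" retry in the log.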
03-30-2017
05:09 PM
I have a 3-node cluster with high availability for the NameNode. When I shut down one of the two machines that host a NameNode instance and try to restart the active NameNode, it fails with the error "Getting jmx metrics from NN failed". While debugging, I noticed that the start script makes a JMX request to each NameNode multiple times to get its state, and finally ends with the error "python script has been killed due to timeout after waiting 1800 secs".
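As a rough sketch of where the 1800 seconds go, the timestamps from the NameNode start log posted in the follow-up (03-31-2017) can be subtracted directly. The variable names below are illustrative, and the reading that a single state-check pass consumes most of the budget is an inference from those timestamps, not something the logs state outright.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start, end):
    """Elapsed seconds between two timestamps in the start-log format."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Timestamps copied from the start log: curl against node1, then haadmin.
curl_wait = seconds_between("2017-03-30 20:50:20,866", "2017-03-30 20:52:28,369")
haadmin_wait = seconds_between("2017-03-30 20:52:28,371", "2017-03-30 21:07:51,283")
one_pass = seconds_between("2017-03-30 20:50:20,863", "2017-03-30 21:07:51,547")

# The haadmin call retries 45 times with a 20000 ms connect timeout each.
ipc_retry_budget = 45 * 20

SCRIPT_TIMEOUT = 1800  # seconds, from the final error message
full_passes = int(SCRIPT_TIMEOUT // one_pass)

print(curl_wait, haadmin_wait, one_pass, ipc_retry_budget, full_passes)
```

On these numbers, one pass over both NameNodes takes roughly 17.5 minutes, so the script is killed during the second pass.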
Labels:
- Apache Hadoop
10-20-2016
02:55 PM
A Spark job fails when it is run through an Oozie workflow, while the same job run with spark-submit (in cluster mode) works fine. There are two types of error in the container logs.
ERROR-1: Containers created on the node that is running the Spark History Server
ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from hostname/node-ip:55963 is closed
Exception in thread "main" java.io.IOException: Connection from hostname/node-ip:55963 closed
at org.apache.spark.network.client.TransportResponseHandler.channelUnregistered(TransportResponseHandler.java:124)
at org.apache.spark.network.server.TransportChannelHandler.channelUnregistered(TransportChannelHandler.java:94)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.ChannelInboundHandlerAdapter.channelUnregistered(ChannelInboundHandlerAdapter.java:53)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelUnregistered(AbstractChannelHandlerContext.java:158)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelUnregistered(AbstractChannelHandlerContext.java:144)
at io.netty.channel.DefaultChannelPipeline.fireChannelUnregistered(DefaultChannelPipeline.java:739)
at io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:659)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
and
WARN TransportChannelHandler: Exception in connection from /node-ip:37174
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from hostname/node-ip:37174 is closed
WARN CoarseGrainedExecutorBackend: An unknown (hostname:37174) driver disconnected.
WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(2,[Lscala.Tuple2;@203ae34b,BlockManagerId(2, hostname, 53633))] in 1 attempts
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
INFO DiskBlockManager: Shutdown hook called
INFO ShutdownHookManager: Shutdown hook called
ERROR-2: Containers created on all other nodes
ERROR TransportClient: Failed to send RPC 5136988924558751149 to /node-ip:37174: java.lang.AbstractMethodError: org.apache.spark.network.protocol.MessageWithHeader.touch(Ljava/lang/Object;)Lio/netty/util/ReferenceCounted;
java.lang.AbstractMethodError: org.apache.spark.network.protocol.MessageWithHeader.touch(Ljava/lang/Object;)Lio/netty/util/ReferenceCounted;
at io.netty.util.ReferenceCountUtil.touch(ReferenceCountUtil.java:73)
at io.netty.channel.DefaultChannelPipeline.touch(DefaultChannelPipeline.java:107)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:796)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:709)
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:111)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:724)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:716)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:802)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:709)
at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:724)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:716)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:36)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1064)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1111)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1049)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:339)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:393)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "main" java.io.IOException: Failed to send RPC 5136988924558751149 to /node-ip:37174: java.lang.AbstractMethodError: org.apache.spark.network.protocol.MessageWithHeader.touch(Ljava/lang/Object;)Lio/netty/util/ReferenceCounted;
at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:239)
at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:226)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:518)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:511)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.notifyListenersWithStackOverFlowProtection(DefaultPromise.java:431)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:126)
at io.netty.channel.AbstractChannelHandlerContext.notifyOutboundHandlerException(AbstractChannelHandlerContext.java:821)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:726)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:716)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:802)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:709)
at io.netty.handler.timeout.IdleStateHandler.write(IdleStateHandler.java:284)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:724)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:716)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:36)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1064)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1111)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1049)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:339)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:393)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:742)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.AbstractMethodError: org.apache.spark.network.protocol.MessageWithHeader.touch(Ljava/lang/Object;)Lio/netty/util/ReferenceCounted;
at io.netty.util.ReferenceCountUtil.touch(ReferenceCountUtil.java:73)
at io.netty.channel.DefaultChannelPipeline.touch(DefaultChannelPipeline.java:107)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:796)
at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:709)
at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:111)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:724)
... 14 more
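One possible reading of ERROR-2, offered as an assumption rather than a confirmed diagnosis: `ReferenceCounted.touch` exists in Netty 4.1 but not in Netty 4.0, so an `AbstractMethodError` on `MessageWithHeader.touch` is what one would expect if Spark classes built against Netty 4.0 are loaded next to a Netty 4.1 jar. A minimal Python sketch for spotting that kind of duplicate on a container classpath follows; the function name and the jar names are hypothetical and are not taken from this cluster.

```python
import re
from collections import defaultdict

# Splits "artifact-version.jar" at the first hyphen followed by a digit.
JAR_PATTERN = re.compile(
    r"^(?P<artifact>.+?)-(?P<version>\d+(?:\.\d+)*(?:[.-][A-Za-z0-9]+)*)\.jar$"
)


def find_conflicting_jars(jar_names):
    """Return {artifact: [versions]} for artifacts present in more than one version."""
    versions = defaultdict(set)
    for name in jar_names:
        match = JAR_PATTERN.match(name)
        if match:  # skip files that do not look like versioned jars
            versions[match.group("artifact")].add(match.group("version"))
    return {
        artifact: sorted(found)
        for artifact, found in versions.items()
        if len(found) > 1
    }


# Hypothetical classpath listing, for illustration only.
jars = [
    "netty-all-4.0.29.Final.jar",
    "netty-all-4.1.6.Final.jar",
    "guava-14.0.1.jar",
    "notes.txt",
]
conflicts = find_conflicting_jars(jars)
print(conflicts)
```

Running this over the jar names in a failing container's working directory and over the jars shipped with the workflow would show whether two Netty versions are actually present.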
Labels:
- Apache Oozie
- Apache Spark
- Apache YARN