Created 02-01-2018 12:04 PM
Hi,
After an accidental power-off, one of the slave nodes in my cluster (3 nodes: one master and two slaves; slave01 is the one that failed) could not boot. It reported that it "contains a file system with errors, check forced", so I went through this solution using "fsck -f ...":
https://askubuntu.com/questions/955467/dev-sda1-contains-a-file-system-with-errors-check-forced
"fsck -f ..." fixed several files and the desktop came bcak.
However, after restarting Cloudera Manager, the NameNode fell into safe mode. I then turned off safe mode manually, and two errors appeared: Missing_blocks and Under_replicated_blocks. They claimed that 99.999% of the blocks in the cluster are missing and that 99.999% of the blocks in the cluster need to be replicated. If I restart CM, safe mode comes back.
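For the record, I left safe mode with the standard dfsadmin command, run as the hdfs user on the master:
sudo -u hdfs hdfs dfsadmin -safemode leave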
Then I checked the logs from the NameNode and both DataNodes.
In the NameNode log:
7:02:29.620 PM | WARN | Server | Requested data length 69250013 is longer than maximum configured RPC length 67108864. RPC came from 192.168.1.102 |
7:02:29.621 PM | INFO | Server | Socket Reader #1 for port 8022: readAndProcess from client 192.168.1.102 threw exception [java.io.IOException: Requested data length 69250013 is longer than maximum configured RPC length 67108864. RPC came from 192.168.1.102]
java.io.IOException: Requested data length 69250013 is longer than maximum configured RPC length 67108864. RPC came from 192.168.1.102
    at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1610)
    at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1672)
    at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:896)
    at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:752)
    at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:723) |
7:02:30.167 PM | WARN | Server | Requested data length 69251091 is longer than maximum configured RPC length 67108864. RPC came from 192.168.1.103 |
7:02:30.167 PM | INFO | Server | Socket Reader #1 for port 8022: readAndProcess from client 192.168.1.103 threw exception [java.io.IOException: Requested data length 69251091 is longer than maximum configured RPC length 67108864. RPC came from 192.168.1.103]
java.io.IOException: Requested data length 69251091 is longer than maximum configured RPC length 67108864. RPC came from 192.168.1.103
    at org.apache.hadoop.ipc.Server$Connection.checkDataLength(Server.java:1610)
    at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1672)
    at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:896)
    at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:752)
    at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:723) |
In the DataNode log on slave02, which did not fail:
7:36:16.878 PM | INFO | DataNode | Unsuccessfully sent block report 0x11a70b7faba74214, containing 1 storage report(s), of which we sent 0. The reports had 5918818 total blocks and used 0 RPC(s). This took 283 msec to generate and 106 msecs for RPC and NN processing. Got back no commands. |
7:36:16.878 PM | WARN | DataNode | IOException in offerService
java.io.EOFException: End of File Exception between local host is: "slave02/192.168.1.103"; destination host is: "master":8022; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
    at sun.reflect.GeneratedConstructorAccessor9.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
    at org.apache.hadoop.ipc.Client.call(Client.java:1508)
    at org.apache.hadoop.ipc.Client.call(Client.java:1441)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy23.blockReport(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:204)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:323)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:561)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:695)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006) |
In the DataNode log on slave01, which failed:
7:38:47.747 PM | INFO | DataNode | Unsuccessfully sent block report 0x519b781f0b8dd8ed, containing 1 storage report(s), of which we sent 0. The reports had 5918715 total blocks and used 0 RPC(s). This took 496 msec to generate and 100 msecs for RPC and NN processing. Got back no commands. |
7:38:47.747 PM | WARN | DataNode | IOException in offerService
java.io.EOFException: End of File Exception between local host is: "slave01/192.168.1.102"; destination host is: "master":8022; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
    at sun.reflect.GeneratedConstructorAccessor9.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764)
    at org.apache.hadoop.ipc.Client.call(Client.java:1508)
    at org.apache.hadoop.ipc.Client.call(Client.java:1441)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy23.blockReport(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:204)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:323)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:561)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:695)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006) |
These events repeat continuously.
As I understand it, the system wants to recover the missing blocks, but it cannot succeed because:
Requested data length 69250013 (this number varies with the block count) is longer than maximum configured RPC length 67108864.
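For a sense of scale: 67108864 bytes is 64 MiB (which I believe is the default for this RPC limit), while the block reports above (5918715 and 5918818 blocks) serialize to about 69.25 MB, roughly 11.7 bytes per block, so any single report beyond roughly 5.7 million blocks would exceed the 64 MiB limit.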
I checked online, and someone suggests changing the configuration property "ipc.maximum.data.length" in core-default.xml (https://community.hortonworks.com/questions/101841/issue-requested-data-length-146629817-is-longer-t...).
But I'm using CDH 5.13 with Hadoop 2.6, and "ipc.maximum.data.length" was introduced in Hadoop 2.8, so I can't find it on the CM configuration pages.
Can I add this property myself somewhere, either for the NameNode or for the entire HDFS service? How and where can I add it?
Then I found another similar question asked in our community by RakeshE: https://community.cloudera.com/t5/Storage-Random-Access-HDFS/ISSUE-Requested-data-length-146629817-i...
The solution given by weichiu says the problem cannot be solved by adjusting "ipc.maximum.data.length"; instead, small files should be deleted to decrease the block count, and the cluster rebalanced. I also have around 6 million blocks, but I first need to be able to read and write the files before I can delete any of them.
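Once the cluster is readable again, I assume something like the standard count command could show which directories hold the most files, so the small ones can be targeted for deletion or consolidation (/data below is only a placeholder for wherever the small files actually live; the second output column is the file count per directory):
sudo -u hdfs hdfs dfs -count /data/*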
Please give me some suggestions on what I should do to fix the cluster. Thanks in advance!
Created 02-04-2018 11:30 PM
Hi,
Did you try adding the IPC parameter to the core-site.xml file:
<property>
<name>ipc.maximum.data.length</name>
<value>134217728</value>
</property>
Then restart your agent.
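If it helps, one way to sanity-check that the new value actually shows up in the effective configuration on a host after the restart is the getconf command:
hdfs getconf -confKey ipc.maximum.data.length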
Created 10-29-2018 12:51 PM