Member since: 08-16-2016
Posts: 48
Kudos Received: 9
Solutions: 4
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 5120 | 12-28-2018 10:21 AM
 | 6089 | 08-28-2018 10:58 AM
 | 3360 | 10-18-2016 11:08 AM
 | 3984 | 10-16-2016 10:13 AM
05-26-2017
01:21 PM
It looks like the issue described in https://issues.apache.org/jira/browse/HDFS-11254 (Standby NameNode may crash during failover if loading edits takes too long):

2017-05-25 14:40:37,740 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: replaying edit log: 137515/140868 transactions completed. (98%)
2017-05-25 14:41:27,207 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Loaded 140868 edits starting from txid 20804872532

It took 50 seconds to load the edits. Edit log loading must acquire the NameNode lock, so the ZKFC may fail to establish a connection with the NameNode while loading is in progress.

at org.apache.hadoop.hdfs.server.namenode.NameNode.transitionToActive(NameNode.java:1640)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.transitionToActive(NameNodeRpcServer.java:1375)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToActive(HAServiceProtocolServerSideTranslatorPB.java:107)
at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4460)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)

This stack shows it happened during the transition from standby to active. It may be fixed by HDFS-8865 (Improve quota initialization performance). I suspect the stack dump in the log is not complete; if the slowdown is the quota initialization that HDFS-8865 addresses, you would see a stack trace like:

Thread 188 (IPC Server handler 25 on 8022):
State: RUNNABLE
Blocked count: 278
Waited count: 17419
Stack:
org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:886)
org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuotaRecursively(FSImage.java:887)
org.apache.hadoop.hdfs.server.namenode.FSImage.updateCountForQuota(FSImage.java:875)
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:860)
org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:827)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:232)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:188)
org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$1.run(EditLogTailer.java:182)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:415)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
org.apache.hadoop.security.SecurityUtil.doAsUser(SecurityUtil.java:477)
org.apache.hadoop.security.SecurityUtil.doAsLoginUser(SecurityUtil.java:458)

A workaround is to increase the ZKFC connection timeout. The default is 45 seconds, IIRC; doubling that number should alleviate the problem.
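In case it helps, here is a rough sketch of that workaround. I'm assuming the relevant setting is the ZKFC health-monitor RPC timeout, ha.health-monitor.rpc-timeout.ms in core-site.xml (45000 ms by default in stock Hadoop); please verify the property name and default against your version's core-default.xml before changing anything.

# Check the value the ZKFC currently resolves from core-site.xml:
hdfs getconf -confKey ha.health-monitor.rpc-timeout.ms

# To double it, set ha.health-monitor.rpc-timeout.ms to 90000 in core-site.xml
# on the NameNode/ZKFC hosts and restart the ZKFC processes.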
05-09-2017
06:23 AM
1 Kudo
Based on the error message, it comes from org.apache.hadoop.ipc.Server#checkDataLength(). Fundamentally, this property changes the maximum length of a protobuf message (protobuf is a widely used data-exchange format), and there's a reason for the size limit. Excerpt from the protobuf documentation: https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/CodedInputStream#setSizeLimit-int-

public int setSizeLimit(int limit)
Set the maximum message size. In order to prevent malicious messages from exhausting memory or causing integer overflows, CodedInputStream limits how large a message may be. The default limit is 64MB. You should set this limit as small as you can without harming your app's functionality. Note that size limits only apply when reading from an InputStream, not when constructed around a raw byte array (nor with ByteString.newCodedInput()).

You could increase this limit, but there are other Hadoop limits you could also hit, for example the number of files in a directory. In summary, you should go back and check what went over the limit: it could be the number of files in a directory, the number of blocks on a DataNode, and so on. It is an indication that something went over the recommended range.
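If you want to see where you stand before deciding, here is a quick sketch. I'm assuming the property in question is ipc.maximum.data.length (67108864 bytes, i.e. 64 MB, by default), which backs the check in Server#checkDataLength(); the directory path below is only a placeholder.

# See what the NameNode currently allows for an IPC message:
hdfs getconf -confKey ipc.maximum.data.length

# Check whether a suspect directory has an enormous number of children
# (prints directory count, file count, and content size):
hdfs dfs -count /path/to/suspect/dir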
04-11-2017
02:34 AM
If restarting the NameNode doesn't help, see if you can bump the NameNode log level to DEBUG and post the NameNode log (or you can send it to me privately: weichiu at cloudera dot com).
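For reference, one way to bump the level without a restart is the daemonlog command (the change reverts when the NameNode restarts). The host:port is a placeholder for your NameNode's HTTP address, and the logger name is just an example; adjust both for your cluster.

# Raise the NameNode logger to DEBUG at runtime:
hadoop daemonlog -setlevel nn-host.example.com:50070 org.apache.hadoop.hdfs.server.namenode.NameNode DEBUG

# Confirm the change:
hadoop daemonlog -getlevel nn-host.example.com:50070 org.apache.hadoop.hdfs.server.namenode.NameNode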
04-11-2017
02:33 AM
Can you try restarting the NameNode and see if it helps? The symptom matches HDFS-10788: https://issues.apache.org/jira/browse/HDFS-10788. I initially thought HDFS-10788 was resolved by HDFS-9958, but apparently that's not the case.
04-06-2017
07:40 AM
Use the lsof command, and you should be able to see all the open files.
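For example, assuming you want to see the files a particular Hadoop daemon is holding open (the process pattern below is a placeholder; adjust it to whichever daemon you care about):

# List open files for the DataNode process:
lsof -p "$(pgrep -d, -f org.apache.hadoop.hdfs.server.datanode.DataNode)"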
04-06-2017
07:31 AM
Got it. The warning message "Inconsistent number of corrupt replicas" suggests you may have encountered the bug described in HDFS-9958 (BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed storages). HDFS-9958 is fixed in a number of CDH versions: CDH5.5.6, CDH5.7.4, CDH5.7.5, CDH5.7.6, CDH5.8.2, CDH5.8.3, CDH5.8.4, CDH5.9.0, CDH5.9.1, CDH5.10.0, CDH5.10.1. Unfortunately, given that you're already on CDH5.10.0, this appears to be a new bug with the same symptom. I can file an Apache Hadoop JIRA on your behalf for this bug report. The Cloudera Community forum is meant for troubleshooting; bug reports should go to Apache Hadoop so that more people can look into them.
04-06-2017
07:10 AM
The disk balancer (diskbalancer) is a new feature in CDH5.8, and by definition a new feature will not be backported to an older minor version.
04-04-2017
06:21 AM
Hi, it appears to be a bug, and I am interested in understanding it further. I did a quick search and it doesn't seem to have been reported previously on the Apache Hadoop JIRA. Would you be able to look at the active NameNode log and search for ArrayIndexOutOfBoundsException? The client-side log doesn't print the stack trace, so it's impossible to know where this exception was thrown. The NameNode log should contain the full stack trace, which will help find where it originated.
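For example, something like the following on the active NameNode host should pull out the full stack trace if it is there (the log path is just a placeholder; use wherever your NameNode log actually lives):

# Print the exception line plus the 30 lines that follow it:
grep -A 30 ArrayIndexOutOfBoundsException /path/to/namenode.log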
10-18-2016
11:08 AM
1 Kudo
I feel what you described has its own inherent risk. Since CDH5.8.2, you can use a new HDFS feature, the intra-DataNode disk balancer, to do exactly what you asked for. We also have a blog post about this feature: http://blog.cloudera.com/blog/2016/10/how-to-use-the-new-hdfs-intra-datanode-disk-balancer-in-apache-hadoop/
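Roughly, usage looks like the sketch below (the hostname is a placeholder, and the plan file path is whatever the -plan step prints; see the blog post for the exact workflow on your release). The feature must be enabled with dfs.disk.balancer.enabled=true in hdfs-site.xml.

# Generate a plan describing how to move blocks between this DataNode's disks:
hdfs diskbalancer -plan dn-host.example.com

# Execute the plan file produced by the previous step:
hdfs diskbalancer -execute <plan file printed by -plan>

# Check progress:
hdfs diskbalancer -query dn-host.example.com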
10-16-2016
10:13 AM
1 Kudo
Hi, I don't think that's possible, given that most applications depend on HDFS semantics (strong consistency, POSIX-like file system behavior), and S3 simply isn't designed as a file system (it's an eventually consistent blob store). Plus, you lose data locality. As far as I know, most cloud use cases still use HDFS as temporary, intermediate storage and S3 as permanent, eventual storage. There have been several studies on using HDFS as the metadata store with cloud storage as the data store, but that's a huge piece of work (see HDFS-9806) and probably lands in the Hadoop 4 / CDH 7 timeframe. Hope this helps.