Created 03-01-2017 03:01 PM
Hello Hortonworks Community,
I'm having some issues with two of the NodeManagers on a 4-node cluster running HDP 2.5 on CentOS 7. I noticed that only 2 of the 4 NodeManagers were started, so my first attempt was to start the other two from the Ambari front end. After that, Ambari still reported 2/4 started. Next, I removed the two NodeManagers that did not start and reinstalled them. This did not work either. The log (/var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-<FQDN>.log) shows this as the reason for the failed start:
2017-03-01 09:51:05,115 INFO service.AbstractService (AbstractService.java:noteFailure(272)) - Service NodeManager failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:178)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:220)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:546)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:594)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
    at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
    at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
    at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:966)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:953)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:200)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    ... 5 more
2017-03-01 09:51:05,116 FATAL nodemanager.NodeManager (NodeManager.java:initAndStartNodeManager(549)) - Error starting NodeManager
org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartRecoveryStore(NodeManager.java:178)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:220)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:546)
    at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:594)
Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: checksum mismatch
    at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
    at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
    at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.openDatabase(NMLeveldbStateStoreService.java:966)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMLeveldbStateStoreService.initStorage(NMLeveldbStateStoreService.java:953)
    at org.apache.hadoop.yarn.server.nodemanager.recovery.NMStateStoreService.serviceInit(NMStateStoreService.java:200)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
    ... 5 more
2017-03-01 09:51:05,120 INFO nodemanager.NodeManager (LogAdapter.java:info(45)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NodeManager at <FQDN>/<IP>
************************************************************/
Does anyone have any ideas on how to resolve this problem?
Thanks,
-Jose
Created 03-02-2017 03:05 PM
Hi Jose,
From the error log, it looks like the failure is caused by a checksum mismatch. Please refer to the links below. I hope they help.
2. https://issues.apache.org/jira/browse/HDFS-6804
Thanks,
Mahesh
Created 03-02-2017 08:35 PM
Thank you for the answer; however, neither link is about LevelDB, which the NodeManager uses.
Do you have any idea how to reinitialize LevelDB? I tried to find instructions, but I couldn't find a good article.
Created 03-04-2017 09:51 AM
Hi Jose,
Maybe an SST file got corrupted. Can you try removing the /var/log/hadoop-yarn/nodemanager/recovery-state folder on the failed NodeManagers and check whether they start?
These files stay on the system even if you decommission the nodes.
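For anyone who wants to script the step in the reply above, here is a minimal Python sketch. It renames the recovery directory instead of deleting it, which is my own substitution so the old LevelDB files can still be inspected later. The function name move_aside and the ".corrupt-<timestamp>" suffix are made up for illustration. It assumes the NodeManager on that host is already stopped and that the path matches the one quoted in the reply; on other clusters the location is whatever yarn.nodemanager.recovery.dir points to.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def move_aside(recovery_dir: str) -> Optional[Path]:
    """Rename the recovery directory so the NodeManager starts with an empty store.

    Returns the backup path, or None if the directory does not exist.
    Hypothetical helper; not part of Hadoop or Ambari.
    """
    src = Path(recovery_dir)
    if not src.is_dir():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = src.with_name(f"{src.name}.corrupt-{stamp}")
    if backup.exists():
        # Refuse to merge into an earlier backup made in the same second.
        raise FileExistsError(f"backup already exists: {backup}")
    shutil.move(str(src), str(backup))
    return backup


# Example call (run on each failed node, with the NodeManager stopped):
# move_aside("/var/log/hadoop-yarn/nodemanager/recovery-state")
```

After the rename, start the NodeManager again from Ambari. Once the node is confirmed healthy, the backup directory can be deleted by hand.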
Created 03-06-2017 03:00 PM
Hey Juan,
Thanks for this answer. This actually did fix the nodemanager situation.