Member since: 08-01-2014
Posts: 16
Kudos Received: 0
Solutions: 1

My Accepted Solutions

Title | Views | Posted
---|---|---
 | 3154 | 02-15-2017 11:57 AM
05-15-2019
07:17 AM
Before the failed-open message is this block (only including the first line of the Java stack trace):

2019-05-14 15:55:53,356 INFO org.apache.hadoop.hbase.regionserver.HRegion: Replaying edits from hdfs://athos/hbase/data/default/deveng_v500/ab693aebe203bc8781f1a9f1c0a1d045/recovered.edits/0000000000094270192
2019-05-14 15:55:53,383 INFO org.apache.hadoop.hbase.regionserver.HRegion: Replaying edits from hdfs://athos/hbase/data/default/deveng_v500/ab693aebe203bc8781f1a9f1c0a1d045/recovered.edits/0000000000094270299
2019-05-14 15:55:53,722 INFO org.apache.hadoop.hbase.regionserver.HRegion: Replaying edits from hdfs://athos/hbase/data/default/deveng_v500/ab693aebe203bc8781f1a9f1c0a1d045/recovered.edits/0000000000094270330
2019-05-14 15:55:53,903 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Auth successful for tomcat (auth:SIMPLE)
2019-05-14 15:55:53,904 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Connection from 10.190.158.151 port: 60648 with unknown version info
2019-05-14 15:55:54,614 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=deveng_v500,\x00\x00\x1C\xAB\x92\xBC\xD8\x02,1544486155414.ab693aebe203bc8781f1a9f1c0a1d045., starting to roll back the global memstore size.
java.lang.IllegalArgumentException: offset (8) + length (2) exceed the capacity of the array: 0
at org.apache.hadoop.hbase.util.Bytes.explainWrongLengthOrOffset(Bytes.java:631)
.............
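In case anyone hits the same thing, what I'm leaning toward trying next is sidelining the recovered.edits file the replay is choking on so the region can open without it. A rough sketch only: the sideline directory is just a placeholder, the file named right before the error may not be the one that's actually corrupt, and anything moved aside will not be replayed.

# List the recovered.edits files for the region; the log above was replaying these when it failed
hdfs dfs -ls hdfs://athos/hbase/data/default/deveng_v500/ab693aebe203bc8781f1a9f1c0a1d045/recovered.edits/

# Move the suspect edits file aside (placeholder destination), then retry the assignment
hdfs dfs -mkdir -p hdfs://athos/hbase/sideline/recovered.edits
hdfs dfs -mv hdfs://athos/hbase/data/default/deveng_v500/ab693aebe203bc8781f1a9f1c0a1d045/recovered.edits/0000000000094270330 hdfs://athos/hbase/sideline/recovered.edits/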
05-14-2019
07:34 AM
2019-05-14 09:18:26,042 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=deveng_v500,\x00\x00\x1C\xAB\x92\xBC\xD8\x02,1544486155414.ab693aebe203bc8781f1a9f1c0a1d045., starting to roll back the global memstore size.
2019-05-14 09:18:26,043 INFO org.apache.hadoop.hbase.coordination.ZkOpenRegionCoordination: Opening of region {ENCODED => ab693aebe203bc8781f1a9f1c0a1d045, NAME => 'deveng_v500,\x00\x00\x1C\xAB\x92\xBC\xD8\x02,1544486155414.ab693aebe203bc8781f1a9f1c0a1d045.', STARTKEY => '\x00\x00\x1C\xAB\x92\xBC\xD8\x02', ENDKEY => '\x00\x00L\xC6\xAD\xD1\x04'} failed, transitioning from OPENING to FAILED_OPEN in ZK, expecting version 40
2019-05-14 09:18:31,562 ERROR org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of region=deveng_v500,\x00\x00\x1C\xAB\x92\xBC\xD8\x02,1544486155414.ab693aebe203bc8781f1a9f1c0a1d045., starting to roll back the global memstore size.
2019-05-14 09:18:31,562 INFO org.apache.hadoop.hbase.coordination.ZkOpenRegionCoordination: Opening of region {ENCODED => ab693aebe203bc8781f1a9f1c0a1d045, NAME => 'deveng_v500,\x00\x00\x1C\xAB\x92\xBC\xD8\x02,1544486155414.ab693aebe203bc8781f1a9f1c0a1d045.', STARTKEY => '\x00\x00\x1C\xAB\x92\xBC\xD8\x02', ENDKEY => '\x00\x00L\xC6\xAD\xD1\x04'} failed, transitioning from OPENING to FAILED_OPEN in ZK, expecting version 58

The region is stuck trying to open on different region servers. I've cycled the nodes to force it to attempt to come online elsewhere, since the move command doesn't do anything, but no luck. fsck is clean, but hbck with -fixAssignments can't bring the region online:

19/05/14 09:20:22 WARN util.HBaseFsck: Skip region 'deveng_v500,\x00\x00\x1C\xAB\x92\xBC\xD8\x02,1544486155414.ab693aebe203bc8781f1a9f1c0a1d045.'
19/05/14 09:20:22 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
19/05/14 09:20:22 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x16a31adec7becec
19/05/14 09:20:22 INFO zookeeper.ZooKeeper: Session: 0x16a31adec7becec closed
19/05/14 09:20:22 INFO zookeeper.ClientCnxn: EventThread shut down
Exception in thread "main" java.io.IOException: 1 region(s) could not be checked or repaired. See logs for detail.
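For completeness, this is the sort of manual assignment I've been attempting from the hbase shell. Just a sketch: the encoded region name comes from the log above, and since the open itself keeps failing, assign may simply land back in FAILED_OPEN.

hbase shell
# inside the shell: force-close any stuck transition for the region, then ask the master to assign it again
unassign 'ab693aebe203bc8781f1a9f1c0a1d045', true
assign 'ab693aebe203bc8781f1a9f1c0a1d045'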
Labels:
- Apache HBase
02-15-2017
01:01 PM
@saranvisa:
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_fixed_in_55.html#fixed_issues_555
https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_vd_cdh_download_55.html#cdh_555
02-15-2017
11:57 AM
Turned out the nodes were in the excludes files, just not in the host.exclude file like we use in CDH5, so it was missed.
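For anyone else chasing this, the quickest way to confirm is to grep the affected hostname/IP across every exclude file the namenode config points at. A sketch only: the config path and exclude file locations below are placeholders, so check what dfs.hosts.exclude is set to on your cluster.

# See which exclude file(s) the namenode is actually using
grep -A1 dfs.hosts.exclude /etc/hadoop/conf/hdfs-site.xml

# Then check whether the affected node is listed in any of them
grep -i '<hostname-or-ip>' /path/to/excludes /path/to/host.exclude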
02-15-2017
11:38 AM
We upgraded our clusters from 5.5.2 to 5.5.5 a while ago. We've since identified a few nodes where the alternatives are still referencing the 5.5.2 parcel.

root@use542ytb9:~ ( use542ytb9 )
13:15:15 $ which hbase
/usr/bin/which: no hbase in (/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/sbin:/usr/sbin:/usr/local/sbin:/root/bin)
root@use542ytb9:~ ( use542ytb9 )
13:15:18 $ ls /usr/bin/hbase
/usr/bin/hbase
root@use542ytb9:~ ( use542ytb9 )
13:15:24 $ ll /usr/bin/hbase
lrwxrwxrwx 1 root root 23 May 16 2016 /usr/bin/hbase -> /etc/alternatives/hbase
root@use542ytb9:~ ( use542ytb9 )
13:15:28 $ ll /etc/alternatives/hbase
lrwxrwxrwx 1 root root 63 May 16 2016 /etc/alternatives/hbase -> /opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p1426.1277/bin/hbase
root@use542ytb9:~ ( use542ytb9 )
13:15:30 $ ls /opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p1426.1277/bin/hbase
ls: cannot access /opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p1426.1277/bin/hbase: No such file or directory
root@use542ytb9:~ ( use542ytb9 )

We've cycled the CM agent, done full decommissions and recommissions, rebooted the nodes, and deployed client config. Since we've identified 3 nodes, we're assuming there are others as well. The hadoop services still run on these nodes, but we're unable to run hdfs, hbase, or yarn commands, which has also caused several mapreduce jobs to fail. Is there a good way to repoint these alternatives to the new parcel?
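In case it helps frame an answer, this is the manual repoint I'd expect to work per node, though I'd rather have CM or the agent handle it properly. A sketch only: the CDH symlink should resolve to the active 5.5.5 parcel, and the priority value is arbitrary.

# What does the alternative currently resolve to?
alternatives --display hbase

# Register the active parcel's binary and point the link at it
alternatives --install /usr/bin/hbase hbase /opt/cloudera/parcels/CDH/bin/hbase 10
alternatives --set hbase /opt/cloudera/parcels/CDH/bin/hbase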
Labels:
- HDFS
02-15-2017
07:43 AM
Also, just to go over what we've attempted: we've cycled the datanode (or at least attempted to), rebooted the node, and, since we found HDFS-1106 where someone had the same issue, done a refresh, but we still can't get it to start.
02-15-2017
07:34 AM
On HDFS 0.20.2 (yes, it's old), 2 datanodes in our prod cluster can no longer start up. The namenode says:

2017-02-15 09:24:52,861 FATAL org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getDatanode: Data node cernsrchhadoop504.cernerasp.com:50010 is attempting to report storage ID DS-1574636665-44.128.6.253-50010-1461251397876. Node 44.128.6.253:50010 is expected to serve this storage.
2017-02-15 09:24:52,862 INFO org.apache.hadoop.ipc.Server: IPC Server handler 58 on 9000, call register(DatanodeRegistration(cernsrchhadoop504.cernerasp.com:50010, storageID=DS-1574636665-44.128.6.253-50010-1461251397876, infoPort=50075, ipcPort=50020)) from 44.128.6.253:51326: error: org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node cernsrchhadoop504.cernerasp.com:50010 is attempting to report storage ID DS-1574636665-44.128.6.253-50010-1461251397876. Node 44.128.6.253:50010 is expected to serve this storage.
org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node cernsrchhadoop504.cernerasp.com:50010 is attempting to report storage ID DS-1574636665-44.128.6.253-50010-1461251397876. Node 44.128.6.253:50010 is expected to serve this storage.

The kicker, though, is that it's saying datanode cernsrchhadoop504 can't serve that storage because it's expected to be served by 44.128.6.253, which is actually cernsrchhadoop504. From the namenode:

root@cernsrchhadoop388.cernerasp.com:~ ( cernsrchhadoop388.cernerasp.com )
09:28:10 $ nslookup 44.128.6.253
Server: 127.0.0.1
Address: 127.0.0.1#53
Non-authoritative answer:
253.6.128.44.in-addr.arpa name = cernsrchhadoop504.cernerasp.com.

The datanode logs on 504 are saying something similar:

2017-02-15 09:24:52,866 ERROR datanode.DataNode (DataNode.java:main(1372)) - org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.UnregisteredDatanodeException: Data node cernsrchhadoop504.cernerasp.com:50010 is attempting to report storage ID DS-1574636665-44.128.6.253-50010-1461251397876. Node 44.128.6.253:50010 is expected to serve this storage.

So for the question: how can I get the namenode to realize that the node it expects to have that storage is the same node that's attempting to serve it?
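To show what I've checked so far on the hostname/storage-ID side, a sketch only; the datanode data directory below is a placeholder, so use whatever dfs.data.dir is set to.

# Forward and reverse DNS for the datanode, run from the namenode
nslookup cernsrchhadoop504.cernerasp.com
nslookup 44.128.6.253

# On the datanode, the storage ID it reports comes from the VERSION file in its data dir
grep storageID /data/dfs/dn/current/VERSION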
Labels:
- HDFS
07-26-2016
08:03 AM
Our reports manager is currently using 19G of 24G on /var, all in /var/lib. More specifically, in /var/lib/cloudera-scm-headlamp/cloudera-scm-headlamp. We have 9 <cluster>-hdfs directories here which are taking up most of the space; the contents are mostly fsimage.tmp. Some of these clusters no longer exist, and we have other live clusters that aren't here. The timestamps are from Mar 22.

Why is the reports manager keeping fsimage.tmp files, what are some of the configs we can use to manage these, and what's the main goal (meaning, if it's meant as a backup strategy, why a one-time copy instead of a continual one)?
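For reference, this is how we're measuring which of those directories are actually eating the space and how stale they are; just du/ls against the path above.

# Size of each <cluster>-hdfs directory under the headlamp data dir
du -sh /var/lib/cloudera-scm-headlamp/cloudera-scm-headlamp/*-hdfs

# Last-modified times, to spot directories left over from clusters that no longer exist
ls -lt /var/lib/cloudera-scm-headlamp/cloudera-scm-headlamp/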
Labels:
- HDFS
07-08-2016
07:33 AM
We have a CM instance that's currently administering 1,340 nodes. From a prior discussion with another team at Cloudera, it came out that CM is only meant to administer 1,000 nodes in its current form. Because of that, we're looking at splitting our clusters out to another CM instance, but that's the long-term plan.

For the short term, CM has become very sluggish. A few examples: if you go to the hosts page in CM to show all hosts, Chrome will say the tab has become unresponsive about 2/3 of the time because the tab almost never loads; we're unable to go back much more than 2 pages in the command history; and whenever you issue a command, such as cycling a service, we typically see a 2-4 second delay.

We've already done some tuning on our instances. We started with 8 GB memory and 2 cores and have sized up to 30 GB memory and 6 cores. We also initially had 4 VMs, with all VMs running multiple services; we added 2 more VMs so we could isolate the Host Monitor and Service Monitor, since these two seemed to be the most heavily used services. We're currently sitting at 18 GB Xms and Xmx on our main CM Server service, with a 3 GB Xmn. Our other services typically sit between 8-12 GB heap.

What other tuning options are recommended to improve performance?
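For reference, this is where we're carrying the CM Server heap settings today. A sketch: only the heap sizes are shown, and the rest of the default CMF_JAVA_OPTS line is left as shipped.

# /etc/default/cloudera-scm-server on the CM Server host
export CMF_JAVA_OPTS="-Xms18G -Xmx18G -Xmn3G"

# Restart for the new heap to take effect
service cloudera-scm-server restart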
Labels:
- Cloudera Manager
03-03-2016
12:52 PM
Yep, that did it. Didn't realize the IDs were event IDs and not host IDs. Used attributes.HOST_IDS and was able to pull back the information for the host. With this output, I can sort and build alerting off of it. Thank you.
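For anyone landing here later, this is roughly the call I ended up with. A sketch only: host, port, credentials, API version, and the host ID value are all placeholders.

# Pull events for a specific host via the CM API, filtering on the HOST_IDS attribute
curl -s -u admin:admin \
  'http://cm-host.example.com:7180/api/v11/events?query=attributes.HOST_IDS==<host-id>'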