
hbase hbck reports inconsistency immediately after adding hbase service

Explorer

Hi everyone!

 

So, a reeeally long story short (I can gladly expand upon request): I added the HBase service, ran hbase hbck immediately after this, and it already detected one inconsistency:

 

ERROR: Region { meta => hbase:namespace,,1485505125654.b972bf2653eaa96104d6034591386a60., 
hdfs => null, deployed => hadoop-34.xxxzzz.de,60020,1485505116059;hbase:namespace,,
1485505125654.b972bf2653eaa96104d6034591386a60., replicaId => 0 }
found in META, but not in HDFS, and deployed on hadoop-34.xxxzzz.de,60020,1485505116059

When I do hbase hbck -repairHoles, the inconsistency is gone, BUT... so is my hbase:namespace table.

 

hbase(main):001:0> scan 'hbase:namespace'
ROW                                    COLUMN+CELL 

ERROR: Unknown table hbase:namespace!

Interestingly enough, it is not gone from HDFS:

 

hdfs dfs -ls /hbase/data/hbase
Found 2 items
drwxr-xr-x   - hbase hbase          0 2017-01-27 09:18 /hbase/data/hbase/meta
drwxr-xr-x   - hbase hbase          0 2017-01-27 09:18 /hbase/data/hbase/namespace

...nor from ZooKeeper:

 

[zk: localhost:2181(CONNECTED) 2] ls /hbase/table
[hbase:meta, hbase:namespace]

...and an interesting side effect is that the create_namespace function of the hbase shell is now gone:

 

hbase(main):003:0> create_namespace 'ns1'

ERROR: Unknown table ns1!

 

I did find this ray of hope: HBASE-16294, and it is actually included in the latest CDH (I am running 5.9.0, btw).

 

But!

 

This seems to concern only replicas. This is the patch code btw:

 

if (hbi.getReplicaId() == HRegionInfo.DEFAULT_REPLICA_ID) {
 // Log warning only for default/ primary replica with no region dir
 LOG.warn("No HDFS region dir found: " + hbi + " meta=" + hbi.metaEntry);
}

 

I have replication disabled, and as one can see from the error message: 

replicaId => 0

 

Now, I would have let this slide, but the real problem is that over time I get a huge number of these inconsistencies, and attempts to fix them result in tables no longer being found from the hbase shell.

 

Any ideas would be greatly appreciated!

1 ACCEPTED SOLUTION

Master Collaborator

Crazy thought: does the node that you are running hbck on have the HDFS gateway role applied?

 

Could it be that hbck can't find the region in HDFS because it doesn't know how to connect to HDFS?

Another way to verify would be to check the HDFS location for the HBase tables:

 

/hbase/data/default/<table>



Explorer

Just a quick update: the issue is still present after upgrading to CDH 5.10.0, so... if you sort of had an idea, but were kind of shy or thought "naaaaah, he probably already thought of that", I strongly encourage you to step forward 🙂

Master Collaborator

Here's what I suspect happened; I'll call out suppositions as such.

When you start the HBase service, the master starts up first, finds the meta region in HDFS, waits for the regionserver list to settle, and then assigns meta.

 

Then it onlines all regions defined in meta (including hbase:namespace). In the case of an initial startup, it would make sense to me that it onlines the namespace region with the namespace configured and then flushes it to disk. (supposition)

 

If you run hbck after meta is online, but before the namespace has been flushed, it will report those regions as holes. This is because hbck can only do guesswork based on the current state of HDFS, the regionservers, and ZooKeeper.
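For illustration, you can inspect each of those sources of truth yourself (a rough sketch, assuming the default /hbase root dir and CDH's zookeeper-client wrapper):

echo "scan 'hbase:meta'" | hbase shell     # what META thinks exists
hdfs dfs -ls /hbase/data/hbase             # what actually exists on HDFS for the system tables
echo "ls /hbase/table" | zookeeper-client  # which tables ZooKeeper knows about

If the three views disagree at the moment hbck runs, it can report an inconsistency even though nothing is permanently broken.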

 

ALL hbck operations except plain hbck and hbck -fixAssignments are dangerous. And -fixAssignments isn't always perfect at fixing assignments, but unless another bug is encountered, it is not destructive.

What -repairHoles does is create an EMPTY region in the place of the region that is now gone. This is so that you can at least salvage what is left in the case of a disaster.
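For reference, the non-destructive invocations mentioned above look like this (a sketch; everything else rewrites state):

hbase hbck                    # read-only consistency report, performs no repairs
hbase hbck -details           # same report, but lists every region checked
hbase hbck -fixAssignments    # the one low-risk repair mentioned above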

 

It's possible that HBase then sees that the namespace region file exists and will not flush the namespace table. (supposition)

I'd suggest just removing and then re-adding the HBase service (and deleting the remnants in HDFS and ZooKeeper in between those two steps, if need be).

Explorer

Hi Ben,

 

Thanks for your response, much appreciated!

 

Actually, that is exactly what I did. I messed it up so badly that I had to delete the service. Everything I described above actually happened after I:

1. Stopped service

2. Deleted the service

3. hdfs dfs -rm -r /hbase

4. echo "rmr /hbase" | zookeeper-client

5. added the service again

 

At this time, inconsistencies are piling up (I have 34 of them), and the one described above, found in the namespace table, is still there.

 

Master Collaborator

Very interesting!

So hbck says it's in HBase META but not in HDFS? Perhaps there is an HDFS permissions issue for the hbase user? (The assumption being that HBase is able to start, but cannot write the data it needs to HDFS, yet somehow stays running in that weird state.)
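A quick way to rule that out would be something like this (a sketch, assuming the default /hbase root directory):

hdfs dfs -ls /                    # /hbase should be owned by hbase:hbase
hdfs dfs -ls /hbase               # and so should everything underneath it
hdfs dfs -ls /hbase/data/hbase    # including the meta and namespace table dirs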

 

 

Explorer

Hm. I'll dig into the permissions issue. However, I doubt that this is the reason behind this weirdness, because not only is the HMaster alive, but on the surface it appears to be functioning normally. I created a table and filled it with 2 million rows, then ran hbase hbck. It reported 37 inconsistencies of the same type, hbase:namespace still among them.

 

EDIT:

 

Additional info: when I scan 'hbase:namespace'

 

I get:

ROW                                    COLUMN+CELL                                                                                                   
 default                               column=info:d, timestamp=1486140313224, value=\x0A\x07default                                                 
 hbase                                 column=info:d, timestamp=1486140313283, value=\x0A\x05hbase                                                   
2 row(s) in 0.3760 seconds

...as I should.

 

More additional info (I don't know if it's relevant): before the inconsistency error, I get a "No HDFS region dir found" warning in the log. It looks like this:

 

No HDFS region dir found: { meta => hbase:namespace,,1486140310904.86d0405303ed58995e1507e33cbf66a2., 
hdfs => null, deployed => hadoop-38.xxxx.xxxxxxxxxxx.de,60020,1486140300341;hbase:namespace,,1486140310904.86d0405303ed58995e1507e33cbf66a2.,
replicaId => 0 } meta={ENCODED => 86d0405303ed58995e1507e33cbf66a2, NAME => 'hbase:namespace,,1486140310904.86d0405303ed58995e1507e33cbf66a2.',
STARTKEY => '', ENDKEY => ''}

It says basically the same thing as the error above, just with the additional hint of "No HDFS region dir found", and it's marked as a warning. The deployed part also contains deployment info that I found in the /hbase/WALs folder, namely:

 

hdfs dfs -ls /hbase/WALs
Found 16 items
...
drwxr-xr-x   - hbase hbase          0 2017-02-06 11:11 /hbase/WALs/hadoop-38.xxxx.xxxxxxxxxxx.de,60020,1486140300341
...

My next desperate idea is to try to read whatever is in /hbase/data/hbase/namespace/86d0405303ed58995e1507e33cbf66a2/.regioninfo (following the "No HDFS region dir found" hint) as soon as I find some command-line protobuf reader.
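In the meantime, since the file is a protobuf and therefore mostly binary, one low-tech way to peek at it is to just pull out the readable strings (a rough sketch, no proper decoder involved):

hdfs dfs -cat /hbase/data/hbase/namespace/86d0405303ed58995e1507e33cbf66a2/.regioninfo | strings

That should at least surface the readable table and region names embedded in the file.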

 

Again, thanks for taking the time to look into this, and as always ANY feedback is much appreciated!

 

Regards!

 

Master Collaborator

Crazy thought: does the node that you are running hbck on have the HDFS gateway role applied?

 

Could it be that hbck can't find the region in HDFS because it doesn't know how to connect to HDFS?

Another way to verify would be to check the HDFS location for the HBase tables:

 

/hbase/data/default/<table>
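Something along these lines, with <table> and <encoded-region-name> as placeholders (a sketch):

hdfs dfs -ls /hbase/data/default/<table>                         # one directory per region
hdfs dfs -ls /hbase/data/default/<table>/<encoded-region-name>   # .regioninfo plus the column family dirs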

Explorer

So, the crazy thought you had resolved the days-long mystery!

 

An expanded explanation for any poor soul who might encounter a similar issue:

 

In our setup, we have three machines reserved for master roles, m[1-3], and 40 worker machines, w[1-40]. I assigned two HBase Masters, the HBase Thrift Server and the HBase REST Server to m2 and m3, and Region Servers to w[20-40].

 

I ran hbck from m1, which has no HBase roles on it. Normally, this would fail with a ConnectException, as it does from all the other machines that don't have HBase roles on them (w[1-19]), because they don't have hbase-site.xml and don't know where to look for ZooKeeper, so they fall back to the default: localhost. However, as m1 is also the ZooKeeper leader, localhost is actually an OK default and hbase hbck will work. Sort of. Because m1 doesn't have any HDFS roles either.

 

So, m1 was LITERALLY the only machine in the cluster that would report these inconsistencies. From every other machine, the check would either fail with a connection exception or report everything as normal.
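For anyone who wants to check a node before trusting its hbck output, a rough sketch (assuming the standard CDH client config locations):

ls -l /etc/hbase/conf/hbase-site.xml /etc/hadoop/conf/hdfs-site.xml   # are the client configs there at all?
grep -A1 hbase.zookeeper.quorum /etc/hbase/conf/hbase-site.xml        # which ZooKeeper quorum will hbck use?

If those files are missing, hbck falls back to defaults (localhost for ZooKeeper), which is exactly the trap described above.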

 

Many thanks for your time; it saved a bunch of mine (although I also already lost a lot of it :)).

Contributor

In addition to the previous solution, some best practices:

 

- hbck is basically just an HBase client command

- client commands should be run from nodes which have the relevant service's client configurations deployed on them. This can be done manually (not recommended, see below for why) or via Cloudera Manager

 

Accordingly, whichever node you are running hbck from should have the HBase client configs deployed, to make sure that it actually uses the cluster's current configs (which contain several settings, like the heap size for client commands, the ZooKeeper ensemble hostnames, etc.).
To have this done, it's recommended to deploy an HBase GATEWAY role[1], which does exactly that: it deploys the active configs of the HBase service via Cloudera Manager. Additionally, if any HBase client config changes are made later via Cloudera Manager, those will also be propagated automatically, the same way any config changes are propagated to every node which has HBase role instances installed.
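A quick way to verify that the client configuration has actually landed on a node (a sketch, assuming a CDH parcel/package install where /etc/hbase/conf is managed through alternatives; the exact alternative name may differ):

ls -l /etc/hbase/conf                  # should point at the Cloudera Manager deployed config
alternatives --display hbase-conf      # shows which config directory is currently active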

 

There is some further reference material about using hbck here[2], as this is an advanced topic.

 

[1] - Gateway roles CDH latest version - https://www.cloudera.com/documentation/enterprise/latest/topics/cm_mc_managing_roles.html#managing_r...

[2] - Checking and Repairing HBase tables CDH5.15.x - https://www.cloudera.com/documentation/enterprise/5-15-x/topics/admin_hbase_hbck.html

(please note that in CDH 6.0.0 several of hbck's options are deprecated)

Rising Star

Hello,

 

Thanks for your response. Wherever I am checking hbck, that particular server has the HBase Master role, and from the Gateway node I can also see the same error.

 

Any other suggestions would be appreciated.

 

Thanks.