How to fix "offline regions" in HBase

Super Collaborator

Our cluster recently had some issues related to network outages.

When all the dust settled, HBase eventually "healed" itself, and almost everything is back to working well, with a couple of exceptions.

In particular, we have one table where almost every query times out, which was never the case before. It's quite small compared to most of our other tables, at around 400 million rows.

(Clarification: we query via JDBC through Phoenix.)

When I look at the GUI tools (like http://<my server>:16010/master-status#storeStats), I see '1' under "Offline Regions" for that table (which has 33 regions in total). Almost all the other tables show '0'.

Can anyone help me troubleshoot this?

I know there is a CLI tool for fixing HBase issues. I'm wondering whether that "offline region" is the cause of these timeouts.

If not, how can I figure it out?

Thanks!


Super Guru

Typically, the following is sufficient to automatically repair offline/in-transition regions:

hbase hbck -repair

However, without logs it's impossible to say why the region is offline. The Master log may explain why this region is not getting assigned.
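
If it helps, something like the sketch below can pull the relevant assignment messages out of the Master log. The log path and table name are placeholders for a typical HDP layout, so adjust them for your install:

# Placeholder path for a typical HDP install - adjust to where your Master log lives.
MASTER_LOG=/var/log/hbase/hbase-hbase-master-$(hostname).log
# Placeholder: the table whose queries time out.
TABLE=MY_TABLE

# Show recent assignment trouble (regions in transition, FAILED_OPEN, etc.) for that table.
grep -iE "FAILED_OPEN|in transition|$TABLE" "$MASTER_LOG" | tail -n 50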

Super Collaborator

Thanks, Josh. I'm trying that now...

Contributor

Please run 'hbase hbck' (without -repair) first. The region might be legitimately offline (e.g. the parent region of a recently split region).
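
A read-only run is safe, since it only reports what it finds; in the hbck versions I've used, inconsistencies show up as ERROR lines, so a rough sketch is:

# Read-only consistency check (no -repair); hbck only reports, nothing is changed.
hbase hbck 2>&1 | tee /tmp/hbck.out

# Count and skim the reported inconsistencies (hbck prefixes them with ERROR).
grep -c "ERROR" /tmp/hbck.out
grep "ERROR" /tmp/hbck.out | head -n 20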

Super Collaborator

I ran the tool and it moved the '1' from 'offline regions' to 'failed regions'.

The output of hbck:

Exception in thread "main" java.io.IOException: 2 region(s) could not be checked or repaired.

The interesting piece of the HBase master log after a failed query looks like this:

2016-08-29 12:44:35,810 WARN  [AM.ZK.Worker-pool2-t121] master.RegionStates: Failed to open/close a97029c18889b3b3168d11f910ef04ae on XXX009.network,16020,1472143382923, set to FAILED_OPEN
2016-08-29 12:44:35,900 WARN  [AM.ZK.Worker-pool2-t106] master.RegionStates: Failed to open/close fad4e0e460099b5a0345b9ec354d0117 on XXX003.network,16020,1472143374416, set to FAILED_OPEN
2016-08-29 12:44:36,143 WARN  [AM.ZK.Worker-pool2-t115] master.RegionStates: Failed to open/close 5ace750e16bcddf3ab29814da9a4f641 on XXX002.network,16020,1472143382124, set to FAILED_OPEN
2016-08-29 12:44:36,889 WARN  [AM.ZK.Worker-pool2-t113] master.RegionStates: Failed to open/close a10e94e0a64a9b69a540603d6c9aee75 on XXX012.network,16020,1472143381417, set to FAILED_OPEN

Super Guru

You should be able to cross-reference the Region IDs with those region servers (nodes 9, 3, 2, and 12, respectively) to determine why they were left in a FAILED_OPEN state.
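
Roughly, on each of those region servers, something like this (the log path is a guess based on a typical HDP layout; the encoded region names are taken from the master warnings above):

# Run on XXX009, XXX003, XXX002 and XXX012; adjust the log path for your install.
RS_LOG=/var/log/hbase/hbase-hbase-regionserver-$(hostname).log

# Encoded region names copied from the master's FAILED_OPEN warnings.
for REGION in a97029c18889b3b3168d11f910ef04ae fad4e0e460099b5a0345b9ec354d0117 \
              5ace750e16bcddf3ab29814da9a4f641 a10e94e0a64a9b69a540603d6c9aee75; do
  grep -A 20 "$REGION" "$RS_LOG" | tail -n 40
done

The exceptions around the matching lines should tell you what the open attempt tripped over.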

Super Collaborator

This is what I see in the log of one of the region servers named in the master's warnings:

MY_BROKEN_TABLE/8a444fa1979524e97eb002ce8aa2d7aa/0/4f9a5c26ddb0413aa4eb64a869ab4a2c
at org.apache.hadoop.hdfs.BlockReaderFactory.requestFileDescriptors(BlockReaderFactory.java:591)
at org.apache.hadoop.hdfs.BlockReaderFactory.createShortCircuitReplicaInfo(BlockReaderFactory.java:490)
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.create(ShortCircuitCache.java:782)
at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache.fetchOrCreate(ShortCircuitCache.java:716)
at org.apache.hadoop.hdfs.BlockReaderFactory.getBlockReaderLocal(BlockReaderFactory.java:422)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:333)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:656)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:199)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1407)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1677)
at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1504)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:441)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.seekTo(HFileReaderV2.java:1249)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:267)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:169)
at org.apache.hadoop.hbase.regionserver.StoreScanner.seekScanners(StoreScanner.java:363)
at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:281)
at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:243)
at org.apache.hadoop.hbase.regionserver.compactions.Compactor.createScanner(Compactor.java:342)
at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:88)
at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:112)
at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1216)
at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1890)
at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.doCompaction(CompactSplitThread.java:525)
at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:562)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-08-29 05:36:42,959 INFO  [regionserver/XXX009.network/<ip address>:16020-shortCompactions-1472143410679] hdfs.DFSClient: Access token was invalid when connecting to /<ip address>:50010 : org.apache.hadoop.security.token.SecretManager$InvalidToken: access control error while attempting to set up short-circuit access to /apps/hbase/data/data/default/<DB NAme>.MY_BROKEN_TABLE/8a444fa1979524e97eb002ce8aa2d7aa/0/4f9a5c26ddb0413aa4eb64a869ab4a2c

Contributor

Could you share the version of HBase you are using?

Super Collaborator

HDP 2.4.2: HBase 1.1.2.2.4.2.0-258

Super Guru

The InvalidToken error is nothing to worry about: the HDFS client will fetch a new token automatically. It just leaves this scary-looking exception in the log.
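
If those regions are still stuck in FAILED_OPEN, one option (just a sketch, using the encoded region names from your master log) is to ask the master to retry the assignment from the HBase shell:

# Retry assignment of the regions the master log flagged as FAILED_OPEN.
for REGION in a97029c18889b3b3168d11f910ef04ae fad4e0e460099b5a0345b9ec354d0117 \
              5ace750e16bcddf3ab29814da9a4f641 a10e94e0a64a9b69a540603d6c9aee75; do
  echo "assign '$REGION'" | hbase shell
done

Then check the master UI again to see whether the regions come back online.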