Created 01-05-2016 01:17 PM
Hi Guys,
I have been testing out Phoenix Local Indexes and I'm facing an issue after restarting the entire HBase cluster.
Scenario: I'm using Ambari 2.1.2 and HDP 2.3, with Phoenix 4.4 and HBase 1.1.1. My test cluster contains 10 machines, and the main table contains 300 pre-split regions, which implies 300 regions on the local index table as well. To configure Phoenix I'm following this tutorial.
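To give an idea of the setup, the DDL is along these lines (a simplified sketch: the column names, index name, and split points below are placeholders, and the real table is pre-split into 300 regions; only the table name BIDDING_EVENTS appears in the logs further down):

-- Pre-split main table (placeholder columns and split points)
CREATE TABLE BIDDING_EVENTS (
    EVENT_ID    VARCHAR NOT NULL PRIMARY KEY,
    CAMPAIGN_ID VARCHAR,
    EVENT_TS    TIMESTAMP
) SPLIT ON ('2aaaaaaa', '55555555', '80000000');

-- The local index is stored in _LOCAL_IDX_BIDDING_EVENTS,
-- which shares region boundaries with the data table
CREATE LOCAL INDEX BIDDING_EVENTS_IDX ON BIDDING_EVENTS (CAMPAIGN_ID);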
When I start a fresh cluster everything is just fine: the local index is created and I can insert data and query it using the index. The problem comes when I need to restart the cluster to update some configurations; at that point I'm not able to bring the cluster back up anymore. Most of the servers log exceptions like the one below, which suggests they are getting into a state where some region servers are waiting on regions that are not yet available on other region servers (kind of a deadlock):
INFO [htable-pool7-t1] client.AsyncProcess: #5, table=_LOCAL_IDX_BIDDING_EVENTS, attempt=27/350 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region _LOCAL_IDX_BIDDING_EVENTS,57e4b17e4b17e4ac,1451943466164.253bdee3695b566545329fa3ac86d05e. is not online on ip-10-5-4-24.ec2.internal,16020,1451996088952
    at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2898)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:947)
    at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1991)
    at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
    at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
    at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
    at java.lang.Thread.run(Thread.java:745)
 on ip-10-5-4-24.ec2.internal,16020,1451942002174, tracking started null, retrying after=20001ms, replay=1ops
INFO [ip-10-5-4-26.ec2.internal,16020,1451996087089-recovery-writer--pool5-t1] client.AsyncProcess: #3, waiting for 2 actions to finish
INFO [ip-10-5-4-26.ec2.internal,16020,1451996087089-recovery-writer--pool5-t2] client.AsyncProcess: #4, waiting for 2 actions to finish
While a server is hitting these exceptions I can see this message (I checked the size of the recovered.edits file mentioned and it is very small):
Description: Replaying edits from hdfs://.../recovered.edits/0000000000000464197
Status: Running pre-WAL-restore hook in coprocessors (since 48mins, 45sec ago)
Another interesting thing I noticed is that the coprocessor list is empty for the servers that are stuck.
On the other hand, the HBase master goes down after logging some messages like this one:
GeneralBulkAssigner: Failed bulking assigning N regions
Any help would be awesome 🙂
Thank you
Pedro
Created 01-06-2016 07:48 AM
This problem occurs when the meta regions are not assigned yet and the preScannerOpen coprocessor hook waits to read the meta table for local indexes, which causes the open-region threads to wait forever because of the deadlock.
You can work around this by increasing the number of threads used to open regions, so that the meta regions can still be assigned even while the threads for the local index table are waiting, which removes the deadlock:
<property>
  <name>hbase.regionserver.executor.openregion.threads</name>
  <value>100</value>
</property>
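This is a region-server-side property, so it goes into hbase-site.xml on the region servers (in an Ambari-managed cluster that would typically mean adding it under HBase > Configs > Custom hbase-site) and the region servers need to be restarted for it to take effect.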
Created 01-06-2016 02:39 PM
@Artem Ervits, not right now, as we don't recommend using local indexes in production yet. Local indexes will probably be ready for production in the next HDP release (but I'm not sure), and the connection made during preScannerOpen (which accesses the meta/namespace tables) will be moved to a different place to avoid the problem described above.