Created 01-05-2016 01:17 PM
Hi Guys,
I have been testing out Phoenix local indexes and I'm facing an issue after restarting the entire HBase cluster.
Scenario: I'm using Ambari 2.1.2 and HDP 2.3, with Phoenix 4.4 and HBase 1.1.1. My test cluster contains 10 machines, and the main table contains 300 pre-split regions, which implies 300 regions on the local index table as well. To configure Phoenix I'm following this tutorial.
When I start a fresh cluster everything is fine: the local index is created, and I can insert data and query it using the index. The problem comes when I need to restart the cluster to update some configurations; at that point I'm not able to restart the cluster anymore. Most of the servers log exceptions like the one below, which suggests they get into a state where some region servers are waiting for regions that are not yet available on other region servers (kind of a deadlock).
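To give an idea of the setup, here is roughly what the table and index look like. This is a simplified sketch: apart from the table name BIDDING_EVENTS, which appears in the logs below, all column names and split points are placeholders.

-- Hypothetical schema; only the table name BIDDING_EVENTS comes from the logs below.
CREATE TABLE BIDDING_EVENTS (
    EVENT_ID    VARCHAR NOT NULL PRIMARY KEY,
    CAMPAIGN_ID VARCHAR,
    EVENT_TIME  TIMESTAMP
) SPLIT ON ('a', 'b', 'c');  -- placeholder for the 300 pre-split points

-- Phoenix 4.4 stores the local index in a separate table, _LOCAL_IDX_BIDDING_EVENTS,
-- whose regions are co-located with the data table's regions (hence 300 index regions).
CREATE LOCAL INDEX BIDDING_EVENTS_IDX ON BIDDING_EVENTS (CAMPAIGN_ID);

-- Queries filtering on the indexed column can be served by the local index.
SELECT EVENT_ID FROM BIDDING_EVENTS WHERE CAMPAIGN_ID = 'c-123';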
INFO [htable-pool7-t1] client.AsyncProcess: #5, table=_LOCAL_IDX_BIDDING_EVENTS, attempt=27/350 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region _LOCAL_IDX_BIDDING_EVENTS,57e4b17e4b17e4ac,1451943466164.253bdee3695b566545329fa3ac86d05e. is not online on ip-10-5-4-24.ec2.internal,16020,1451996088952
  at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2898)
  at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:947)
  at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1991)
  at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
  at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
  at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
  at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
  at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
  at java.lang.Thread.run(Thread.java:745)
  on ip-10-5-4-24.ec2.internal,16020,1451942002174, tracking started null, retrying after=20001ms, replay=1ops
INFO [ip-10-5-4-26.ec2.internal,16020,1451996087089-recovery-writer--pool5-t1] client.AsyncProcess: #3, waiting for 2 actions to finish
INFO [ip-10-5-4-26.ec2.internal,16020,1451996087089-recovery-writer--pool5-t2] client.AsyncProcess: #4, waiting for 2 actions to finish
When a server is stuck with these exceptions I can see this message (I checked the size of the recovered.edits file and it is very small):
Description: Replaying edits from hdfs://.../recovered.edits/0000000000000464197
Status: Running pre-WAL-restore hook in coprocessors (since 48mins, 45sec ago)
Another interesting thing I noticed is that the coprocessor list is empty for the servers that are stuck.
On the other hand, the HBase master goes down after logging some of these messages:
GeneralBulkAssigner: Failed bulking assigning N regions
Any help would be awesome 🙂
Thank you
Pedro
Created 01-05-2016 04:36 PM
@Pedro Gandola do you have HBase Master High Availability on? We recommend running at least two masters at the same time. Also, we recommend using Ambari rolling restart rather than a stop-the-world restart of the whole cluster. With HA enabled, you can have one HBase master down and still maintain availability. You can also restart region servers one at a time, or set a time trigger for rolling RS restarts. The days of stopping everything to change a configuration in hbase-site are long gone; you don't need to stop the whole cluster.
Created 01-05-2016 05:11 PM
Hi @Artem Ervits, Thanks for the info.
I was using the HA master for testing. Regarding the full restart, you are right. After any configuration change, Ambari asks for a restart of all "affected" components, and I clicked the button :). Is Ambari doing a proper rolling restart in this case? I know that it does when we click "Restart All Region Servers". I have done full restarts with Ambari before, but this problem only started after I introduced local indexes. I need to dig a bit more into this problem.
Thanks
Created 01-05-2016 05:17 PM
@Pedro Gandola the local indexes are in tech preview, and as with all TP features there is no support from HWX until it's production ready. If you do find a solution, please post it here for the benefit of the community.
Created 01-05-2016 05:21 PM
@Artem Ervits, Sure! Thanks
Created 01-05-2016 05:27 PM
Ambari will restart everything that has stale configs. To get the best of both worlds (restarting components with stale configs while keeping the cluster up), go through each host and restart the components with stale configs per node, rather than per cluster as you were doing.
Created 01-05-2016 04:41 PM
Additionally, did you see the warning about local indexes in Phoenix being a technical preview?
The local indexing feature is a technical preview and considered under development. Do not use this feature in your production systems. If you have questions regarding this feature, contact Support by logging a case on our Hortonworks Support Portal.
Created 01-06-2016 07:48 AM
This problem occurs when the meta regions are not yet assigned and the preScannerOpen coprocessor hook blocks waiting to read the meta table for local indexes, which leaves the open-region threads waiting forever, hence the deadlock.
You can solve this by increasing the number of threads available for opening regions, so that the meta regions can still be assigned even while the threads opening local index regions are waiting; this removes the deadlock.
<property>
  <name>hbase.regionserver.executor.openregion.threads</name>
  <value>100</value>
</property>
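This goes in hbase-site.xml on the region servers (in Ambari, via the custom hbase-site section under HBase > Configs), followed by a region server restart. As a sanity check, you can dump a region server's live configuration from its info server and grep for the property; a rough sketch, assuming the hostname from the logs above and the default info port of 16030:

curl -s http://ip-10-5-4-24.ec2.internal:16030/conf | grep openregion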
Created 01-06-2016 12:12 PM
Hi @asinghal, it worked perfectly. Thanks!
Created 01-06-2016 01:36 PM
I think this calls for an Ambari JIRA so the stack advisor can recommend this setting?