
Can Phoenix local indexes create a deadlock during an HBase restart?

Contributor

Hi Guys,

I have been testing out Phoenix Local Indexes and I'm facing an issue after restarting the entire HBase cluster.

Scenario: I'm using Ambari 2.1.2 and HDP 2.3 with Phoenix 4.4 and HBase 1.1.1. My test cluster contains 10 machines, and the main table contains 300 pre-split regions, which implies 300 regions in the local index table as well. To configure Phoenix I'm following this tutorial.
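
To give a concrete idea of the setup, the Phoenix DDL looks roughly like this (the columns, split points and index name below are only illustrative, not the real schema; the table name is the one visible in the logs further down):

CREATE TABLE BIDDING_EVENTS (
    EVENT_ID   VARCHAR NOT NULL PRIMARY KEY,
    CAMPAIGN   VARCHAR,
    PRICE      DECIMAL,
    EVENT_TIME TIMESTAMP
)
SPLIT ON ('001', '002', '003');   -- ~300 split points in the real table

-- Phoenix 4.4 stores local index data in a separate table named
-- _LOCAL_IDX_BIDDING_EVENTS, co-located with the data table regions,
-- which is why the index table also ends up with 300 regions.
CREATE LOCAL INDEX BIDDING_EVENTS_CAMPAIGN_IDX ON BIDDING_EVENTS (CAMPAIGN);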

When I start a fresh cluster everything is just fine: the local index is created and I can insert data and query it using the index. The problem comes when I need to restart the cluster to update some configuration; at that point I'm not able to bring the cluster back up anymore. Most of the servers log exceptions like the one below, which suggests they get into a state where some region servers are waiting for regions that are not yet online on other region servers (a kind of deadlock).

INFO  [htable-pool7-t1] client.AsyncProcess: #5, table=_LOCAL_IDX_BIDDING_EVENTS, attempt=27/350 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region _LOCAL_IDX_BIDDING_EVENTS,57e4b17e4b17e4ac,1451943466164.253bdee3695b566545329fa3ac86d05e. is not online on ip-10-5-4-24.ec2.internal,16020,1451996088952
	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2898)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:947)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1991)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
	at java.lang.Thread.run(Thread.java:745)
 on ip-10-5-4-24.ec2.internal,16020,1451942002174, tracking started null, retrying after=20001ms, replay=1ops
INFO  [ip-10-5-4-26.ec2.internal,16020,1451996087089-recovery-writer--pool5-t1] client.AsyncProcess: #3, waiting for 2  actions to finish
INFO  [ip-10-5-4-26.ec2.internal,16020,1451996087089-recovery-writer--pool5-t2] client.AsyncProcess: #4, waiting for 2  actions to finish

While a server is throwing these exceptions I can see the following task status (I checked the size of the recovered.edits file and it is very small):

Description: Replaying edits from hdfs://.../recovered.edits/0000000000000464197
Status: Running pre-WAL-restore hook in coprocessors (since 48mins, 45sec ago)

Another interesting thing that I noticed is the empty coprocessor list for the servers that are stuck.

On the other hand, the HBase master goes down after logging messages like this:

GeneralBulkAssigner: Failed bulking assigning N regions

Any help would be awesome 🙂

Thank you

Pedro

1 ACCEPTED SOLUTION


Hi @Pedro Gandola

This problem occurs when the meta regions are not assigned yet and the preScannerOpen coprocessor hook waits to read the meta table for the local indexes, which causes the open-region threads to wait forever, resulting in a deadlock.

You can solve this by increasing the number of threads used to open regions, so that the meta regions can still be assigned even while the threads for the local index table are waiting; this removes the deadlock.

<property>
  <name>hbase.regionserver.executor.openregion.threads</name>
  <value>100</value>
</property>


10 REPLIES

Master Mentor

@Pedro Gandola do you have HBase Master High Availability on? We recommend running at least two masters at the same time. Also, we recommend using Ambari rolling restart rather than a stop-the-world restart of the whole cluster. With HA enabled, you can have one HBase master down and still maintain availability. You can also restart region servers one at a time, or set a time trigger so RS restarts happen every so often. The days of stopping everything to change a configuration in hbase-site are long gone; you don't need to stop the whole cluster.

Contributor

Hi @Artem Ervits, Thanks for the info.

I was using the HA master for testing. Regarding the full restart, you are right. I followed Ambari, which asks for a restart of all "affected" components after any configuration change, and I clicked the button :). Does Ambari do a proper rolling restart in this case? I know that it does when we click "Restart All Region Servers". I have done full restarts with Ambari before, but this problem only started after I introduced local indexes. I need to dig a bit more into it.

Thanks

Master Mentor

@Pedro Gandola the local indexes are in tech preview and, as with all TP features, there is no support from HWX until they are production ready. If you do find a solution, please post it here for the benefit of the community.

Contributor

@Artem Ervits, Sure! Thanks

Master Mentor

Ambari will restart everything that has stale configs. To get the best of both worlds (restarting components with stale configs while keeping the cluster up), go through each host and restart the components with stale configs per node, rather than per cluster as you were doing.

Master Mentor

Additionally, did you see the warning about using local indexes in Phoenix, given that they are a technical preview?

The local indexing feature is a technical preview and considered under development. Do not use this feature in your production systems. If you have questions regarding this feature, contact Support by logging a case on our Hortonworks Support Portal.


Hi @Pedro Gandola

This problem occurs when the meta regions are not assigned yet and the preScannerOpen coprocessor hook waits to read the meta table for the local indexes, which causes the open-region threads to wait forever, resulting in a deadlock.

You can solve this by increasing the number of threads used to open regions, so that the meta regions can still be assigned even while the threads for the local index table are waiting; this removes the deadlock.

<property>
  <name>hbase.regionserver.executor.openregion.threads</name>
  <value>100</value>
</property>
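
For reference, the default for hbase.regionserver.executor.openregion.threads is only 3. The property needs to be set in hbase-site.xml on every region server (in Ambari it can be added under Custom hbase-site) and the region servers have to be restarted to pick it up; 100 is simply a comfortably high value rather than a precisely tuned one.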

Contributor

Hi @asinghal, it worked perfectly. Thanks

Master Mentor

I think this calls for a JIRA with Ambari for the stack advisor?