Can Phoenix local indexes create a deadlock during an HBase restart?

Hi Guys,

I have been testing out Phoenix local indexes and I'm facing an issue after restarting the entire HBase cluster.

Scenario: I'm using Ambari 2.1.2 and HDP 2.3, with Phoenix 4.4 and HBase 1.1.1. My test cluster contains 10 machines, and the main table is pre-split into 300 regions, which implies 300 regions on the local index table as well. To configure Phoenix I'm following this tutorial.
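
For reference, the setup looks roughly like the sketch below. This is only a placeholder: the real schema, split points and ZooKeeper quorum are different, and the column names, the SALT_BUCKETS value and the class name here are made up for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Placeholder sketch of creating the table and local index through the Phoenix JDBC
// driver (assumes the phoenix-client jar is on the classpath). Column names, the
// SALT_BUCKETS value and the ZooKeeper quorum are illustrative only.
public class CreateLocalIndexSketch {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:phoenix:zk1.ec2.internal,zk2.ec2.internal,zk3.ec2.internal:2181:/hbase-unsecure";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {

            // Pre-split data table; SALT_BUCKETS pre-creates regions (explicit
            // SPLIT ON (...) points are the other common way to do this).
            stmt.execute("CREATE TABLE IF NOT EXISTS BIDDING_EVENTS ("
                    + "  EVENT_ID    VARCHAR NOT NULL PRIMARY KEY,"
                    + "  CAMPAIGN_ID VARCHAR,"
                    + "  PRICE       DOUBLE,"
                    + "  EVENT_TS    TIMESTAMP"
                    + ") SALT_BUCKETS = 30");

            // Local index; in Phoenix 4.4 this is backed by the shadow table
            // _LOCAL_IDX_BIDDING_EVENTS, co-located region-for-region with the data table.
            stmt.execute("CREATE LOCAL INDEX IF NOT EXISTS IDX_CAMPAIGN "
                    + "ON BIDDING_EVENTS (CAMPAIGN_ID)");
        }
    }
}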

When I start a fresh cluster everything is fine: the local index is created, and I can insert data and query it through the index. The problem comes when I need to restart the cluster to update some configurations; at that point I can no longer bring the cluster back up. Most of the servers log exceptions like the one below, which makes it look as if some region servers are waiting on regions that are not yet available on other region servers (a kind of deadlock).

INFO  [htable-pool7-t1] client.AsyncProcess: #5, table=_LOCAL_IDX_BIDDING_EVENTS, attempt=27/350 failed=1ops, last exception: org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Region _LOCAL_IDX_BIDDING_EVENTS,57e4b17e4b17e4ac,1451943466164.253bdee3695b566545329fa3ac86d05e. is not online on ip-10-5-4-24.ec2.internal,16020,1451996088952
	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2898)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:947)
	at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1991)
	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:32213)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2114)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:101)
	at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
	at java.lang.Thread.run(Thread.java:745)
 on ip-10-5-4-24.ec2.internal,16020,1451942002174, tracking started null, retrying after=20001ms, replay=1ops
INFO  [ip-10-5-4-26.ec2.internal,16020,1451996087089-recovery-writer--pool5-t1] client.AsyncProcess: #3, waiting for 2  actions to finish
INFO  [ip-10-5-4-26.ec2.internal,16020,1451996087089-recovery-writer--pool5-t2] client.AsyncProcess: #4, waiting for 2  actions to finish

While a server is throwing these exceptions I can see this message (I checked the size of the recovered.edits file and it is very small):

Description: Replaying edits from hdfs://.../recovered.edits/0000000000000464197
Status: Running pre-WAL-restore hook in coprocessors (since 48mins, 45sec ago)

Another interesting thing I noticed is that the coprocessor list is empty for the servers that are stuck.

On the other hand, the HBase master goes down after logging several messages like this:

GeneralBulkAssigner: Failed bulking assigning N regions

Any help would be awesome 🙂

Thank you

Pedro


10 REPLIES

Mentor

@Pedro Gandola do you have HBase Master High Availability enabled? We recommend running at least two masters at the same time. Also, we recommend using Ambari rolling restart rather than a stop-the-world restart of the whole cluster. With HA enabled, you can have one HBase master down and still maintain availability. You can also restart region servers one at a time, or set a time trigger for RS restarts. The days of stopping everything to change a configuration in hbase-site are long gone; you don't need to stop the whole cluster.

Hi @Artem Ervits, Thanks for the info.

I was using the HA master for testing. Regarding the full restart, you are right. I followed Ambari, which asks for a restart of all "affected" components after any configuration change, and I clicked the button :). Is Ambari doing a proper rolling restart in this case? I know that it does when we click "Restart All Region Servers". I have done full restarts with Ambari before, but this problem only started after I introduced local indexes. I need to dig a bit more into it.

Thanks

Mentor

@Pedro Gandola local indexes are in tech preview, and as with all TP features there is no support from HWX until they are production ready. If you do find a solution, please post it here for the benefit of the community.

@Artem Ervits, Sure! Thanks

Mentor

Ambari will restart everything that has stale configs. To get the best of both worlds (refreshing stale configs while keeping the cluster up), go through each host and restart the components with stale configs node by node, rather than cluster-wide as you were doing.

Mentor

Additionally, did you see the warning that local indexes in Phoenix are a technical preview?

The local indexing feature is a technical preview and considered under development. Do not use this feature in your production systems. If you have questions regarding this feature, contact Support by logging a case on our Hortonworks Support Portal.

Hi @Pedro Gandola

This problem occurs when the meta regions are not yet assigned and the preScannerOpen coprocessor hook blocks waiting to read the meta table for local indexes, which leaves the open-region threads waiting forever, i.e. a deadlock.

You can work around this by increasing the number of threads used to open regions, so that the meta regions can still get assigned even while the threads opening local index table regions are blocked; this breaks the deadlock:

<property>
  <name>hbase.regionserver.executor.openregion.threads</name>
  <value>100</value>
</property>
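
To picture why more open-region threads help, here is a toy Java sketch (not HBase code) of the same pool-starvation pattern: every worker in a small fixed pool blocks waiting for something that only another queued task in the same pool could provide, so nothing completes until the pool is made larger. The class name and the "openRegionThreads" variable are made up; only the property name it refers to comes from the fix above.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Toy model: "openRegionThreads" stands in for hbase.regionserver.executor.openregion.threads.
// The blocked tasks stand in for local-index region opens whose coprocessor waits on meta;
// the queued task stands in for the meta region open that never gets a thread.
public class OpenRegionStarvationDemo {
    public static void main(String[] args) throws Exception {
        int openRegionThreads = 2; // too small: every worker ends up blocked
        ExecutorService pool = Executors.newFixedThreadPool(openRegionThreads);
        CountDownLatch metaOnline = new CountDownLatch(1);

        // These fill the pool and block until "meta" is online.
        for (int i = 0; i < openRegionThreads; i++) {
            pool.submit(() -> {
                metaOnline.await(); // waits forever in this configuration
                return null;
            });
        }

        // The task that would bring "meta" online sits in the queue, never scheduled.
        Future<?> openMeta = pool.submit(metaOnline::countDown);

        TimeUnit.SECONDS.sleep(2);
        System.out.println("meta assigned? " + openMeta.isDone()); // prints false
        pool.shutdownNow(); // interrupt the stuck workers so the demo can exit
    }
}

Raising the pool size (to 100 in the property above) simply guarantees there is always a free worker for the meta regions; the region servers need a restart to pick up the new value.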

Hi @asinghal, It worked perfectly. Thanks

Mentor

I think this calls for a JIRA with Ambari for the advisor?

@Artem Ervits, Not right now, as we don't recommend using local indexes in production yet. Local indexes will probably be production-ready in the next HDP release (but I'm not sure), and the connection made during preScannerOpen (which accesses the meta/namespace tables) will be moved elsewhere to avoid the above problem.
