In Part 1 and Part 2 of this article series we discussed index internals and some frequently encountered issues. In this article we will cover a few more index issues in the form of scenarios.
Scenario 4: Index writes failing, client retries exhausted, and the handler pool saturated while index table regions are in transition.
Here the client is writing to a data table region on server1, which triggers an index update destined for server2 (via a server-to-server RPC).
If the index region is stuck in transition, the index update RPC hangs and eventually times out; as a result, the RPC between the client and server1 also gets stuck and times out. Since the client makes several retries for this mutation, each retry ties up another handler, saturating the handler pool on server1 and causing another "deadlock"-like situation.
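The cascade above can be sketched with a toy simulation. This is purely illustrative: the pool size, retry count, and function names are made up for the example and do not reflect real HBase internals, but the counting logic mirrors the failure mode, where each retry seizes a handler that then blocks forever on the hung index RPC.

```python
# Toy simulation of the handler-saturation cascade (illustrative numbers only).

HANDLER_POOL_SIZE = 5   # RPC handler threads on server1 (hypothetical)
CLIENT_RETRIES = 8      # client retries for the same mutation (hypothetical)

def simulate_retries(pool_size, retries):
    """Each retry grabs a handler that then blocks on a hung index RPC
    and is never released. Returns (handlers_stuck, retries_rejected)."""
    stuck = 0
    rejected = 0
    for _ in range(retries):
        if stuck < pool_size:
            stuck += 1      # handler now blocked waiting on the index region
        else:
            rejected += 1   # no free handler: all writes to server1 stall
    return stuck, rejected

stuck, rejected = simulate_retries(HANDLER_POOL_SIZE, CLIENT_RETRIES)
print(stuck, rejected)  # prints "5 3": all 5 handlers stuck, 3 retries starved
```

Once every handler is blocked, even unrelated writes to server1 queue up, which is why the symptom looks like a server-wide hang rather than a single slow query.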
This should be fixed in two steps:
Fix all index regions in transition (RITs) first; without this, neither client-side index maintenance nor a server-side index rebuild will succeed.
As a holistic tuning, keep the server-side RPC timeout (hbase.rpc.timeout) smaller than the Phoenix client-side query timeout (phoenix.query.timeoutMs), so that server-side RPCs are not left hanging on behalf of client queries that have already timed out.
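As a sketch of that tuning, the two properties could be set in hbase-site.xml roughly as follows. The property names are the real HBase/Phoenix ones mentioned above, but the values are illustrative assumptions; the only point is that the per-RPC timeout stays well below the overall query timeout.

```
<!-- hbase-site.xml (illustrative values, not recommendations) -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>60000</value>      <!-- per-RPC timeout: 60 s -->
</property>
<property>
  <name>phoenix.query.timeoutMs</name>
  <value>600000</value>     <!-- Phoenix client query timeout: 10 min -->
</property>
```

With this relationship, a stuck server-to-server index RPC fails fast and surfaces an error, instead of pinning a handler for the full duration of the client query timeout.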
Scenario 5: Row count mismatch between a Phoenix data table and its index table when the data table is bulk loaded with existing primary keys.
There is a limitation in CSV BulkLoad for Phoenix tables with secondary indexes. During a normal index update from the data table server to the index table server, the first step is to retrieve the existing row state from the index table, delete it, and then insert the updated row. CSV BulkLoad, however, skips the check-and-delete steps and directly upserts the data into the index table, producing duplicate index rows for the same primary key.
As of writing this article, the only workaround was to drop the index and rebuild it from scratch using IndexTool (the async way).
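The duplication mechanism can be shown with a toy model. This is not Phoenix code: the index table is modeled as a plain list of (indexed_value, primary_key) pairs, and the two functions are hypothetical stand-ins for the normal write path and the BulkLoad path described above.

```python
# Toy model of index maintenance, illustrating why CSV BulkLoad
# duplicates index rows (simplified structures, not real Phoenix code).

def normal_upsert(index_rows, pk, new_value):
    """Normal path: read and delete the existing index row for this PK,
    then insert the row for the new value."""
    index_rows[:] = [(v, k) for (v, k) in index_rows if k != pk]  # delete old
    index_rows.append((new_value, pk))                            # insert new

def bulkload_upsert(index_rows, pk, new_value):
    """BulkLoad path: no read/delete, the new row is written blindly."""
    index_rows.append((new_value, pk))

normal = [("v1", "pk1")]
bulk = [("v1", "pk1")]

normal_upsert(normal, "pk1", "v2")
bulkload_upsert(bulk, "pk1", "v2")

print(normal)  # [('v2', 'pk1')] -> one index row per PK
print(bulk)    # [('v1', 'pk1'), ('v2', 'pk1')] -> duplicate rows for pk1
```

After the blind insert, a count over the index table returns one more row than the data table for every re-loaded primary key, which is exactly the mismatch this scenario describes.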
Scenario 6: Region servers crashing, index table disabled, and ZK connections maxing out.
In some cases, region servers crashed due to long GC pauses, and index updates to other servers failed with exceptions such as "unable to create native threads", eventually leaving the index table in the "disabled" state. ZK connections from the region servers were also observed maxing out ("Too many connections" in the ZK log).
There can be many intertwined causes, with one issue triggering another, but PHOENIX-4685 was seen to play a part in many such incidents. In an attempt to update the index, region servers create ZooKeeper sessions to perform meta lookups; this connection cache is kept in the region server heap and eventually grows large, causing GC pauses that crash the server. Once a region server crashes, index updates against it fail, the index goes into the disabled state, and the vicious circle continues.
A careful examination of the situation and detailed log analysis are required, though, before attributing an incident to this bug.
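On the ZooKeeper side, the "Too many connections" message corresponds to the per-client connection cap, maxClientCnxns, in zoo.cfg. As a sketch of a temporary mitigation (the value below is an illustrative assumption, not a recommendation), the cap could be raised while the underlying Phoenix fix is applied:

```
# zoo.cfg (ZooKeeper server); value is illustrative.
# Raising the per-client-IP connection cap only buys time:
# the real fix for the leaked sessions is the PHOENIX-4685 patch.
maxClientCnxns=300
```

Raising this limit does not stop the connection cache from growing in the region server heap; it only delays the point at which ZooKeeper starts refusing connections.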
In Part 4 of this article series, we will talk about the relationship between Phoenix and Ranger.