Created on 11-17-2018 12:56 AM
In Part 1 and Part 2 of this article series, we discussed index internals and some frequently faced issues. In this article we will cover a few more index issues in the form of scenarios.
Scenario 4: Index writes are failing, client retries are exhausted, and the handler pool is saturated while index table regions are in transition.
Here the client is trying to write to the data table on server1, which triggers an index update destined for server2 (via a server-to-server RPC).
If the index region is stuck in transition, the index update RPC hangs and eventually times out, and because of this the RPC between the client and server1 also gets stuck and times out. Since the client makes several retries to write this mutation, handler saturation builds up again on server1, causing another "deadlock"-like situation.
This should be fixed in two steps:
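Independent of what those recovery steps look like in a given environment, one safeguard worth mentioning for this handler-saturation pattern is the dedicated index RPC handler pool described in the Phoenix secondary-indexing setup guide, which keeps server-to-server index RPCs from starving the handlers that serve client writes. The snippet below is only a sketch based on that guide; the handler-count value is illustrative and the property names should be verified against the Phoenix version in use.

<!-- hbase-site.xml on every region server (sketch; values are illustrative) -->
<!-- Route Phoenix index and metadata RPCs to their own priority handler pools -->
<property>
  <name>hbase.region.server.rpc.scheduler.factory.class</name>
  <value>org.apache.hadoop.hbase.ipc.PhoenixRpcSchedulerFactory</value>
</property>
<property>
  <name>hbase.rpc.controllerfactory.class</name>
  <value>org.apache.hadoop.hbase.ipc.controller.ServerRpcControllerFactory</value>
</property>
<!-- Optional: size of the dedicated index handler pool (assumed property name) -->
<property>
  <name>phoenix.rpc.index.handler.count</name>
  <value>30</value>
</property>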
Scenario 5: Row count mismatch between the Phoenix data table and index table when the data table is bulk loaded for existing primary keys.
There is a limitation in CSV BulkLoad for Phoenix tables with secondary indexes. When an index update is carried out from the data table server to the index table server, the first step is to retrieve the existing row state from the index table, delete it, and then insert the updated row. However, CSV BulkLoad does not perform these check-and-delete steps and directly upserts the data into the index table, creating duplicate index rows for the same primary key.
At the time of writing this article, the only workaround was to drop the index and rebuild it from scratch using IndexTool (the asynchronous way).
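For reference, here is a sketch of that workaround. The table name, index name, indexed column, and HDFS output path (MY_TABLE, MY_IDX, COL1, /tmp/MY_IDX_HFILES) are illustrative assumptions; verify the IndexTool options against the Phoenix version you are running.

-- From sqlline.py: drop the stale index and recreate it as ASYNC so that
-- no synchronous build is attempted at creation time
DROP INDEX MY_IDX ON MY_TABLE;
CREATE INDEX MY_IDX ON MY_TABLE (COL1) ASYNC;

# Then, from a shell, run the MapReduce-based IndexTool to populate the index
hbase org.apache.phoenix.mapreduce.index.IndexTool \
  --data-table MY_TABLE --index-table MY_IDX \
  --output-path /tmp/MY_IDX_HFILES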
Scenario 6: Region servers crashing, Index table disabled and ZK connections maxing out
In some cases, it was seen that region servers crashed due to long GC pauses, index updates to other servers failed with exceptions such as "Unable to create Native Threads", and the index table eventually went into the "disabled" state. It was also observed that ZK connections from the region servers were maxing out ("Too Many Connections" in the ZK log).
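To confirm the index state from the Phoenix side, one option (a sketch only; MY_IDX and MY_TABLE are illustrative names, and the exact catalog columns can vary between Phoenix versions) is to query SYSTEM.CATALOG from sqlline and, once the underlying problem has been addressed, trigger a rebuild:

-- List indexes and their single-character state codes
SELECT TABLE_SCHEM, TABLE_NAME, DATA_TABLE_NAME, INDEX_STATE
FROM SYSTEM.CATALOG
WHERE INDEX_STATE IS NOT NULL;

-- After the root cause is fixed, rebuild the affected index
ALTER INDEX MY_IDX ON MY_TABLE REBUILD;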
There could be many intertwined reasons as to which issue triggered the other, but PHOENIX-4685 was seen to play a part in many such cases. In an attempt to update the index, region servers create sessions with ZooKeeper to perform the meta lookup, and this connection cache is maintained in the region server heap. The cache eventually grows large and causes GC pauses leading to server crashes; once a region server crashes, index updates fail on that server, the index goes into the disabled state, and the vicious circle continues.
That said, a careful examination of the situation and detailed log analysis are required before concluding that this bug is the root cause.
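As a starting point for that analysis, here is a small sketch of how the ZooKeeper connection count per client host can be checked with the 'cons' four-letter command (zk-host:2181 is a placeholder; on newer ZooKeeper releases, four-letter commands must be whitelisted via 4lw.commands.whitelist):

# Count open ZK connections grouped by client IP to spot a region server
# that is leaking connections (zk-host is a placeholder)
echo cons | nc zk-host 2181 | grep -oE '/[0-9]+(\.[0-9]+){3}' | sort | uniq -c | sort -rn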
In Part 4 of this article series, we will talk about the relationship between Phoenix and Ranger.
Created on 07-14-2020 02:47 PM
@Thirupathi These articles were written with HDP 2.6.x versions in mind. With HDP 3 and CDH 6 shipping Phoenix 5.0, many of these issues have been resolved, but I cannot comment on a case-by-case basis here. You will need to log a support ticket for a more comprehensive discussion of specific JIRAs.