Hi, we are seeing an issue with our Cloudera setup. We have multiple data nodes, and on each node Phoenix is set up to query HBase tables.
What we have observed is that when one of the region servers goes through a garbage collection (GC) cycle, it becomes unavailable. Because that one region server is unavailable, our entire cluster becomes unavailable, and Phoenix is not able to connect to HBase and query the tables. We have around 40 region servers.
When we restart that region server, things come back to normal and Phoenix can query the tables again. Are we missing any specific configuration that would prevent all the region servers from being blocked when one of them is unavailable?
Thanks for using Cloudera Community. Based on the synopsis, your team is observing cluster-level impact when 1 RS undergoes a GC cycle and becomes unavailable. Things return to normal after the concerned RS is restarted.
There are 2 aspects here: (I) the RS undergoing a long GC cycle, and (II) the HBase cluster becoming unusable. Ideally, an HBase cluster won't become unusable if 1 RS is impacted. Having said that, if the RS is unresponsive, any query RPC handled by the concerned RS would be delayed, so the query responses would be delayed or time out. Under no circumstances is Phoenix's access to HBase itself impacted.
Please confirm whether your team's Phoenix queries are timing out or delayed when 1 RS is busy in a GC cycle (which is different from Phoenix being unable to connect to HBase). If the concerned RS is hosting "hbase:meta", a cluster-wide impact is feasible, since clients look up region locations in that table. As such, we need to focus on the RS undergoing GC for a long duration to mitigate any possible scenarios.
I have shared a blog via Link on GC for HBase. Additionally, check whether the RS GC cycles are causing a ZK timeout, or whether the GC time was less than the ZK timeout.
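As a rough way to run that last check, here is a minimal Python sketch that scans RegionServer log lines for JVM pause messages and flags any pause at or above the ZK session timeout. It assumes the pause messages contain the text "pause of approximately <N>ms" (modelled on the JvmPauseMonitor message; adjust the pattern if your logs differ). The function names are hypothetical, and the timeout is a parameter you must supply from your own "zookeeper.session.timeout" setting.

```python
import re

# Assumed log format, modelled on the JvmPauseMonitor message.
# Adjust the pattern if your RegionServer logs word this differently.
PAUSE_RE = re.compile(r"pause of approximately (\d+)ms")


def find_long_pauses(log_lines, zk_session_timeout_ms):
    """Return (pause_ms, line) pairs whose pause met or exceeded the ZK timeout."""
    hits = []
    for line in log_lines:
        match = PAUSE_RE.search(line)
        if match is None:
            continue
        pause_ms = int(match.group(1))
        if pause_ms >= zk_session_timeout_ms:
            hits.append((pause_ms, line.rstrip()))
    return hits


def max_pause_ms(log_lines):
    """Return the longest pause seen in the log, or 0 if none was reported."""
    pauses = [int(m.group(1)) for m in map(PAUSE_RE.search, log_lines) if m]
    return max(pauses, default=0)
```

To use it, read the RegionServer log into a list of lines and call find_long_pauses(lines, timeout_ms) with your configured timeout. An empty result together with a large max_pause_ms value would suggest the pauses are long but still below the ZK timeout.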
@smdas Many thanks for your reply. We are looking at the inputs you shared and going through the link you shared on GC tuning.
While we are doing that, I wanted to bring up a point: we have a read/write-heavy cluster. We run lots of selects and, based on the results, we upsert / insert into HBase via Phoenix queries.
Do you think that could be a bottleneck too? We came across this link https://docs.cloudera.com/runtime/7.2.2/configuring-hbase/topics/hbase-advanced-configuration-write-... while trying to resolve the issue.
Are there any configurations there that we need to check and set, which could also help alongside the recommendations you have provided?
Thanks for the update. The link you shared deals with improvements from the MemStore flush and eventual HFile compaction perspective. Currently, I am unfamiliar with the blockers impacting your environment.
For example: if MemStore writes are being delayed, we can consider reviewing the flusher threads. Similarly, if compaction is a concern (courtesy of too many HFiles), reviewing the compaction thread count would help. If MemStore writes are being blocked owing to too many WALs, it's worth checking "hbase.hstore.flusher.count" and "hbase.regionserver.max.logs". Most importantly, how is HDFS performance, and is there any hot-spotting?
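To see what those two properties are currently set to, a minimal Python sketch like the following could read them out of an hbase-site.xml. It assumes you can obtain the effective hbase-site.xml text for the RegionServer (on a Cloudera Manager cluster the values may instead be managed through the CM configuration pages). The function name is hypothetical, and a value of None only means the property is not set explicitly in that file.

```python
import xml.etree.ElementTree as ET

# The two properties named in the reply above.
PROPERTIES_TO_REVIEW = (
    "hbase.hstore.flusher.count",
    "hbase.regionserver.max.logs",
)


def read_hbase_properties(xml_text, names=PROPERTIES_TO_REVIEW):
    """Return {property name: value or None} for the requested properties.

    xml_text is the content of an hbase-site.xml file, which uses the
    <configuration><property><name/><value/></property></configuration> layout.
    """
    root = ET.fromstring(xml_text)
    found = {name: None for name in names}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in found:
            value = prop.findtext("value")
            found[name] = value.strip() if value is not None else None
    return found
```

Running this against the file from each RegionServer would also show whether the settings differ between nodes, which ties in with narrowing the investigation to specific RegionServers.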
In short, evaluating read and write performance collectively would be a large scope for your team. I would recommend starting with a narrower scope (either read or write, all tables or a specific table, all RegionServers or a specific RegionServer) and proceeding accordingly.
Hope you are doing well. We wish to confirm whether you have identified the cause of the issue. If yes, kindly share it to benefit our fellow community users as well. If no further assistance is required, please mark the post as solved.