Member since: 01-16-2018
Posts: 613
Kudos Received: 48
Solutions: 109

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 1434 | 04-08-2025 06:48 AM |
|  | 1703 | 04-01-2025 07:20 AM |
|  | 1702 | 04-01-2025 07:15 AM |
|  | 1358 | 05-06-2024 06:09 AM |
|  | 2078 | 05-06-2024 06:00 AM |
03-20-2021
02:42 AM
Hello @JB0000000000001, thank you for the kind words. I intended to reciprocate the level of detail you posted in the question, which was helpful for us. Based on your update, a few pointers:

1. The GC flags indicate your team is using CMS and, as you mentioned, GC-induced JVMPauses would show up in the logging you highlighted. Note that "No GCs detected" is observed when a JVMPause isn't GC-induced. If your team observes JVMPause tracing with "No GCs detected", the shared Community post is a good pointer for evaluating host-level concerns.

2. A 10 GB RegionServer heap is unlikely to require 60 seconds of cleanup (60 seconds being the HBase-ZooKeeper timeout), unless the 10 GB is being filled frequently and we have a continuous GC cycle. In other words, if one FullGC is immediately followed by another (visible in the GC logs), we wouldn't see a single large FullGC, yet the cumulative impact is worth reviewing and warrants revisiting the Eden-to-Tenured generation ratio.

3. While sharding at 3x the RegionServer count is good for parallelism, the write pattern matters, based on the RowKey. If the table's RowKey concentrates writes into Regions on one RegionServer, that RegionServer becomes a bottleneck. While writes are being carried out, reviewing the table view in the HMaster UI offers insight into this.

4. The RegionServer's *out file would capture the OOME, but for the previous process instance, not the current one. Additionally, the JVM flags can be adjusted to include -XX:+HeapDumpOnOutOfMemoryError and the related parameter for the heap-dump path. Link [1] covers the same; a combined sketch of points 4 and 5 follows below.

5. One pointer is to review the HMaster logs for "Ephemeral" (preferably case-insensitive) tracing. The HMaster logs when an ephemeral ZNode is deleted; from there, review the logs of the RegionServer whose ephemeral ZNode was removed. This matters because your team is (maybe) reviewing the Spark job log for RegionServer unavailability and then tracing the RS logs; the HMaster approach is more convenient and accurate.

Do keep us posted on how things go.

- Smarak

[1] Command-Line Options - Troubleshooting Guide for HotSpot VM (oracle.com)
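A minimal sketch of points 4 and 5, assuming an hbase-env.sh-managed deployment and default HDP-style log paths (both are assumptions; adjust to your environment):

```bash
# Point 4: capture a heap dump when the RegionServer hits an OOME.
# Append to hbase-env.sh; the dump path is a placeholder, so make sure the
# target disk has room for a full heap dump.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/var/log/hbase/heapdumps"

# Point 5: case-insensitive search of the active HMaster log for
# ephemeral-ZNode deletions, then pivot to that RegionServer's logs.
grep -i "ephemeral" /var/log/hbase/hbase-hbase-master-*.log
```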
03-20-2021
01:30 AM
Hello @Priyanka26, thanks for the update. I haven't tried these steps, but they look fine on paper. As you are taking a backup of the data directory, we would also have the HFiles available should any concerns arise. Do let us know how things go and, most importantly, do plan to upgrade to HDP v3.1.5. - Smarak
03-19-2021
08:13 AM
Hello @Priyanka26, thanks for the update. The referred JARs aren't available for download. Unfortunately, I am not familiar with any means other than manual intervention (starting HBase on a new data directory and bulk-loading from the previous one being one option; a sketch follows below). Such issues aren't present from HDP v3.1.5 onwards. If I find anything, I shall let you know, yet it's highly unlikely we will come across an easier solution. - Smarak
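For the manual route, a hedged sketch of bulk-loading HFiles preserved from the previous data directory into a freshly created table; all paths and the table name are placeholders, and the tool's class name varies by HBase version:

```bash
# The source directory must contain one subdirectory per column family,
# holding the HFiles copied out of the previous data directory.
# On HBase 2.x the tool is org.apache.hadoop.hbase.tool.LoadIncrementalHFiles;
# on 1.x it lives under org.apache.hadoop.hbase.mapreduce.
hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles \
  hdfs:///backup/hbase/my_table_hfiles \
  my_table
```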
03-19-2021
12:01 AM
1 Kudo
Hello @JB0000000000001, thanks for using Cloudera Community. First of all, I really appreciate your detailed analysis of the issue.

In short, a Spark job writes a month's worth of data into HBase once a month. Intermittently, the Spark job fails in certain months, and your team observed ServerNotRunningYetException during the concerned period. The primary issue appears to be a RegionServer being terminated (for whatever reason) and the Master re-assigning its Regions to other active RegionServers. Any Region remains in transition (RIT) until the Region is closed on the now-down RegionServer, the WAL is replayed, and the Region is opened on the newly assigned RegionServer. Typically, the WAL replay may be taking long enough that the Spark executor tasks are retried and exhausted during the concerned period. Again, this is my assumption based on the facts you laid out.

So, what can be addressed: (1) the RegionServer being terminated, and (2) the Spark executor task failures.

For (2), we can increase the task-failure tolerance from the default (4) to a higher value, so that the collective retries span longer than the time taken for a Region to transition from one RS to another, including WAL replay. As the job runs once a month, this config change appears to be an easy way out.

For (1), I am assuming the RSs are going down owing to exceeding their ZooKeeper timeout during garbage collection (likely FullGC, given the memory usage). As such, we have the following pointers:

1. If a RegionServer is failing with an Out-Of-Memory error, the process JVM *out file would capture it.
2. If a RegionServer is failing owing to a GC-induced JVMPause exceeding the ZooKeeper timeout, we can check the logs for the JVMPause duration versus the ZooKeeper timeout. If JVMPauses exceed the ZooKeeper timeout, increasing the timeout to a value higher than the highest JVMPause would help. The downside of this change is that a genuine RegionServer failure would be detected later.
3. If JVMPause is the cause, review the GC logs. Depending on CMS or G1 GC, the logs offer a lot of detail on the JVMPause; proceed accordingly. Link [1] offers great info by Plumbr.
4. Your MemStore is allocated 25% of the RegionServer heap. You have written two heap sizes (32 GB and 20 GB), so I am unsure which one applies. Either way, the above indicates the MemStore is likely being filled quickly. Your team has an off-heap BucketCache yet still an extremely low value for the write cache (MemStore). I assume this is causing the YoungGen to fill quickly, triggering MinorGCs and subsequently FullGCs, which cause the ZooKeeper timeout. We can try giving the MemStore a bit more space. As your team didn't complain of performance, I assume the frequent MemStore flushes aren't causing any HDFS impact.
5. Check for any hot-spotting, i.e. writes being handled by a few Regions/RegionServers, thereby overloading one RegionServer.

A sketch of the config knobs for (1) and (2) follows below. Finally, has your team considered bulk-loading to bypass the memory factor in load jobs like the one you describe?

- Smarak

[1] https://plumbr.io/handbook/garbage-collection-algorithms-implementations
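A hedged sketch of the two knobs discussed above; the values are illustrative assumptions to be tuned against your observed JVMPause and WAL-replay times, not recommendations:

```bash
# (2) Raise the Spark task-retry budget from the default of 4 for the monthly
# load job (the job class and JAR names are placeholders):
spark-submit \
  --conf spark.task.maxFailures=8 \
  --class com.example.MonthlyHBaseLoad \
  monthly-load.jar

# (1) Properties to review in hbase-site.xml (example values only):
#   zookeeper.session.timeout                = 90000  # ms; set above the highest JVMPause
#   hbase.regionserver.global.memstore.size  = 0.35   # raise the write cache from 0.25
```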
03-18-2021
10:39 PM
Hello @Priyanka26, thanks for using Cloudera Community. Based on the post, your team has the Namespace Region "0c72d4be7e562a2ec8a86c3ec830bdc5" blocking Master startup initialization, and using HBCK2 throws a Kerberos exception. In HDP v3.1.0 there is a bug wherein the HBCK2 JAR can't be used with the available hbase-client and hbase-server JARs on a secure cluster. There is no issue with the way your team is using HBCK2; owing to the bug mentioned above, the HBCK2 JAR throws the concerned exception. Without the modified hbase-client and hbase-server JARs, we can try to re-initialize the HBase cluster, but only if it isn't a production cluster. - Smarak
03-16-2021
01:39 AM
Hello @Kenzan, thanks for the update. Typically, such issues lie with balancing, and TRACE logging prints the finer details. Note that "StochasticLoadBalancer" is the default, while "SimpleLoadBalancer" (set by your team) extends BaseLoadBalancer. I have shared two links documenting the two balancers and their running configurations; a quick way to confirm which balancer is configured is sketched below. As your issue has been resolved, kindly mark the post as Resolved so we can close it as well. Thanks again for being a Cloudera Community member and contributing. - Smarak [1] https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/balancer/SimpleLoadBalancer.html [2] http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.html
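A small check, assuming an HDP-style config path (an assumption); the property name itself is standard:

```bash
# Confirm which balancer the Master is configured with; if the property is
# absent, the default StochasticLoadBalancer is in effect.
grep -A 1 "hbase.master.loadbalancer.class" /etc/hbase/conf/hbase-site.xml
```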
03-16-2021
12:41 AM
Hello @sheshk11
Thanks for sharing your knowledge (Knowledge Article) on managing the DISABLING table. As @tencentemr mentioned, it has been helpful. A few other details I wish to add:
1. On HBase v2.x, the HBCK2 setTableState from Link [1] can perform the same operation. The advantage is that manual intervention is avoided, preventing any unintended HBase metadata manipulation (a command sketch follows after the reference link below).
2. In certain cases, the Regions belonging to the table would be in transition as well. If we are disabling the table, it's best to review the RegionState for the table too; the HBCK2 setRegionState from Link [1] can assist here.
As the post is a Knowledge Article, I shall mark it as Resolved. Thank you for posting it to assist fellow Community members.
- Smarak
[1] https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
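A hedged sketch of the HBCK2 invocations referenced above, assuming the hbase-hbck2 JAR has been built or downloaded locally; the JAR path, table name, and encoded Region name are placeholders:

```bash
# 1. Set the table state directly in hbase:meta (HBase 2.x), avoiding manual edits:
hbase hbck -j /path/to/hbase-hbck2.jar setTableState my_table DISABLED

# 2. If Regions of the table are stuck in transition, align their state as well
#    (the command takes the encoded Region name):
hbase hbck -j /path/to/hbase-hbck2.jar setRegionState 0123456789abcdef0123456789abcdef CLOSED
```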
03-16-2021
12:21 AM
Hello @novice_tester, thanks for using Cloudera Community. To your query, Flume has been replaced by CFM (Cloudera Flow Management). Link [1] covers the details of the various components deprecated in CDP, and Link [2] covers additional details on the same by @TimothySpann. - Smarak [1] https://docs.cloudera.com/cdp-private-cloud/latest/release-guide/topics/cdpdc-rt-updated-cdh-components.html [2] https://www.datainmotion.dev/2019/08/migrating-apache-flume-flows-to-apache.html
03-14-2021
01:16 AM
Hello @Rjkoop, thanks for posting the update and confirming the question has been resolved. In short, the Article requires setting the three configurations you specified ["hbase.security.exec.permission.checks", "hbase.security.access.early_out", "hfile.format.version"] along with enabling "HBase Secure Authorization" (mandatory for enabling "HBase Cell-Level ACLs"); a quick verification sketch follows below. Additionally, Link [1] documents the ACL functionality in detail. As the post is solved, I shall mark it likewise. - Smarak [1] https://hbase.apache.org/book.html#hbase.accesscontrol.configuration
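A small verification sketch, assuming a standard HDP-style client config layout (the path is an assumption):

```bash
# Confirm the three properties from the Article landed in hbase-site.xml;
# -A 1 prints the value line that follows each property name.
for p in hbase.security.exec.permission.checks \
         hbase.security.access.early_out \
         hfile.format.version; do
  grep -A 1 "$p" /etc/hbase/conf/hbase-site.xml
done
```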
03-13-2021
11:47 PM
Hello @Kenzan, thanks for using Cloudera Community. Based on the synopsis, your team has one RegionServer being allocated no Regions, and deleting the RegionServer and adding it back afresh doesn't help either. The first screenshot of the log shows the RegionServer received a ZooKeeper session expiry: it's likely the RegionServer experienced a ZooKeeper timeout, or the Master didn't receive any heartbeat from it. As only one RegionServer is impacted, review host-level concerns (CPU/memory) if the RegionServer is being aborted (likely no relationship with the zero-Region assignment). Coming to the zero-Region assignment: enable TRACE logging for the HMaster Balancer thread, or briefly enable complete HMaster TRACE logging (HMaster UI > LogLevel > "org.apache.hadoop.hbase" with TRACE for "Set Log Level"; a command-line sketch follows below). This enables TRACE logging for the HMaster service and captures any Balancer-associated tracing, which would confirm why the AssignmentManager is skipping the RegionServer during Region assignment. Once the TRACE logging has captured a Balancer run and confirmed why Region balancing is being skipped, we can set the logging back to INFO. - Smarak
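If the UI route is inconvenient, the same toggle can likely be done from the command line via the generic Hadoop daemonlog client against the HMaster's /logLevel servlet; the host name and the default HMaster info port (16010) are assumptions:

```bash
# Flip HMaster logging to TRACE for the HBase packages:
hadoop daemonlog -setlevel hmaster-host.example.com:16010 org.apache.hadoop.hbase TRACE

# ...wait for a Balancer run to appear in the HMaster log, then revert:
hadoop daemonlog -setlevel hmaster-host.example.com:16010 org.apache.hadoop.hbase INFO
```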