About smdas

smdas · ‎03-20-2021

Hello @Priyanka26 Thanks for the update. I haven't tried these steps yet, but they look fine on paper. As you are taking a backup of the Data Directory, we would still have the HFiles should any concerns arise. Do let us know how things go and, most importantly, do plan to upgrade to HDP v3.1.5. - Smarak

smdas · ‎03-19-2021

Hello @Priyanka26 Thanks for the update. The referred JARs aren't available for download. Unfortunately, I am not familiar with any means other than manual intervention (starting HBase on a new DataDir and bulk-loading from the previous DataDir being one of them). Such issues aren't present from HDP v3.1.5 onwards. If I find anything, I shall let you know, yet it's highly unlikely that we will come across an easier solution. - Smarak

smdas · ‎03-19-2021

Hello @WayneWang As we haven't received any further update, we are closing the Post, assuming the issue was handled by the steps shared above [1]. In HBase v1.x, we have limited choices, whereas HBase v2.x uses a new AssignmentManager (details in HBASE-12439), which assists in managing RIT without ZooKeeper involvement. Thanks for using Cloudera Community. - Smarak [1] https://community.cloudera.com/t5/Support-Questions/Hbase-How-to-fix-failed-regions/m-p/312263/highlight/true#M225066

smdas · ‎03-19-2021

Hello @SurajP Kindly confirm whether the issue is resolved. If yes, please share the steps for our fellow Community Members and mark the Post as resolved. If the issue persists, please share the ZooKeeper logs. - Smarak

smdas · ‎03-19-2021

Hello @JB0000000000001 Thanks for using Cloudera Community. First of all, I really appreciate your detailed analysis of the issue. In short, a Spark Job writes a month's worth of data into HBase once a month. Intermittently, the Spark Job fails for certain months, and your team observed ServerNotRunningYetException during the concerned period.

The primary issue appears to be a RegionServer being terminated (owing to certain reasons) and the Master re-assigning its Regions to other active RegionServers. A Region remains in Transition (RIT) until it is closed on the now-down RegionServer, its WAL is replayed, and it is opened on the newly assigned RegionServer. Typically, the WAL Replay may be taking time, causing the Spark Executor Task to be retried and to fail during that period. Again, this is my assumption based on the facts laid out by you.

So, what can be avoided primarily: 1. the RegionServer being terminated, 2. the Spark Executor Task failure.

For 2, we can definitely increase the allowed Task failures (spark.task.maxFailures) from the default (4) to a higher value, so that the retries collectively outlast the time taken for a Region to transition from one RS to another, including WAL Replay. As the Job runs once a month, this config change appears to be an easy way out.

For 1, I am assuming the RS are going down because they exceed their ZooKeeper Timeout during Garbage Collection (likely Full GC, seeing the memory usage). As such, we have the following pointers:

If a RegionServer is failing with Out-Of-Memory, the process JVM *out file would capture the same.

If a RegionServer is failing owing to a JVMPause from GC exceeding the ZooKeeper Timeout, we can check the logs for the JVMPause duration and the ZooKeeper Timeout. If the JVMPause exceeds the ZooKeeper Timeout, increasing the ZooKeeper Timeout to a value higher than the highest JVMPause would help. The downside of this change is that a RegionServer failure would take longer to be detected.

If JVMPause is the cause, review the GC logs. Depending on CMS or G1 GC, the logs offer a lot of detail about the JVMPause, and you can proceed accordingly. Link [1], by Plumbr, offers great information.

Your MemStore is allocated 25% of the RegionServer heap. You have mentioned 2 heap sizes (32GB & 20GB), so I am unsure which one applies. Either way, the MemStore is likely being filled quickly. Your team has an Off-Heap Bucket Cache yet still has an extremely low value for the write cache (MemStore). I am assuming this is causing the YoungGen to fill up quickly, triggering Minor GC and, subsequently, Full GC, which causes the ZooKeeper Timeout. We can try giving the MemStore a bit more space. As your team didn't complain about performance, I am assuming the frequent MemStore flushes aren't causing any HDFS impact.

Also check for any hot-spotting, i.e. writes being handled by a few Regions/RegionServers, thereby overloading those RegionServers.

Finally, has your team considered using Bulk-Loading to bypass the memory factor in load jobs like the one you are describing?

- Smarak [1] https://plumbr.io/handbook/garbage-collection-algorithms-implementations
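The comparison of JVM pauses against the ZooKeeper Timeout described above can be sketched as a small log scan. This is only a minimal sketch, assuming the RegionServer log carries JvmPauseMonitor lines of the form "Detected pause in JVM or host machine (eg GC): pause of approximately NNNNms". The sample lines, the function names and the 90000 ms timeout are illustrative assumptions, not values taken from this thread.

```python
import re

# Assumed JvmPauseMonitor message format; verify it against your own RegionServer logs.
PAUSE_RE = re.compile(
    r"Detected pause in JVM or host machine.*?pause of approximately (\d+)ms"
)


def jvm_pauses_ms(log_lines):
    """Return every JVM pause duration (in ms) found in the given log lines."""
    pauses = []
    for line in log_lines:
        match = PAUSE_RE.search(line)
        if match:
            pauses.append(int(match.group(1)))
    return pauses


def pauses_exceeding_timeout(log_lines, zk_timeout_ms):
    """Return the pauses long enough to outlast the ZooKeeper session timeout."""
    return [p for p in jvm_pauses_ms(log_lines) if p > zk_timeout_ms]


# Hypothetical sample lines, written for illustration only.
sample_log = [
    "2021-03-01 02:10:11,123 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (eg GC): pause of approximately 4200ms",
    "2021-03-01 02:15:42,456 INFO  [JvmPauseMonitor] util.JvmPauseMonitor: "
    "Detected pause in JVM or host machine (eg GC): pause of approximately 95000ms",
    "2021-03-01 02:15:43,001 INFO  [main] regionserver.HRegionServer: unrelated line",
]

zk_timeout_ms = 90000  # hypothetical value of zookeeper.session.timeout

print(max(jvm_pauses_ms(sample_log)))
print(pauses_exceeding_timeout(sample_log, zk_timeout_ms))
```

If the second list is non-empty, the highest pause in it gives a lower bound for a raised ZooKeeper Timeout, with the trade-off already noted: a genuinely dead RegionServer takes longer to be detected.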

smdas · ‎03-18-2021

Hello @Priyanka26 Thanks for the update. I see you have posted the question in a new Post, and I have responded to it over there. Let us know where you stand with respect to the RIT for the "prod.timelineservice.entity" Table. If you have solved the issue, kindly share the steps for our fellow Community Users. If the issue remains, please share the details discussed in our response. - Smarak

smdas · ‎03-18-2021

Hello @Priyanka26 Thanks for using Cloudera Community. Based on the Post, your team has the Namespace Region "0c72d4be7e562a2ec8a86c3ec830bdc5" blocking the Master startup initialization, and using HBCK2 throws a Kerberos exception. In HDP v3.1.0, we have a bug wherein the HBCK2 JAR can't be used with the available Hbase-Client & Hbase-Server JARs in a secure cluster. There is no issue with the way your team is using HBCK2; the HBCK2 JAR throws the exception owing to that bug. Without the modified Hbase-Client & Hbase-Server JARs, we can try to re-initialize the HBase cluster, yet only if it isn't a Production cluster. - Smarak

smdas · ‎03-16-2021

Hello @Kenzan Thanks for the update. Typically, such issues lie with balancing, and TRACE logging prints the finer details of it. Ideally, we should have "StochasticLoadBalancer" as the default, while "SimpleLoadBalancer" (set by your team) extends BaseLoadBalancer. I have shared 2 links [1] [2] documenting the 2 balancers and their running configurations. As your issue has been resolved, kindly mark the Post as resolved so that we can close it as well. Thanks again for being a Cloudera Community Member and for contributing. - Smarak [1] https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/balancer/SimpleLoadBalancer.html [2] http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.html

smdas · ‎03-16-2021

Hello @sheshk11 Thanks for sharing your knowledge (Knowledge Article) on managing a Table stuck in DISABLING. As @tencentemr mentioned, it has been helpful. A few other details I wish to add: 1. On HBase v2.x, the HBCK2 setTableState command (Link [1]) can perform the same. Its advantage is that it avoids manual intervention and, with it, any unintended HBase metadata manipulation. 2. In certain cases, the Regions belonging to the Table would be in Transition as well. If we are disabling the Table, it's best to review the RegionState for the Table too. The HBCK2 setRegionState command (Link [1]) can assist here. As the Post is a KA, I shall mark it as resolved. Thank you for posting it to assist fellow Community Members. - Smarak [1] https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
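The two HBCK2 commands mentioned above could be assembled along these lines. This is a minimal sketch that only builds the command lines and does not run them, assuming the "hbase hbck -j <jar> <command>" invocation described in the HBCK2 README [1]. The JAR path, table name and encoded region name are hypothetical placeholders, and the helper function names are made up for illustration.

```python
import shlex

# Table states listed for setTableState in the HBCK2 README.
TABLE_STATES = {"ENABLED", "DISABLED", "DISABLING", "ENABLING"}


def hbck2_command(jar_path, *args):
    """Build (but do not run) an 'hbase hbck -j <jar> ...' command line."""
    return ["hbase", "hbck", "-j", jar_path, *args]


def set_table_state(jar_path, table, state):
    """Command that sets the state of a table."""
    if state not in TABLE_STATES:
        raise ValueError(f"unsupported table state: {state}")
    return hbck2_command(jar_path, "setTableState", table, state)


def set_region_state(jar_path, encoded_region, state):
    """Command that sets the state of one region, given its encoded name."""
    return hbck2_command(jar_path, "setRegionState", encoded_region, state)


jar = "/tmp/hbase-hbck2.jar"  # hypothetical path to the HBCK2 JAR
print(shlex.join(set_table_state(jar, "my_table", "DISABLED")))
print(shlex.join(set_region_state(jar, "0123456789abcdef0123456789abcdef", "CLOSED")))
```

Printing the commands first leaves room to review them before anything is run against the cluster metadata.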

smdas · ‎03-16-2021

Hello @novice_tester Thanks for using Cloudera Community. To your query, Flume has been replaced by CFM (Cloudera Flow Management). Link [1] covers the details of the various components being deprecated in CDP. Link [2], by @TimothySpann, covers additional details on the same. - Smarak [1] https://docs.cloudera.com/cdp-private-cloud/latest/release-guide/topics/cdpdc-rt-updated-cdh-components.html [2] https://www.datainmotion.dev/2019/08/migrating-apache-flume-flows-to-apache.html

Member Since	‎01-16-2018 09:55 AM
Last Visited	‎01-12-2026 06:15 AM
Posts	613
Kudos received	48
