Community Articles

Cloudera Employee

We frequently get asked questions by customers about how to run, deploy and/or debug Cloudera’s Operational Database, which is based on Apache HBase & Phoenix. This post shares the most frequently asked questions and how they were addressed.

 

  • Can I run HBase on HDP 7.1.4 for my production environment?

    Answer: 
    No. HDP 7.1.4 is only an intermediate step in the upgrade from HDP 2 or HDP 3 to CDP. Once HDP 7.1.4 is deployed, you need to use the AM2CM migration tool to complete the upgrade to CDP. HDP 7 is not supported for production environments. HBase is not expected to be stable on HDP 7, but it will be stable once the upgrade to CDP is complete.

  • How can I repair my HBase metadata?

    Answer: Use the hbck2 meta repair feature. For more information about this feature, see this GitHub link. Alternatively, if HBase holds no important data, you can clean up the whole HBase root directory in HDFS and remove the HBase znode from ZooKeeper using the deleteall /hbase command. This alternative might require you to reinitialize the HBase folders using the reinit Cloudera Manager command.
    Please file a support case to get the correct version of the hbck2 tool for your deployment.

  • Can I skip the hbase pre-upgrade validate-cp commands during the Check co-processor class pre-upgrade step if SecureBulkLoadEndpoint is the only error I get?

    Answer: Yes, you can skip the co-processor class validation if SecureBulkLoadEndpoint is the only error. SecureBulkLoadEndpoint is native in HBase; there is no need to specify it as a co-processor anymore. In a CDH 5 cluster, Cloudera Manager automatically adds SecureBulkLoadEndpoint as a co-processor to the hbase-site.xml configuration file. When upgrading to a CDP cluster, this parameter is not added to the hbase-site.xml configuration file by Cloudera Manager as it was moved into the core-site.xml configuration file.

  • Does the block_cache_hit_ratio metric consider all I/O, including, for example, I/O on tables that have BLOCKCACHE => false? Does the block_cache_express_hit_ratio metric consider only the tables that can use the block cache?

    Answer: The different block_cache_hit_ratio metrics are collected only from tables that use the block cache. The hit ratio equals (total number of block cache hits) / (total number of block cache requests). Both counters are incremented only for HBase GET and SCAN operations on tables that have the block cache enabled. Each RegionServer has only one block cache, which is shared by all the regions hosted on that RegionServer. So, there is a single hit ratio per RegionServer.
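
    The ratio described above can be illustrated with a minimal sketch. The function name and the counter values are hypothetical; on a real cluster the two counters come from the RegionServer metrics.

```python
def block_cache_hit_ratio(hit_count: int, request_count: int) -> float:
    """Return hits / requests, or 0.0 when no requests have been counted yet."""
    if hit_count < 0 or request_count < 0:
        raise ValueError("counters must be non-negative")
    if hit_count > request_count:
        raise ValueError("hits cannot exceed requests")
    if request_count == 0:
        return 0.0
    return hit_count / request_count


# Hypothetical counter values for a single RegionServer.
ratio = block_cache_hit_ratio(hit_count=9_000, request_count=10_000)
print(f"hit ratio: {ratio:.2%}")
```

    Reads on tables with BLOCKCACHE => false would not change either argument, which is why they do not lower the ratio.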

  • My queries against an HBase table from my clients are failing due to timeout. How can I investigate this issue?

    Answer: Tune the hbase.ipc.warn.response.time parameter (the default value is 10000, meaning 10 seconds) so that you get a “responseTooSlow” warning in the RegionServer log about slow operations. The warning includes the client IP, queue time, processing time, and region information, to mention a few. Note that the warning is logged only when the execution of the operation itself was slow. It is not triggered when the request only waited a long time in the RegionServer queue before being processed.
    Additionally, you can check the queue lengths in the RegionServer metrics to see if the RegionServer is struggling with some operations. The number of handlers for each operation type can be fine-tuned as well. If you want to check a specific row key, you can implement a custom coprocessor that gets triggered before each GET or SCAN and can count/log information about a given row key.
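
    To compare queue time with processing time across many warnings, the “responseTooSlow” lines can be parsed with a short script. This is a sketch that assumes the warning carries a JSON payload with client, method, queuetimems, and processingtimems fields; the exact fields can differ between HBase versions, and the sample line is illustrative, not taken from a real cluster.

```python
import json
import re
from typing import Optional

# Captures the JSON payload that follows the "(responseTooSlow):" tag.
_SLOW_RE = re.compile(r"\(responseTooSlow\):\s*(\{.*\})\s*$")


def parse_response_too_slow(line: str) -> Optional[dict]:
    """Return selected fields from a responseTooSlow line, or None if it does not match."""
    match = _SLOW_RE.search(line)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return {
        "client": payload.get("client"),
        "method": payload.get("method"),
        "queue_ms": payload.get("queuetimems"),
        "processing_ms": payload.get("processingtimems"),
    }


# Illustrative log line (assumed format, hypothetical values).
sample = (
    '2021-01-01 12:00:00,000 WARN  [handler] ipc.RpcServer: (responseTooSlow): '
    '{"call":"Scan","starttimems":1609502400000,"responsesize":1024,'
    '"method":"Scan","processingtimems":15234,"client":"10.0.0.12:51234",'
    '"queuetimems":3,"class":"HRegionServer"}'
)
info = parse_response_too_slow(sample)
print(info)
```

    A large processing_ms with a small queue_ms points at the operation itself, while growing queue_ms values suggest that the handlers are saturated.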

  • How does HBase Major compaction work?

    Answer: By default, HBase major compaction runs once every 7 days. You can change this interval by updating the value of the hbase.hregion.majorcompaction property (it is expressed in milliseconds, so its default value is 604800000). The hbase.hregion.majorcompaction.jitter property is a multiplier applied to that interval (0.5 by default in Apache HBase). As a result, a major compaction does not start exactly every 7 days but at a random time within a window around the configured interval. If major compactions are causing disruptions in your environment, you can disable time-based major compactions by setting the hbase.hregion.majorcompaction property to 0. In this case, user-requested and size-based major compactions still run.
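
    The arithmetic behind the interval can be sketched as follows. The sketch assumes that the jitter is a fraction of the interval, as in the Apache HBase default configuration, and the function name is hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def major_compaction_window_days(period_ms: int, jitter: float):
    """Return (earliest, latest) days until the next time-based major compaction.

    Returns None when period_ms is 0, meaning time-based major compactions are disabled.
    """
    if period_ms < 0 or not 0.0 <= jitter <= 1.0:
        raise ValueError("period_ms must be >= 0 and jitter must be within [0, 1]")
    if period_ms == 0:
        return None
    spread_ms = period_ms * jitter  # assumed: jitter is a fraction of the period
    return ((period_ms - spread_ms) / MS_PER_DAY, (period_ms + spread_ms) / MS_PER_DAY)


# Default interval of 604800000 ms (7 days) with an assumed jitter of 0.5.
print(major_compaction_window_days(604_800_000, 0.5))
# An interval of 0 disables time-based major compactions.
print(major_compaction_window_days(0, 0.5))
```

    Under these assumptions, a smaller jitter keeps compactions closer to the configured interval, and a larger one spreads them out more.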

  • What can cause the compacted store files to be kept open for a long time?

    Answer: This typically happens with read-heavy workloads. When many active scanners still reference the HFiles, they prevent the already compacted store files from being cleaned up.

  • What should I do when a snapshot is failing?

    Answer: A snapshot can fail if the table is in an inconsistent state before the snapshot is taken, or if a region gets deleted while the snapshot is in progress. If you are getting a “Skipping region…” warning, check that the regions exist in the hbase:meta table. If they do not exist in the table, check when the CatalogJanitor last ran. It is normal for regions to be cleaned up slowly. However, if the regions do exist in the hbase:meta table but have no family directory, the table needs HBase surgery and hbck2 has to be used to repair it. You can also use hbck2 to check for any table inconsistencies.
    If you see any empty regions in the hbase:meta table, check that the CatalogJanitor is removing them, and ensure that the CatalogJanitor has run in the recent past.
    Check the master log from the point where it reads the existing ProcedureV2 WALs; it prints out any corrupt PIDs that need to be bypassed using hbck2. You should be able to bypass the corrupt PIDs. Then check that the MasterProc WALs in the filesystem drain out, and ensure that the CatalogJanitor is running and cleaning up the empty regions.

  • Is there a way to restrict a particular user's access to the hbase:meta table?

    Answer: Quotas cannot be used for this, because HBase has a check that exempts system tables from quotas. If other operations are impacted by a noisy user, increase the number of meta handlers using the hbase.regionserver.metahandler.count property.

  • What should I check if I have an HBase replication issue with SASL authentication?

    Answer: If you are having an HBase replication issue with SASL authentication, check the following:

    - ACLs
    - jaas.conf
    - Whether the source and target clusters are in the same Kerberos realm. If they are, there is nothing else to check for this item. If they are not, check that cross-realm trust is set up in both directions.
    - The settings for allowed encryption types (permitted_enctypes) in the /etc/krb5.conf configuration file

    Debugging:
    - Set DEBUG logging for the ZooKeeper client, or for anything under org.apache.zookeeper
    - Enable Kerberos debug logs by adding the following option to the HBase server Java options: -Dsun.security.krb5.debug=true

  • What should I do when a GET command fails with a “file does not exist” error and the lingering reference file points to a path that does not exist?

    Answer: Sideline the reference file and move the daughter region, the one from which you removed the reference, to another RegionServer.

  • How should the HBASE-25166 issue be handled by customers who use the MOB feature?

    Answer:
    As a workaround, disable the MOB compaction chore thread by setting the hbase.mob.compaction.chore.period property to 0. Do not run the MOB cleaner because it can cause data loss on MOB tables (TSB-506). You have to run major compaction for the table manually. Upgrading to Cloudera Runtime 7.1.7 or a higher version is recommended.
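
    As an illustration of the workaround, the property can be rendered in the name/value XML form that hbase-site.xml uses. This is a sketch with a hypothetical helper name. It only produces the property block and assumes that you apply it through your usual configuration mechanism, for example an hbase-site.xml advanced configuration snippet in Cloudera Manager.

```python
import xml.etree.ElementTree as ET


def hbase_site_property(name: str, value: str) -> str:
    """Render a single <property> element in the hbase-site.xml name/value layout."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# A period of 0 disables the MOB compaction chore, as described above.
snippet = hbase_site_property("hbase.mob.compaction.chore.period", "0")
print(snippet)
```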