Cloudera Employee

Bulk loading is the process of preparing and loading HFiles (HBase’s own file format) directly into the RegionServers, thus bypassing the write path. 

This obviates many issues, such as:

  • MemStores getting full
  • WALs getting bigger
  • Compaction and flush queues getting swollen
  • Garbage collection getting out of control because inserts range in the megabytes
  • Latency shooting up while importing data

HBase already has a replication functionality called Cluster Replication (see official documentation here). Why isn’t that enough if you want to replicate the bulk loaded data as well?


The “standard” replication uses a source-push methodology. When the active cluster (source) receives an edit to a column family with replication enabled, that edit is propagated to all destination clusters using the WAL for that column family on the RegionServer managing the relevant region.


However, in the case of a bulk load, only the event (the fact that a bulk load is happening) is captured in the WAL, with a reference to the HFile. The data being loaded is not recorded.

 

By enabling BulkLoad Replication, the active HBase RegionServer will also send these WAL entries to the peer cluster. The peer cluster will read these WAL entries and copy the HFiles from the active source cluster into the peer cluster’s staging directory. From there, it is just a standard bulk load.

 

For more information on bulk loading, you can refer to a previous blog post or see the Apache documentation. A good functional description of bulk loading can be found attached to the HBASE Jira.

 

Bulk load Replication is supported from CDH 6.3 (including CDP releases).

HFile copy mechanism and setting it up

In order to copy the HFiles from the source cluster to the peer cluster, the peer cluster will need the active cluster’s FileSystem client configurations. FileSystem (FS) could be HDFS or any other file system supported by HBase.

There could be multiple active clusters, so the peer needs to be able to identify the client configurations. Each active cluster must configure a unique replication cluster id. This ID will be sent to the peer cluster as part of a replication request. 

The default implementation for getting the source cluster’s FileSystem client configurations (a custom implementation is out of the scope of this post) requires each peer cluster to have a replication configuration directory. This can be configured using the hbase.replication.conf.dir property. The directory will have a sub-directory for each active source cluster. The sub-directory name must be the same as the active cluster’s unique replication ID. So the path to the respective FS client configurations will be hbase.replication.conf.dir/hbase.replication.cluster.id.
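
The path convention above can be checked with a small script. This is only a minimal sketch and not part of HBase: the function names replication_conf_path and check_replication_conf are hypothetical, and the list of expected files is assumed from the files this article copies in the setup steps.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with HBase) illustrating the layout
# <hbase.replication.conf.dir>/<hbase.replication.cluster.id>.

# Print the directory where the peer expects a source cluster's
# FS client configuration files.
replication_conf_path() {
  local conf_dir=$1 cluster_id=$2
  # Strip one trailing slash so the joined path has no double slash.
  printf '%s/%s\n' "${conf_dir%/}" "$cluster_id"
}

# Return 0 if every expected client configuration file exists in that
# directory; otherwise report each missing file on stderr and return 1.
# The file list is an assumption based on the copy step in this article.
check_replication_conf() {
  local dir missing=0 f
  dir=$(replication_conf_path "$1" "$2")
  for f in core-site.xml hdfs-site.xml hbase-site.xml; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, after sourcing the script on a peer RegionServer, check_replication_conf /opt/cloudera/fs_conf source would verify the directory used in the examples of this article.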

 

Note: If there are any changes in FS client configurations in the replication configuration directory, all the RegionServers in the peer cluster must be restarted.

Replication HFile Cleaner

If a compaction, merge, or split happens, the HFile is moved to the archive folder, where the HFile cleaner periodically removes it. To prevent this from happening before replication completes, the replication module keeps the paths of these bulk loaded HFiles in ZooKeeper for each peer.

  • hbase
    • hfile-refs
      • peer_1
        • hfile_name_1
        • hfile_name_2
        • [...]
        • hfile_name_N
      • peer_m
        • hfile_name_1
        • hfile_name_2
        • [...]
        • hfile_name_N
    • replication

The HFile Cleaner will not clean up an HFile as long as its entry is found in ZooKeeper. Once the HFile has been replicated successfully, the replication module deletes the ZooKeeper entry for that particular peer.

Configuring BulkLoad Replication on CDP

Note: If you have Kerberos setup, please check this page on how to set up cross-realm authentication.

Configuring the clusters

  1. Using Cloudera Manager, navigate on the source cluster to HBase > Configuration.
  2. Search for Advanced Configuration Snippet to find the HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml and add the following configuration properties:
    • Name: hbase.replication.bulkload.enabled
      Value: true
    • Name: hbase.replication.cluster.id
      Value: short source cluster ID, for example: source
  3. Click Save Changes.
  4. Click the Stale Service Restart icon that is next to the Service to invoke the cluster restart wizard to restart Stale Services.
  5. Using Cloudera Manager, navigate on the sink cluster to HBase > Configuration.
  6. Search for Advanced Configuration Snippet again to find the HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml and add the following configuration property:
    • Name: hbase.replication.conf.dir
      Value: path to the source cluster’s copied configuration directory, for example: /opt/cloudera/fs_conf
  7. Create the directory on the sink cluster that is defined by the hbase.replication.conf.dir property.
  8. Copy the source cluster’s HBase client configuration files (core-site.xml, hdfs-site.xml, and hbase-site.xml) to the sink cluster.
  9. The configuration files should be under /run/cloudera-scm-agent/process/ by default, in the HBase master service directory. You can also search for the configuration:

 

find ./ -name 'hbase-site.xml' -print0 | xargs -0 grep 'hbase.replication.bulkload.enabled'

 

  • Copy over to replication peer:
    scp core-site.xml hdfs-site.xml hbase-site.xml root@sink.cluster.com:/opt/cloudera/fs_conf/source/
    It is important to copy the configuration files to the correct path defined by the two configuration parameters:
    hbase.replication.conf.dir/hbase.replication.cluster.id
    For example: /opt/cloudera/fs_conf/source/
  • Repeat step 8 for every RegionServer.
  • Ensure that the correct file permissions are set to the HBase user.
    For example:
    chown -R hbase:hbase /opt/cloudera/fs_conf/source
  • Click Save Changes.
  • Click the Stale Service Restart icon that is next to the service to invoke the cluster restart wizard to restart Stale Services.
  • Add the peer to the source cluster as you would in the case of normal replication.
    • For example, in the HBase shell:
      add_peer '1', CLUSTER_KEY => 'cluster-sink-1.company.com:2181:/hbase'
  • Enable replication on a table or a column family basis:
    • Table:
      enable_table_replication 'IntegrationTestBulkLoad'
    • Column family: 
      disable 'IntegrationTestBulkLoad'
      alter 'IntegrationTestBulkLoad', {NAME => 'D', REPLICATION_SCOPE => '1'}
      enable 'IntegrationTestBulkLoad'
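
The copy and ownership steps above have to be repeated for every RegionServer of the sink cluster, so it can help to generate the commands first and review them. The following is a dry-run sketch under stated assumptions: print_copy_commands is a hypothetical helper, the root login and paths are taken from the examples in this article, and the mkdir line is an addition that assumes the target directory may not exist yet. It only prints commands and executes nothing.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print (do not execute) the copy and ownership commands
# for each sink RegionServer. The root login and the file names follow the
# examples in this article; adjust them to your environment.

print_copy_commands() {
  local conf_dir=$1 cluster_id=$2
  shift 2
  # Target is <hbase.replication.conf.dir>/<hbase.replication.cluster.id>.
  local target="${conf_dir%/}/$cluster_id" host
  for host in "$@"; do
    echo "ssh root@$host mkdir -p $target"
    echo "scp core-site.xml hdfs-site.xml hbase-site.xml root@$host:$target/"
    echo "ssh root@$host chown -R hbase:hbase $target"
  done
}
```

For example, print_copy_commands /opt/cloudera/fs_conf source sink.cluster.com prints three commands for that host, which can be inspected and then run from the directory that holds the source cluster's configuration files.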

You should be all set now. 

 

If HBase encounters a BulkLoad entry for the associated tables in the WAL, it will initiate the bulk load replication by reading the configuration of the source cluster. There will be some INFO-level log messages like:

  • “Loading source cluster FS client conf for cluster replicationClusterId”
  • “Loading source cluster replicationClusterId file system configurations from xml files under directory hbase.replication.conf.dir”

If there were no errors with the configuration files and connection, HBase copies over the HFile from the source cluster’s FS to local FS and you can see that in the FileSystem log. 


There will be no trace of this in the HBase logs unless the HFile has been archived (“Failed to copy hfile from sourceHFilePath to localHFilePath. Trying to copy from hfile archive directory.”).


If the HFile has been archived, HBase will look into the archive directory. If it’s also a miss, there will be an error log (“Failed to copy hfile from sourceHFilePath to localHFilePath. Hence ignoring this hfile from replication.”) and the bulk load replication for this HFile will be skipped.
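
To see how often these two situations occurred, a RegionServer log on the peer can be scanned for the message fragments quoted above. This is a minimal sketch: summarize_bulkload_log is a hypothetical helper, the log file path is whatever your deployment uses, and the matching relies only on the message text quoted in this article, which may differ between HBase versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count bulk load replication outcomes in a log file
# by matching the message fragments quoted in this article.
# Assumes the log file exists and is readable.

summarize_bulkload_log() {
  local log=$1 fallback skipped
  # HFile was not at its original path, so the archive directory was tried.
  fallback=$(grep -c 'Trying to copy from hfile archive directory' "$log")
  # HFile was not found in the archive either, so it was skipped.
  skipped=$(grep -c 'Hence ignoring this hfile from replication' "$log")
  echo "archive_fallbacks=$fallback skipped_hfiles=$skipped"
}
```

A non-zero skipped_hfiles count means some bulk loaded HFiles were never replicated and the affected data has to be loaded into the peer by other means.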

Configurations

  • Name: hbase.replication.bulkload.enabled
    Type: Boolean
    Default: false
    Description: Enable replication for bulk loaded data.

  • Name: hbase.replication.cluster.id
    Type: String
    Default: -
    Description: Mandatory if replication for bulk loaded data is enabled. This is a unique ID to identify the source HBase cluster. It has to be defined in the source cluster.

  • Name: hbase.replication.conf.dir
    Type: String
    Default: HBase configuration directory (HBASE_CONF_DIR)
    Description: Represents the directory where the FS client configurations of all active clusters are defined in their respective subfolders. This configuration has to be defined in the peer cluster.

  • Name: hbase.replication.source.fs.conf.provider
    Type: String
    Default: org.apache.hadoop.hbase.replication.regionserver.DefaultSourceFSConfigurationProvider
    Description: Represents the class which provides the source cluster’s file system configuration to the peer cluster. This configuration has to be defined in the peer cluster. Leave as is unless you want to implement a custom configuration provider.
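
To confirm which of these properties are actually set in a given hbase-site.xml, the values can be extracted with a short script. This is a rough sketch and not a real XML parser: get_hbase_prop is a hypothetical helper, and it assumes the usual Hadoop-style layout in which each <value> element follows its <name> element inside the same <property> block.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of one property from a Hadoop-style
# configuration file. Assumes <value> follows <name> within the same
# <property> element; prints nothing if the property is absent.

get_hbase_prop() {
  local file=$1 name=$2
  awk -v key="$name" '
    # Remember that the requested property name has been seen.
    index($0, "<name>" key "</name>") { found = 1 }
    # Print the first <value> element at or after that name, then stop.
    found && match($0, /<value>[^<]*<\/value>/) {
      # Skip the 7 characters of <value> and drop the 8 of </value>.
      print substr($0, RSTART + 7, RLENGTH - 15)
      exit
    }
  ' "$file"
}
```

For example, get_hbase_prop hbase-site.xml hbase.replication.cluster.id run against the source cluster's file should print the ID that names the sub-directory on the peer.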

 
