Prefix scans are a very common operation in HBase. For example, when reading an HBase table, we may use the following scan API:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.PrefixFilter

// Keep only rows whose key starts with the given prefix
val prefixFilter = new PrefixFilter(prefix)
val scan: Scan = new Scan()
scan.setFilter(prefixFilter)

However, the code above can be very slow when scanning a large HBase table. The reason: a PrefixFilter on its own does not tell the scanner where to begin, so we need to set the start row as well. Without a proper start row, the scan begins at the table's very first region and wastes time reading rows that can never match the prefix.
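To see why the start row matters, here is a toy illustration in plain Java (not HBase code; the class, method names, and row keys are made up for this example). Row keys are stored in sorted order, so a scan that seeks to the prefix touches only the matching range, while a filter-only scan walks forward from the first row of the table:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PrefixSeekDemo {
    // Rows touched by a filter-only scan: walk from the first row of the
    // table, stopping only once we have passed the prefix region.
    static int filterOnlyTouches(List<String> sortedRows, String prefix) {
        int touched = 0;
        for (String row : sortedRows) {
            touched++;
            if (row.compareTo(prefix) >= 0 && !row.startsWith(prefix)) break;
        }
        return touched;
    }

    // Rows touched when the scan seeks to startRow = prefix first.
    static int seekTouches(List<String> sortedRows, String prefix) {
        int i = Collections.binarySearch(sortedRows, prefix);
        if (i < 0) i = -i - 1; // insertion point of the prefix
        int touched = 0;
        while (i < sortedRows.size() && sortedRows.get(i).startsWith(prefix)) {
            touched++;
            i++;
        }
        return touched;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "user001", "user002", "video-a", "video-b", "video-c", "web-x");
        System.out.println(filterOnlyTouches(rows, "video-")); // 6
        System.out.println(seekTouches(rows, "video-"));       // 3
    }
}
```

On a real table the difference is far larger: the filter-only scan may read through many regions before reaching the first matching row.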

The recommended approach is to use setRowPrefixFilter(byte[] rowPrefix). From its source code below, we can see that it sets both the start row and the stop row before the table scan runs:

public Scan setRowPrefixFilter(byte[] rowPrefix) {
  if (rowPrefix == null) {
    setStartRow(HConstants.EMPTY_START_ROW);
    setStopRow(HConstants.EMPTY_END_ROW);
  } else {
    this.setStartRow(rowPrefix);
    this.setStopRow(calculateTheClosestNextRowKeyForPrefix(rowPrefix));
  }
  return this;
}
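The stop row comes from calculateTheClosestNextRowKeyForPrefix, which produces the smallest row key strictly greater than every key sharing the prefix. Below is a minimal Java sketch of that idea (my own illustration, not HBase's exact code; the class and method names here are made up): drop any trailing 0xFF bytes, then increment the last remaining byte. If the prefix is all 0xFF bytes, no larger key exists, so an empty stop row (scan to the end of the table) is returned.

```java
import java.util.Arrays;

public class PrefixStopRow {
    // Smallest row key strictly greater than every key with this prefix.
    static byte[] closestNextRowKeyForPrefix(byte[] prefix) {
        int offset = prefix.length;
        // Trailing 0xFF bytes cannot be incremented; strip them first.
        while (offset > 0 && prefix[offset - 1] == (byte) 0xFF) {
            offset--;
        }
        if (offset == 0) {
            // Prefix is all 0xFF: behave like HConstants.EMPTY_END_ROW.
            return new byte[0];
        }
        byte[] stopRow = Arrays.copyOf(prefix, offset);
        stopRow[offset - 1]++; // bump the last byte to just past the prefix
        return stopRow;
    }

    public static void main(String[] args) {
        // "abc" -> "abd": [97, 98, 99] becomes [97, 98, 100]
        byte[] stop = closestNextRowKeyForPrefix(new byte[] {'a', 'b', 'c'});
        System.out.println(Arrays.toString(stop)); // [97, 98, 100]
    }
}
```

With start row = prefix and stop row = this value, the scan covers exactly the rows that share the prefix, with no PrefixFilter needed at all.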

In addition, if you want to load an HBase table into Spark, you can also use the Spark-HBase connector, which lets Spark access HBase tables as an external data source. Its buildScan() method performs the HBase table scan and returns the result as an RDD; see the connector's source code for details.

Thanks to Weiqing Yang and Ted Yu for the kind help.

Last update: 09-06-2016