Prefix scans are a very common operation in HBase. For example, when reading an HBase table, we may use the following scan API:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.PrefixFilter

// Keep only rows whose key starts with the given prefix
val prefixFilter = new PrefixFilter(prefix)
val scan: Scan = new Scan()
scan.setFilter(prefixFilter)

However, the code above can be very slow when scanning a large HBase table. The reason: a PrefixFilter on its own does not tell the scanner where to begin, so we need to set the start row as well. Without a proper start row, the scan begins at the table's very first region and wastes time reading rows that can never match the prefix.
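To see why the start row matters, here is a toy illustration in plain Java (not HBase code; the class, method names, and row keys are made up for this example). Row keys are stored in sorted order, so a scan that seeks to the prefix touches only the matching range, while a filter-only scan walks forward from the first row of the table:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class PrefixSeekDemo {
    // Rows touched by a filter-only scan: walk from the first row of the
    // table, stopping only once we have passed the prefix region.
    static int filterOnlyTouches(List<String> sortedRows, String prefix) {
        int touched = 0;
        for (String row : sortedRows) {
            touched++;
            if (row.compareTo(prefix) >= 0 && !row.startsWith(prefix)) break;
        }
        return touched;
    }

    // Rows touched when the scan seeks to startRow = prefix first.
    static int seekTouches(List<String> sortedRows, String prefix) {
        int i = Collections.binarySearch(sortedRows, prefix);
        if (i < 0) i = -i - 1; // insertion point of the prefix
        int touched = 0;
        while (i < sortedRows.size() && sortedRows.get(i).startsWith(prefix)) {
            touched++;
            i++;
        }
        return touched;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "user001", "user002", "video-a", "video-b", "video-c", "web-x");
        System.out.println(filterOnlyTouches(rows, "video-")); // 6
        System.out.println(seekTouches(rows, "video-"));       // 3
    }
}
```

On a real table the difference is far larger: the filter-only scan may read through many regions before reaching the first matching row.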

The recommended approach is to use setRowPrefixFilter(byte[] rowPrefix). From its source code below, we can see that it sets both the start row and the stop row before the table scan runs:

public Scan setRowPrefixFilter(byte[] rowPrefix) {
  if (rowPrefix == null) {
    setStartRow(HConstants.EMPTY_START_ROW);
    setStopRow(HConstants.EMPTY_END_ROW);
  } else {
    this.setStartRow(rowPrefix);
    this.setStopRow(calculateTheClosestNextRowKeyForPrefix(rowPrefix));
  }
  return this;
}
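The stop row comes from calculateTheClosestNextRowKeyForPrefix, which produces the smallest row key strictly greater than every key sharing the prefix. Below is a minimal Java sketch of that idea (my own illustration, not HBase's exact code; the class and method names here are made up): drop any trailing 0xFF bytes, then increment the last remaining byte. If the prefix is all 0xFF bytes, no larger key exists, so an empty stop row (scan to the end of the table) is returned.

```java
import java.util.Arrays;

public class PrefixStopRow {
    // Smallest row key strictly greater than every key with this prefix.
    static byte[] closestNextRowKeyForPrefix(byte[] prefix) {
        int offset = prefix.length;
        // Trailing 0xFF bytes cannot be incremented; strip them first.
        while (offset > 0 && prefix[offset - 1] == (byte) 0xFF) {
            offset--;
        }
        if (offset == 0) {
            // Prefix is all 0xFF: behave like HConstants.EMPTY_END_ROW.
            return new byte[0];
        }
        byte[] stopRow = Arrays.copyOf(prefix, offset);
        stopRow[offset - 1]++; // bump the last byte to just past the prefix
        return stopRow;
    }

    public static void main(String[] args) {
        // "abc" -> "abd": [97, 98, 99] becomes [97, 98, 100]
        byte[] stop = closestNextRowKeyForPrefix(new byte[] {'a', 'b', 'c'});
        System.out.println(Arrays.toString(stop)); // [97, 98, 100]
    }
}
```

With start row = prefix and stop row = this value, the scan covers exactly the rows that share the prefix, with no PrefixFilter needed at all.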

In addition, if you want to load an HBase table into Spark, you can also use the Spark-HBase connector, which lets Spark access HBase tables as an external data source. Its buildScan() method performs the HBase table scan and returns the result as an RDD; see the connector's source code for details.

Thanks to Weiqing Yang and Ted Yu for the kind help.

Last update: 09-06-2016