Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant.

Kite Datasets SDK (HBase) - Datasets.load() and DatasetReader.newReader() not returning full dataset

Explorer

Hello,

 

I have several HBase tables defined using Avro schemas and I am trying to write a simple Java function to return the entire dataset for a given table (all records).

 

I'm doing something like this (assume the "Customer avro" schema has been defined):

 

RandomAccessDataset<Customer> customers = Datasets.load(
    PropertyManager.getDatasetURI(HBaseHelper.CUSTOMER), Customer.class);
DatasetReader<Customer> reader = customers.newReader();
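For context, here is a minimal sketch of the full read path I am attempting, assuming the same Customer class, PropertyManager, and HBaseHelper constants as above (these helpers are from my own code, not the Kite API). Kite's DatasetReader is Iterable and Closeable, so it can be used in a for-each loop and should be closed when done:

```java
// Sketch only: requires the Kite SDK on the classpath and a reachable
// HBase cluster, so it is not runnable standalone.
RandomAccessDataset<Customer> customers = Datasets.load(
    PropertyManager.getDatasetURI(HBaseHelper.CUSTOMER), Customer.class);

DatasetReader<Customer> reader = customers.newReader();
try {
    int count = 0;
    for (Customer c : reader) {
        count++;
    }
    // I would expect this to print ~15,000, but I only see ~20.
    System.out.println("Records read: " + count);
} finally {
    reader.close();
}
```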

 

According to the API docs, this should return the entire unfiltered dataset. The URI also uses the "dataset:" scheme, so I should be getting the full Dataset rather than a View.

 

What I'm seeing is that only a small subset of the table is returned when I iterate over the reader: roughly 20 out of the 15,000 records actually in the table, i.e. about 0.1%.

 

Please advise on how to get all records, and whether this is a defect in Kite. Using the native HBase API directly is not an option, because the Kite record encoding is difficult to work with outside of Kite.

 

EDIT: we do not see this issue on a single-node HBase instance, only on an HBase cluster with Kerberos authentication.

 

1 ACCEPTED SOLUTION

Explorer

So ... after a long hiatus. Turns out this is actually https://issues.apache.org/jira/browse/HBASE-13262

 

I was using hbase-client 0.96 with HBase 1.0.0 (CDH 5.5), and we had tables storing large XML payloads, which caused the bug to manifest when hbase.client.scanner.caching was set to a high value.

 

There are multiple ways to fix this:

 

  1. Use hbase-client 0.98+, if you can afford to upgrade without impact.
  2. Lower the value of hbase.client.scanner.caching in Cloudera Manager (this is what I ended up doing).
  3. Programmatically, use Scan.setCaching(int) and/or Scan.setMaxResultSize(long) to avoid the region skipping.
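For anyone without Cloudera Manager access, option 2 can also be applied as a client-side override in hbase-site.xml. The value below is illustrative only; tune it for your payload sizes:

```xml
<!-- Client-side override for hbase.client.scanner.caching.
     100 is an illustrative value, not a recommendation. -->
<property>
  <name>hbase.client.scanner.caching</name>
  <value>100</value>
</property>
```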

 

 

