Kite Datasets SDK (HBase) - Datasets.load() and DatasetReader.newReader() not returning full dataset

jastang — Fri, 16 Sep 2022 10:39:59 GMT

Hello,

I have several HBase tables defined using Avro schemas and I am trying to write a simple Java function to return the entire dataset for a given table (all records).

I'm doing something like this (assume the Avro schema for "Customer" has already been defined):

RandomAccessDataset<Customer> customers = Datasets.load(PropertyManager.getDatasetURI(HBaseHelper.CUSTOMER), Customer.class);

DatasetReader<Customer> reader = customers.newReader();

According to the API docs, this should return the entire unfiltered dataset. The URI also uses the "dataset:" scheme, so it is not loading a View.

What I'm seeing is that only a very small subset of the table is returned when I iterate over the reader: ~20 out of the 15000 records that are actually in the table, or roughly 0.13%.
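For anyone wanting to reproduce the count, here is a minimal sketch of a counting helper. It only uses java.util, and it assumes (as I understand the Kite API) that DatasetReader<E> implements Iterator<E>, so the reader from newReader() can be passed in directly. The class and method names are hypothetical, not part of Kite.

```java
import java.util.Iterator;

// Hypothetical helper, not part of Kite: counts whatever an iterator yields.
final class RecordCounter {
    private RecordCounter() {}

    // Drains the iterator and returns how many records it produced.
    static <T> long count(Iterator<T> records) {
        long seen = 0;
        while (records.hasNext()) {
            records.next();
            seen++;
        }
        return seen;
    }
}
```

Calling RecordCounter.count(reader) and comparing the result with the row count reported by the HBase shell should show the gap; the reader still needs to be closed afterwards.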

Please advise on how to get all records, and whether this is a defect in Kite. Using the native HBase API is not an option, because the Kite encoding is difficult to work with outside of Kite.

EDIT: we do not seem to see this issue on a single-node HBase, only on an HBase cluster with Kerberos auth.

Re: Kite Datasets SDK (HBase) - Datasets.load() and DatasetReader.newReader() not returning full dat

jastang — Fri, 09 Dec 2016 04:13:47 GMT

So ... after a long hiatus. Turns out this is actually https://issues.apache.org/jira/browse/HBASE-13262

I was using hbase-client 0.96 with HBase 1.0.0 (CDH 5.5), and we had tables housing large XML payloads, which forced the bug to manifest when hbase.client.scanner.caching was set to a high value.

There are multiple ways to fix this:

- Use hbase-client 0.98+, if you can afford to upgrade without impact.
- Lower the value of hbase.client.scanner.caching in CM (this is what I ended up doing).
- Programmatically, use Scan.setCaching(int) and/or Scan.setMaxResultSize(long) to avoid the region skipping.
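As a rough illustration of what "lower" could mean for the caching value, here is a sketch that picks a row count per scanner call so that the batch stays well under the maximum result size. The heuristic, the headroom factor, and the class and method names are my own assumptions, not something taken from the HBase docs or the JIRA; the idea is only that large rows multiplied by a high caching value is what pushed us into the bug.

```java
// Hypothetical helper: the sizing rule here is an assumption, not an HBase API.
final class ScannerCachingEstimate {
    private ScannerCachingEstimate() {}

    // Returns a rows-per-call value such that rows * avgRowBytes stays within
    // headroom * maxResultBytes. Never returns less than 1.
    static int safeCaching(long maxResultBytes, long avgRowBytes, double headroom) {
        if (maxResultBytes <= 0 || avgRowBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        if (headroom <= 0.0 || headroom > 1.0) {
            throw new IllegalArgumentException("headroom must be in (0, 1]");
        }
        long budgetBytes = (long) (maxResultBytes * headroom);
        long rows = budgetBytes / avgRowBytes;
        return (int) Math.max(1L, Math.min(rows, (long) Integer.MAX_VALUE));
    }
}
```

The returned value is what would go into hbase.client.scanner.caching or Scan.setCaching(int). For example, with a 2 MB result limit, 512 KB average rows, and 0.5 headroom, this gives a caching value of 2.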
