New Contributor
Posts: 3
Registered: ‎09-19-2016
Accepted Solution

Kite Datasets SDK (HBase) - Datasets.load() and DatasetReader.newReader() not returning full dataset




I have several HBase tables defined using Avro schemas and I am trying to write a simple Java function to return the entire dataset for a given table (all records).


I'm doing something like this (assume the "Customer" Avro schema has been defined and compiled):


RandomAccessDataset<Customer> customers = Datasets.load(
    PropertyManager.getDatasetURI(HBaseHelper.CUSTOMER), Customer.class);
DatasetReader<Customer> reader = customers.newReader();


According to the API docs, this should return the entire unfiltered dataset. The URI also uses the "dataset:" scheme, so I am not getting a View.
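For reference, here is a rounded-out sketch of the full read loop I am running. The dataset URI and ZooKeeper hosts below are illustrative placeholders (my real URI comes from PropertyManager.getDatasetURI), and I am using the GenericRecord form of Datasets.load so the snippet does not depend on the generated Customer class:

```java
import org.apache.avro.generic.GenericRecord;
import org.kitesdk.data.Dataset;
import org.kitesdk.data.DatasetReader;
import org.kitesdk.data.Datasets;

public class ReadAllRecords {
    public static void main(String[] args) {
        // Illustrative URI; substitute your own "dataset:hbase:" URI here
        Dataset<GenericRecord> customers =
            Datasets.load("dataset:hbase:zk-host:2181/customer");

        long count = 0;
        // DatasetReader is Closeable, so try-with-resources releases the scanner
        try (DatasetReader<GenericRecord> reader = customers.newReader()) {
            for (GenericRecord record : reader) {
                count++;  // process each record here
            }
        }
        System.out.println("records read: " + count);
    }
}
```

With no filter applied, the count printed at the end should match the row count of the underlying HBase table, which is how I noticed the discrepancy.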


What I'm seeing is that only a small subset of the table is actually returned when I iterate over the reader: roughly 20 of the 15,000 records that are actually in the table, barely 0.1%.


Please advise on how to get all records, and whether this is a defect in Kite. Using the native HBase API directly is not an option because of the Kite encoding, which is difficult to work with outside of Kite.


EDIT: we do not see this issue on a single-node HBase, only on an HBase cluster with Kerberos auth.



Re: Kite Datasets SDK (HBase) - Datasets.load() and DatasetReader.newReader() not returning full dataset

So ... after a long hiatus: it turns out this was actually a bug in the hbase-client library.


I was using hbase-client 0.96 with HBase 1.0.0 (CDH 5.5), and we had tables housing large XML payloads, which caused the bug to manifest whenever hbase.client.scanner.caching was set to a high value.


There are multiple ways to fix this:


  1. Upgrade to hbase-client 0.98+, if you can do so without impact.
  2. Lower the value of hbase.client.scanner.caching in Cloudera Manager (this is what I ended up doing).
  3. Programmatically, use Scan.setCaching(int) and/or Scan.setMaxResultSize(long) to avoid the region skipping.
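For option 3, here is a minimal sketch of those two Scan settings with the plain HBase 1.0 client API. This only applies when your code builds the Scan itself (Kite constructs its own Scan internally, which is why option 2 was the right fix for me); the table name and the caching/size values below are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class BoundedScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // "customer" is a placeholder table name
             Table table = conn.getTable(TableName.valueOf("customer"))) {

            Scan scan = new Scan();
            scan.setCaching(100);                     // fewer rows fetched per RPC
            scan.setMaxResultSize(2L * 1024 * 1024);  // cap bytes returned per RPC

            long count = 0;
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    count++;  // process each row here
                }
            }
            System.out.println("rows scanned: " + count);
        }
    }
}
```

With large rows (like our XML payloads), keeping caching low or the per-RPC result size bounded prevents a single scanner RPC from getting so big that the buggy 0.96 client mishandles it.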