New Contributor
Posts: 3
Registered: ‎09-19-2016
Accepted Solution

Kite Datasets SDK (HBase) - Datasets.load() and DatasetReader.newReader() not returning full dataset


Hello,

 

I have several HBase tables defined using Avro schemas and I am trying to write a simple Java function to return the entire dataset for a given table (all records).

 

I'm doing something like this (assume the "Customer avro" schema has been defined):

 

DatasetReader<Customer> reader = null;
RandomAccessDataset<Customer> customers = Datasets.load(PropertyManager.getDatasetURI(HBaseHelper.CUSTOMER), Customer.class);

reader = customers.newReader();

 

According to the API docs, this should return the entire unfiltered dataset. The URI method also uses the "dataset:" scheme, so it is not returning a View.

 

What I'm seeing is that only a very small subset of the table is returned when I iterate over the reader: about 20 of the 15000 records that are actually in the table, or roughly 0.13%.

 

Please advise on how to get all records, and on whether this is a defect in Kite. Using the native HBase API is not an option, because the Kite encoding is challenging to work with outside of Kite.

 

EDIT: we do not seem to see this issue on a single-node HBase, only on an HBase cluster with Kerberos auth.

 


Re: Kite Datasets SDK (HBase) - Datasets.load() and DatasetReader.newReader() not returning full dat

So, after a long hiatus: it turns out this is actually https://issues.apache.org/jira/browse/HBASE-13262

 

I was using hbase-client 0.96 with HBase 1.0.0 (CDH 5.5), and we had tables housing large XML payloads, which forced the bug to manifest when hbase.client.scanner.caching was set to a high value.

 

There are multiple ways to fix this:

 

  1. Use hbase-client 0.98+, if you can afford to upgrade without impact
  2. Lower the value of hbase.client.scanner.caching in CM (this was what I ended up doing)
  3. Programmatically, use Scan.setCaching(int) and/or Scan.setMaxResultSize() to avoid the region skipping.
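
For options 2 and 3, one rough way to pick a lower caching value is to keep the estimated size of a single scanner batch under the result-size limit. The sketch below is only an illustration: the class and method names are hypothetical (they are not part of HBase or Kite), the average row size is something you would have to measure yourself, and it assumes the skipping is triggered when a batch of large rows runs into the size limit before the caching row count is reached.

```java
// Hypothetical helper for illustration only; not part of HBase or Kite.
class ScannerCachingEstimate {

    /**
     * Returns the largest number of rows per scanner RPC whose estimated
     * payload (rows * avgRowBytes) stays within maxResultSizeBytes.
     * Always returns at least 1 so a scan can still make progress.
     */
    static int rowsPerRpc(long maxResultSizeBytes, long avgRowBytes) {
        if (maxResultSizeBytes <= 0 || avgRowBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        long rows = maxResultSizeBytes / avgRowBytes;
        // Clamp into the int range accepted by Scan.setCaching(int).
        return (int) Math.max(1L, Math.min(rows, (long) Integer.MAX_VALUE));
    }
}
```

For example, with an assumed 2 MiB limit and XML payloads averaging 512 KiB, ScannerCachingEstimate.rowsPerRpc(2L * 1024 * 1024, 512L * 1024) gives 4. That number could then be used as the hbase.client.scanner.caching value, or passed to Scan.setCaching(int) where you control the Scan directly.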

 

 
