09-19-2016 01:09 PM - edited 09-19-2016 02:43 PM
I have several HBase tables defined using Avro schemas and I am trying to write a simple Java function to return the entire dataset for a given table (all records).
I'm doing something like this (assume the "Customer avro" schema has been defined):
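import org.kitesdk.data.DatasetReader;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.RandomAccessDataset;
// Load the whole dataset ("dataset:" URI, not a view) and open a reader over it.
RandomAccessDataset<Customer> customers = Datasets.load(PropertyManager.getDatasetURI(HBaseHelper.CUSTOMER), Customer.class);
DatasetReader<Customer> reader = customers.newReader();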
According to the API docs, this should return the entire unfiltered dataset. The URI method also uses the "dataset:" scheme, so it is not getting a View.
What I'm seeing is that only a very small subset of the table is returned when I iterate over the reader: about 20 of the 15,000 records that are actually in the table, or roughly 0.13%.
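For reference, a minimal sketch of how the record count can be checked. It relies only on java.util.Iterator, on the assumption that the Kite DatasetReader can be consumed as a plain Iterator. The class and method names (RecordCount, count) are hypothetical, and the list in main is stand-in data rather than the real table.

```java
import java.util.Arrays;
import java.util.Iterator;

public class RecordCount {

    // Drains the iterator and returns how many records it yielded.
    // Assumption: a Kite DatasetReader can be passed in here as an Iterator.
    static <E> long count(Iterator<E> records) {
        long n = 0;
        while (records.hasNext()) {
            records.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Stand-in data; in the real case the iterator would come from the dataset reader.
        Iterator<String> standIn = Arrays.asList("a", "b", "c").iterator();
        System.out.println(count(standIn) + " records read");
    }
}
```

Comparing this number with the row count reported by the HBase shell for the same table is how the gap shows up; the reader should be closed once the count is done.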
Please advise on how to get all records and whether this is a defect in Kite. Using the native HBase API is not an option, because the Kite encoding is challenging to work with outside of Kite.
EDIT: we do not seem to see this issue on a single-node HBase, only on an HBase cluster with Kerberos auth.
12-08-2016 08:13 PM
So ... after a long hiatus. Turns out this is actually https://issues.apache.org/jira/browse/HBASE-13262
I was using hbase-client 0.96 with HBase 1.0.0 (CDH 5.5), and we had tables housing large XML payloads, which forced the bug to manifest when hbase.client.scanner.caching was set to a high value.
There are multiple ways to fix this: