Kite Datasets SDK (HBase) - Datasets.load() and DatasetReader.newReader() not returning full dataset
- Labels: Apache HBase, Kerberos
Created on ‎09-19-2016 01:09 PM - edited ‎09-16-2022 03:39 AM
Hello,
I have several HBase tables defined using Avro schemas and I am trying to write a simple Java function to return the entire dataset for a given table (all records).
I'm doing something like this (assume the "Customer" Avro schema has been defined):
RandomAccessDataset<Customer> customers = Datasets.load(
    PropertyManager.getDatasetURI(HBaseHelper.CUSTOMER), Customer.class);
DatasetReader<Customer> reader = customers.newReader();
According to the API docs, this should return the entire unfiltered dataset. The URI method also uses the "dataset:" scheme, so it is not getting a View.
What I'm seeing is that only a very small subset of the table is actually returned when I iterate over the reader: about 20 of the ~15,000 records actually in the table, roughly 0.1%.
Please advise on how to get all records, and whether this is a defect in Kite. Using the native HBase API is not an option because the Kite encoding is challenging to work with outside of Kite.
EDIT: we do not seem to see this issue on a single-node HBase, only on an HBase cluster with Kerberos auth.
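For reference, the full read path looks roughly like this. This is a sketch: `Customer` is the Avro-generated class from our schema, and the `"dataset:hbase:..."` URI shown is illustrative (ours comes from the `PropertyManager` helper in our code):

```java
import org.kitesdk.data.DatasetReader;
import org.kitesdk.data.Datasets;
import org.kitesdk.data.RandomAccessDataset;

public class ReadAllCustomers {
    public static void main(String[] args) {
        // Illustrative URI; the real one comes from our PropertyManager helper.
        RandomAccessDataset<Customer> customers = Datasets.load(
            "dataset:hbase:zk-host:2181/customers", Customer.class);

        DatasetReader<Customer> reader = null;
        int count = 0;
        try {
            // newReader() with no constraints should scan the whole dataset.
            reader = customers.newReader();
            for (Customer c : reader) {
                count++;
            }
        } finally {
            if (reader != null) {
                reader.close();  // always release the underlying scanner
            }
        }
        System.out.println("read " + count + " records");
    }
}
```

This is the pattern where the count comes up far short of the actual row count on the Kerberized cluster.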
Created ‎12-08-2016 08:13 PM
So ... after a long hiatus. Turns out this is actually https://issues.apache.org/jira/browse/HBASE-13262
I was using hbase-client 0.96 with HBase 1.0.0 (CDH 5.5), and we had tables housing large XML payloads, which forced the bug to manifest when hbase.client.scanner.caching was set to a high value.
There are multiple ways to fix this:
- Use hbase-client 0.98+, if you can afford to upgrade without impact.
- Lower the value of hbase.client.scanner.caching in CM (this is what I ended up doing).
- Programmatically, use Scan.setCaching(int) and/or Scan.setMaxResultSize(long) to avoid the region skipping.
