Created 08-08-2018 08:40 AM
I created a Hive Table, that can access another HBase table. Something like this:
create external table hbase_50mio_200regions (rowkey String, valuelist String) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:valueList") TBLPROPERTIES ('hbase.table.name' = 'hbase_50mio_200regions')
The access works and HiveQL queries are fast for small HBase table.
The problems are the queries to huge HBase tables. It takes very long to finish e.g. a simple count (takes >1h on 50 Mio rows HBase table with about 1300 GB of HFiles)!
Is there a way to pass start and stop key to my Hive SELECT or COUNT queries?
As I saw in the following question https://stackoverflow.com/a/40729818 the WHERE conditions won't avoid a full table scan, as they only filter the rows after iterating all rows of the HBase table.
Is this right and/or is there another way to avoid full table scans when accessing my HBase table through Hive?
Created 08-09-2018 07:19 PM
1. COUNT will result in a full table scan and hence the query is slow.
2. Where on the primary key will be fast as it will do a lookup and not a scan.
3. Where used on any column apart from the primary key will result in a HBase full table scan.
4. Analyse table once to speed up count queries. But it will not affect the where on no-primary key.
Created 08-09-2018 07:19 PM
1. COUNT will result in a full table scan and hence the query is slow.
2. Where on the primary key will be fast as it will do a lookup and not a scan.
3. Where used on any column apart from the primary key will result in a HBase full table scan.
4. Analyse table once to speed up count queries. But it will not affect the where on no-primary key.