Support Questions
Find answers, ask questions, and share your expertise

Accessing HBase Table through Hive is very slow

Expert Contributor

I created a Hive Table, that can access another HBase table. Something like this:

create external table hbase_50mio_200regions (rowkey String, valuelist String)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:valueList")
  TBLPROPERTIES ('hbase.table.name' = 'hbase_50mio_200regions')

The access works and HiveQL queries are fast for small HBase table.

The problems are the queries to huge HBase tables. It takes very long to finish e.g. a simple count (takes >1h on 50 Mio rows HBase table with about 1300 GB of HFiles)!

Is there a way to pass start and stop key to my Hive SELECT or COUNT queries?

As I saw in the following question https://stackoverflow.com/a/40729818 the WHERE conditions won't avoid a full table scan, as they only filter the rows after iterating all rows of the HBase table.

Is this right and/or is there another way to avoid full table scans when accessing my HBase table through Hive?

1 ACCEPTED SOLUTION

Accepted Solutions

1. COUNT will result in a full table scan and hence the query is slow.
2. Where on the primary key will be fast as it will do a lookup and not a scan.
3. Where used on any column apart from the primary key will result in a HBase full table scan.
4. Analyse table once to speed up count queries. But it will not affect the where on no-primary key.

View solution in original post

1 REPLY 1

1. COUNT will result in a full table scan and hence the query is slow.
2. Where on the primary key will be fast as it will do a lookup and not a scan.
3. Where used on any column apart from the primary key will result in a HBase full table scan.
4. Analyse table once to speed up count queries. But it will not affect the where on no-primary key.

View solution in original post