Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Accessing HBase Table through Hive is very slow

avatar
Expert Contributor

I created a Hive Table, that can access another HBase table. Something like this:

create external table hbase_50mio_200regions (rowkey String, valuelist String)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:valueList")
  TBLPROPERTIES ('hbase.table.name' = 'hbase_50mio_200regions')

The access works and HiveQL queries are fast for small HBase table.

The problems are the queries to huge HBase tables. It takes very long to finish e.g. a simple count (takes >1h on 50 Mio rows HBase table with about 1300 GB of HFiles)!

Is there a way to pass start and stop key to my Hive SELECT or COUNT queries?

As I saw in the following question https://stackoverflow.com/a/40729818 the WHERE conditions won't avoid a full table scan, as they only filter the rows after iterating all rows of the HBase table.

Is this right and/or is there another way to avoid full table scans when accessing my HBase table through Hive?

1 ACCEPTED SOLUTION

avatar

1. COUNT will result in a full table scan and hence the query is slow.
2. Where on the primary key will be fast as it will do a lookup and not a scan.
3. Where used on any column apart from the primary key will result in a HBase full table scan.
4. Analyse table once to speed up count queries. But it will not affect the where on no-primary key.

View solution in original post

1 REPLY 1

avatar

1. COUNT will result in a full table scan and hence the query is slow.
2. Where on the primary key will be fast as it will do a lookup and not a scan.
3. Where used on any column apart from the primary key will result in a HBase full table scan.
4. Analyse table once to speed up count queries. But it will not affect the where on no-primary key.