
Underlying HBase table is taking 30+ minutes for small queries triggered from Hive

Explorer

Hello Everybody,

We have an HBase table with around 10 million records, integrated with Hive. Queries through Hive take more than 30 minutes to produce results, while the same lookups directly in HBase are fast. Is there any way to manage this situation?

Or

1. Can I export all the data from HBase to Hive?

2. How can we avoid full scans of HBase tables from Hive?

Sorry for the basic questions

1 ACCEPTED SOLUTION

Master Guru
@Lenu K

One way to avoid full table scans is to filter on the RowKey in your Hive query. If you are filtering on other columns (not only the row key), it is much more efficient to export all the HBase table data into a Hive ORC table and then run your queries against the exported table.
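As a minimal sketch of the row-key point (assuming the HBase row key is mapped to a Hive column named `key`; all table and column names here are placeholders), a predicate on the row-key column can be pushed down to HBase, while a predicate on any other column forces a full scan:

```sql
-- Fast: the row-key predicate is pushed down to HBase as a point lookup
SELECT * FROM db_name.hbase_hive_table
WHERE key = 'row_key_value';

-- Slow: a predicate on a non-key column requires a full HBase scan
SELECT * FROM db_name.hbase_hive_table
WHERE some_other_column = 'value';
```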

Refer to this link and this link for tuning queries against an HBase-backed Hive table.


If the answer helped to resolve your issue, click on the Accept button below to accept the answer.


5 REPLIES 5


Explorer
@Shu

Do you have the steps to export from HBase to a Hive ORC table? I have already tried the performance tuning, and it didn't work out well. Thank you very much for helping.

Master Guru
@Lenu K

We can do the export to a Hive ORC table as follows:

hive> CREATE TABLE <db_name>.<orc_table_name> STORED AS ORC AS SELECT * FROM <db_name>.<hbase_hive_table>;

The above CTAS statement is generic; you can also create a partitioned table, or use DISTRIBUTE BY / SORT BY to control how files are created in the directories.
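As a hedged sketch of the partitioned variant (table, column, and partition names are placeholders; older Hive versions do not allow CTAS to create a partitioned table directly, so this uses a two-step create-then-insert):

```sql
-- Create the partitioned ORC table first, then load it with dynamic partitioning
CREATE TABLE db_name.orc_table_name (col1 STRING, col2 STRING)
  PARTITIONED BY (load_date STRING)
  STORED AS ORC;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE db_name.orc_table_name PARTITION (load_date)
SELECT col1, col2, load_date
FROM db_name.hbase_hive_table
DISTRIBUTE BY load_date;  -- routes each partition value to one reducer, avoiding many small files
```

Queries that filter on `load_date` will then read only the relevant partitions instead of the whole exported table.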

Explorer

Simple and cool. However, the table is updated every other hour, and the CTAS takes a very long time for 900 GB. The plan is to store TBs of data for the initial load, then apply roughly 100 GB of daily incremental inserts/updates/deletes in HBase, and make it all available to business analysts. Right now it takes more than 40 minutes to retrieve results for a single query, while loading the data into HBase takes only 10 to 20 minutes. Any other approach, Shu? Kindly give me some spark.

Master Guru
@Lenu K

1. Using the Spark-HBase connector:

You can use the Spark-HBase connector to pull data from the HBase table with Spark, and store the point in time up to which you have pulled records from the HBase table.

For the next run, read that state, use it as the lower bound and the current time as the upper bound, pull the data from the HBase table, and insert it into the Hive table.

This way we are not creating a full snapshot of the HBase table as a Hive ORC table; instead we incrementally load the data into the Hive table and use the Hive table's data for analytics.
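A hedged sketch of the incremental step (assuming the rows carry an update-timestamp column, here called `update_ts`, and that `${lower_bound}` / `${upper_bound}` are substituted by your job from the saved state — whether the statement is issued through `spark.sql()` against the connector-backed table or through Hive directly):

```sql
-- Append only the rows that changed since the last run
INSERT INTO TABLE db_name.orc_table_name
SELECT *
FROM db_name.hbase_hive_table
WHERE update_ts >  '${lower_bound}'
  AND update_ts <= '${upper_bound}';
```

After a successful run, persist `${upper_bound}` as the new lower bound for the next run.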

2. Using the Hive MERGE strategy:

You can use the Hive MERGE strategy introduced in HDP 2.6, but for this your Hive table needs to have transactions enabled.

MERGE INTO transactional_table USING <hbase_hive_table> ... etc.

For more details, refer to this link.
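A hedged sketch of a full MERGE (all table and column names, the join key, and the `is_deleted` flag are placeholder assumptions; the target must be a bucketed, ORC, transactional table):

```sql
MERGE INTO db_name.transactional_table AS t
USING db_name.hbase_hive_table AS s
ON t.key = s.key
WHEN MATCHED AND s.is_deleted = true THEN DELETE
WHEN MATCHED THEN UPDATE SET col1 = s.col1, col2 = s.col2
WHEN NOT MATCHED THEN INSERT VALUES (s.key, s.col1, s.col2, s.is_deleted);
```

This lets one statement apply inserts, updates, and deletes from the HBase-backed table into the transactional Hive table in a single pass.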

Another way using Hive would be the CTAS mentioned above in the comments: the first run will take more time, but from the second run onwards you only pull the incremental records from the HBase table and load them into the Hive ORC table. (If you follow this approach, using the Spark-HBase connector will give better performance.)

3. Using Apache Phoenix:

Use Apache Phoenix to get the data from the HBase table: the Phoenix table is pointed at the HBase table and allows you to run SQL queries on top of the data stored in HBase.
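A hedged sketch of mapping an existing HBase table into Phoenix (the table name, column family `cf`, and column names are placeholders; a view leaves the underlying HBase data untouched):

```sql
-- Map the existing HBase table as a read-only Phoenix view;
-- quoted names must match the HBase table/qualifiers exactly
CREATE VIEW "hbase_table_name" (
  pk VARCHAR PRIMARY KEY,
  "cf"."col1" VARCHAR,
  "cf"."col2" VARCHAR
);

-- Then query it with standard SQL; row-key predicates stay fast
SELECT "col1", "col2" FROM "hbase_table_name" WHERE pk = 'row_key_value';
```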

Difference between Hive-HBase integration vs Phoenix-HBase integration