Hive and Hbase table

Expert Contributor

Can I implement the following scenario?

1. Keep only one copy of the data

2. UPDATE/DELETE/INSERT in HBase

3. Query the table in Hive

4. How does query performance in Hive on HBase compare to ORC?

5. Or should I just turn on ACID in Hive to implement the above?

Thanks

1 ACCEPTED SOLUTION

Super Guru
What is your use case? What type of data? Hive ACID performance will likely be slower than Hive on top of HBase, especially if you access data by HBase row key.
Before I recommend Hive/ORC vs HBase, I'd like to understand your use case better. Here is what I say about HBase:
When to use HBase:
•Storing large amounts of data (TB/PB)
•High throughput for a large number of requests
•Storing unstructured or variable column data
•Big Data with random reads and writes
•Well suited for sparse rows where the number of columns varies
•Highly available and scalable (since it runs on HDFS)
When NOT to use HBase:
•The problem is not a Big Data problem (only use HBase with Big Data problems)
•If you have data for only one or two nodes, HBase is likely not the tool you should be using to begin with
•Reading straight through files
•Writing all at once or appending new files
•No random reads or writes
•Access patterns of the data are ill-defined
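
To illustrate why access by row key matters, here is a toy sketch in plain Python. It is not HBase code and not how HBase is implemented internally; it only relies on the fact that HBase keeps rows sorted by row key. The data, the `user000010`-style keys, and both function names are made up for the example.

```python
from bisect import bisect_left

# Hypothetical data set: 100,000 rows, already in row-key order
# (the zero-padded keys sort lexicographically, like HBase row keys).
rows = [(f"user{n:06d}", {"visits": n % 7}) for n in range(100_000)]
keys = [key for key, _ in rows]

def get_by_row_key(row_key):
    """Point lookup: binary search over the sorted keys, O(log n) comparisons."""
    i = bisect_left(keys, row_key)
    if i < len(keys) and keys[i] == row_key:
        return rows[i][1]
    return None

def scan_by_value(visits):
    """Filter on a non-key column: every row has to be examined, O(n)."""
    return [key for key, value in rows if value["visits"] == visits]

print(get_by_row_key("user000010"))
print(len(scan_by_value(0)))
```

A query that filters on the row key behaves like the first function, while a query that filters on any other column behaves like the second, which is roughly where a columnar format such as ORC tends to do better.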


6 REPLIES

Hello

You can definitely load data into HDFS and then into HBase through Hive. You can also query HBase through Hive using the HBase storage handler.

Please refer here for more detailed explanation: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
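
As a rough sketch of what that mapping looks like, the Python helper below just builds the HiveQL text for an external Hive table on top of an existing HBase table. The storage handler class and the `hbase.columns.mapping` / `hbase.table.name` properties come from the wiki page above; the helper name, the table names, the columns, and the `cf` column family are placeholders I made up.

```python
def hbase_backed_table_ddl(hive_table, hbase_table, key_column,
                           value_columns, column_family="cf"):
    """Build HiveQL that maps an existing HBase table into Hive.

    key_column:    (name, hive_type) for the HBase row key
    value_columns: list of (name, hive_type), all stored in one column family
    """
    columns = [key_column] + list(value_columns)
    column_defs = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    # ":key" binds the first Hive column to the HBase row key;
    # every other column maps to "family:qualifier".
    mapping = ",".join(
        [":key"] + [f"{column_family}:{name}" for name, _ in value_columns]
    )
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({column_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )

# Hypothetical table: a "customers" HBase table exposed to Hive.
ddl = hbase_backed_table_ddl(
    "customers_hbase", "customers",
    ("id", "string"), [("name", "string"), ("city", "string")],
)
print(ddl)
```

The printed statement would still have to be run in Hive (for example from beeline) against a cluster where the HBase table already exists.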

If this is derived from a Hive table it has a schema, so I would also consider the Hive / Phoenix storage handler: https://phoenix.apache.org/hive_storage_handler.html

From a performance standpoint, querying HBase through Hive should overall be less performant than querying ORC tables. That being said, it depends on the query pattern and on what the use case is.

regards

Expert Contributor

Thanks @nmaillard. And how about ACID performance?

Expert Contributor

Our HDP 2.5's Phoenix version is 4.7.

Expert Contributor

@mqureshi, Thanks for your response.

We use Sqoop to import data from Oracle tables into HDFS (as Hive external tables), and then insert it into ORC tables in Hive to support data analytics. ACID is currently not turned on in our Hive. Most of the tables are currently smaller than 1 TB. Now there is a requirement to update the imported table data in Hive, because the source data gets updated. I searched the web and found that ACID does not seem to perform well on updates, and that ACID tables are not recognized outside of Hive (e.g. by Spark). We are looking for the best-performing approach. So I am considering implementing it with the HBase storage handler, or with sqoop merge?

Super Guru

@Huahua Wei

HBaseStorageHandler is what is required to read HBase tables from Hive. At the end of the day, you first have to create and manage HBase and then use Hive on top of it. Since you are going to be doing updates, this might be the best way to go about it, but I would strongly recommend looking at the following approach first. The reason is probably my personal preference for not using HBase until it is required, as it is complex and the skill set required to implement it successfully is difficult to find. That being said, in your use case, if you don't like the following approach, I'd prefer HBase over Hive ACID.

http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/
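
As I understand the linked post, the four steps are ingest, reconcile, compact, and purge, and the core of it is the reconcile step: combine the base table with the newly imported changes and keep only the latest version of each row. Below is a minimal sketch of that idea in plain Python rather than HiveQL. The `id` and `modified_date` column names and the sample rows are assumptions for illustration, and the sketch does not handle deletes.

```python
from datetime import date

def reconcile(base_rows, incremental_rows, key="id", modified="modified_date"):
    """Return one row per key, keeping the most recently modified version."""
    latest = {}
    for row in list(base_rows) + list(incremental_rows):
        current = latest.get(row[key])
        # ">=" lets the later-listed (incremental) row win a timestamp tie.
        if current is None or row[modified] >= current[modified]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

# Hypothetical base table (first full import) and incremental import.
base = [
    {"id": 1, "city": "Austin", "modified_date": date(2016, 1, 1)},
    {"id": 2, "city": "Boston", "modified_date": date(2016, 1, 1)},
]
incremental = [
    {"id": 2, "city": "Denver", "modified_date": date(2016, 3, 1)},  # updated row
    {"id": 3, "city": "Miami", "modified_date": date(2016, 3, 1)},   # new row
]

reporting = reconcile(base, incremental)
for row in reporting:
    print(row["id"], row["city"])
```

In Hive the same result would typically be materialized into a reporting table, which then replaces the base table before the next incremental import.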