Created 12-12-2015 06:37 AM
Created 12-14-2015 09:18 PM
We did not add transactions to Hive to take workloads out of HBase. HBase is great if you want to do lots of point lookups and range scans. Hive is much better for full table scan workloads. Traditional data warehousing queries fall into the full table scan category (for example, find the year-over-year average sale by store). These types of operations still require transactions, such as when you want to stream data in from your transactional stores or when you need to update dimension tables. We added transactions to Hive to enable these use cases, so that Hive could be better at what it does as a data warehouse.
If your use case more closely approximates a traditional transactional workload (e.g. a shopping cart), definitely don't use Hive.
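To make the "full table scan" category concrete, here is a minimal Python sketch of the year-over-year average sale by store. It is not Hive code; the `SALES` rows, the store names, and both function names are made up for illustration. The point is only that the answer needs every row to be read once, which is the access pattern Hive is built for, rather than a lookup by key.

```python
from collections import defaultdict

# Hypothetical sales rows: (store, year, amount). Values are invented for illustration.
SALES = [
    ("north", 2014, 100.0),
    ("north", 2014, 300.0),
    ("north", 2015, 250.0),
    ("north", 2015, 350.0),
    ("south", 2014, 80.0),
    ("south", 2015, 120.0),
]


def average_sale_by_store_and_year(rows):
    """Full scan: read every row once and fold it into a running sum and count."""
    totals = defaultdict(lambda: [0.0, 0])  # (store, year) -> [sum, count]
    for store, year, amount in rows:
        acc = totals[(store, year)]
        acc[0] += amount
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def year_over_year_change(averages, store, year):
    """Average sale in `year` minus the average sale in the previous year."""
    return averages[(store, year)] - averages[(store, year - 1)]


averages = average_sale_by_store_and_year(SALES)
for store in ("north", "south"):
    print(store, year_over_year_change(averages, store, 2015))
```

No index or key design helps here, because every row contributes to the result. That is the opposite of the shopping cart case, where each request touches a handful of rows by key.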
Created 12-12-2015 12:48 PM
Hive and HBase are two different animals. There are users who want to stick with Hive for various reasons, and that's why a lot of effort is going into making Hive better with respect to performance and SQL capabilities.
HBase is for very fast lookups and random access. Hive is for analytical queries, while HBase is for real-time querying.
Apache Phoenix helps a lot when working with HBase, as it provides a SQL layer for running queries against HBase tables. Without it, users have to go through the HBase shell to run lookups.
Created 12-14-2015 05:08 PM
Neeraj has rightly pointed out that Hive and HBase solve different problems. With Phoenix, HBase now also has a SQL layer. Features like row versions give HBase an additional advantage. The underlying architecture of HBase is much more granular and can scale to a large number of transactions along with random access. In my experience, large-scale (non-batch) transactions in Hive will run into performance issues, and if someone also needs random access, then HBase is the only right tool of the two.
@Neeraj Sabharwal Is there any good reason to use Hive for transactions (apart from user comfort)?
Created 12-15-2015 06:30 AM
We use Elasticsearch and HBase for real-time and batch workloads respectively. HBase is used for long-term storage. We ingest more than 6 TB per day into HBase, but only for advanced reporting or long-term data that can be queried via MapReduce. During our HBase benchmarking, real-time queries were not very efficient, so HBase did not work for us as a real-time solution. This is why we use Elasticsearch to store our structured data, which can then be queried very quickly. We do use range scans in HBase, and they perform magically fast, but you have to be very careful how you design your keys.
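As a rough illustration of why key design matters for range scans, here is a small Python sketch. It does not use the HBase client; `SortedTable`, `make_row_key`, the `source#timestamp` layout, and the sample rows are all assumptions made up for this example. The only HBase behaviour it relies on is that rows are stored sorted by row key, compared byte by byte, and that a scan reads a contiguous key range with the stop key excluded.

```python
import bisect


def make_row_key(source_id, epoch_seconds):
    """Hypothetical composite key: source first, then a zero-padded timestamp.

    Zero-padding keeps string order identical to numeric order.
    """
    return f"{source_id}#{epoch_seconds:010d}"


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by key."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key as one contiguous slice."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


table = SortedTable()
for source, ts, value in [("web", 1000, "a"), ("web", 2000, "b"),
                          ("web", 30000, "c"), ("app", 1500, "d")]:
    table.put(make_row_key(source, ts), value)

# All "web" rows with 1000 <= timestamp <= 2000 sit next to each other.
rows = table.scan(make_row_key("web", 1000), make_row_key("web", 2001))
print(rows)

# Without padding, string order no longer matches numeric order.
unpadded = sorted(["web#2000", "web#10000"])
print(unpadded)
```

With the padded key, the scan touches only the rows it returns. With the unpadded key, `web#10000` sorts before `web#2000`, so a time-range scan would either miss rows or have to read far more of the table than needed.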