Support Questions

avijeetd · ‎01-09-2017

Hi All,

I am looking for recommendations on setting up Hive and HBase clusters. Specifically, is it recommended to have separate Hive and HBase clusters, or to put both in the same cluster?

Not just from a capacity perspective, but overall: what would be the design philosophy?

As many HDP components, such as Atlas, now use HBase, is there a recommendation to set up a small HBase cluster within a Hive cluster?

Thanks,

Avijeet

mqureshi · ‎01-25-2017

@Avijeet Dash

I'll try to answer in detail, but before I do, let me give you some context.

Think about the traditional database world (aka the legacy world). Imagine you have a large Oracle/MySQL/DB2 database where you are bringing in your transaction data. These are live transactions, and you have thousands of them per second. This is very time sensitive for you and has been tuned and sized precisely, down to the millisecond level. You know exactly how many transactions happen every second, and any change to the volume of data ingested or the type of queries run can impact your system. You monitor this very closely.

Now imagine, for a second, that you don't have an EDW. Your business comes and says, we would like to run some queries to gain business insights from this data. You say, hold on. I can't let you run these kinds of queries against my transactional system. It is very precise and sized appropriately for what it's doing. If you start running the kind of queries you want to run (multiple joins, aggregations, etc.), then you are going to take away resources meant for my transactions and blow up all my SLAs. Sorry, I can't let you do that. That's when you suggest to the business that what they need is a separate database where they import all this data and model it differently (maybe a lot more indexes and more denormalized). Then they can move data on a nightly basis (ETL) from your transaction system when load is low, and run the type of queries they want to run on their own separate database (let's call it the EDW).

HBase is that transactional system, and Hive is very similar to, if not exactly, that data warehouse. HBase today does not run under YARN (yes, there is Slider, but I haven't seen a production deployment yet). This means it is difficult to manage resources so that HBase SLAs are not impacted when someone runs a big Hive query (think Tableau generating a ridiculous query).

So the answer to your question is: how sensitive is your HBase workload? Do you care if someone slows down your HBase, or vice versa (HBase slowing down Hive)? If you can manage this aspect, then there is nothing wrong with running both in the same cluster. A lot of customers do that - as long as you know what you are doing. However, if you have tight SLAs, then maybe you want to consider separate clusters. It really depends on your use case.
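
To make "managing this aspect" a bit more concrete, here is a rough sketch of the memory side only. Since the RegionServer runs outside YARN, one common approach on a shared worker node is to subtract its memory from what YARN is allowed to hand out, and set `yarn.nodemanager.resource.memory-mb` to the remainder. The function name and all the numbers below are hypothetical, chosen only for illustration, and this does nothing about disk or network contention between Hive jobs and HBase.

```python
def yarn_memory_mb(node_total_mb, os_reserved_mb, regionserver_heap_mb, offheap_mb=0):
    """Memory (in MB) left for YARN containers on a node that also hosts
    an HBase RegionServer. Hypothetical helper, not part of any Hadoop API."""
    available = node_total_mb - os_reserved_mb - regionserver_heap_mb - offheap_mb
    if available <= 0:
        raise ValueError("nothing left for YARN: node is over-committed")
    return available


# Assumed example node: 128 GB RAM, 8 GB kept for the OS and other daemons,
# 32 GB RegionServer heap, no off-heap cache.
budget = yarn_memory_mb(128 * 1024, 8 * 1024, 32 * 1024)
print(f"yarn.nodemanager.resource.memory-mb = {budget}")
```

If the resulting number is too small for the Hive workload you expect, that is usually a sign the two systems want separate nodes or separate clusters.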

avijeetd · ‎01-27-2017

Thanks @mqureshi - that answers my question. However, a number of components, such as Atlas and Falcon, have started using HBase as a metadata store. How should these use cases be viewed?