Created 06-07-2016 04:08 PM
Hi,
I have a few questions about installing HBase. I want to install it on an existing cluster.
1. Do I need to install Region Server on all Datanodes?
2. If I don't install Region Servers on all the datanodes, what will be the impact?
3. Do I need to install Phoenix Query Server on all the Region Server nodes?
4. If I install only 3 Phoenix Query Servers, on 3 of the 20 Region Server nodes, what will be the impact?
5. If I install only 3 Phoenix Query Servers on separate nodes that do not run a Region Server, what will be the impact?
Any help is highly appreciated. Thanks in advance.
Created 06-07-2016 04:17 PM
1) You don't have to.
2) It's relatively benign, since HBase will have most of its data available on the local datanode (because of the HDFS local-first write policy). The biggest problem is that it makes your cluster configuration more complex: you will need two configs, one for HBase nodes and one for non-HBase nodes, and there are potential performance implications when you run heavy workloads in the same cluster as HBase.
3) You don't HAVE to install the Query Server at all. It's optional. Normally the connection goes directly through the HBase API, and a lot of the computation happens on the client side. This is normally the better way to do things; the Query Server is like a proxy that takes over that client-side functionality. In general I would prefer the normal client, but there can be access issues, for example: the normal API needs network access to all the nodes running RegionServers.
4) As said, PQS is like a proxy. Normally the client-side computations are quite light, but that can change if you return large amounts of data from distributed RegionServers, so you would need to keep an eye on CPU usage in PQS. In addition, PQS uses some CPU resources and competes with the other services on the node.
5) Same as 4, but without the additional CPU competition on those nodes.
Most of the time PQS is used because of firewall issues. If you cannot give clients access to all nodes, it is convenient to put PQS on a couple of edge nodes so that you do not have to open up the full cluster to client connections.
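To make the firewall point concrete, here is a rough sketch that lists the host:port pairs a client would need to reach with the normal (thick) client versus the PQS (thin) client. The ports are the common defaults (ZooKeeper 2181, HBase Master 16000, RegionServer 16020, Phoenix Query Server 8765) and may differ on your cluster; all host names and function names are made up for illustration.

```python
# Sketch only: ports are common defaults, host names are hypothetical.
ZK_PORT = 2181            # ZooKeeper client port
MASTER_PORT = 16000       # HBase Master RPC port (HBase 1.x default)
REGIONSERVER_PORT = 16020 # HBase RegionServer RPC port (HBase 1.x default)
PQS_PORT = 8765           # Phoenix Query Server default port


def thick_client_endpoints(zk_hosts, master_hosts, regionserver_hosts):
    """Endpoints the normal Phoenix/HBase client has to be able to reach."""
    endpoints = [(host, ZK_PORT) for host in zk_hosts]
    endpoints += [(host, MASTER_PORT) for host in master_hosts]
    endpoints += [(host, REGIONSERVER_PORT) for host in regionserver_hosts]
    return endpoints


def thin_client_endpoints(pqs_hosts):
    """Endpoints the thin client has to reach: only the Query Servers."""
    return [(host, PQS_PORT) for host in pqs_hosts]


# Hypothetical 20-node cluster with 3 PQS instances on edge nodes.
region_servers = ["worker%02d" % i for i in range(1, 21)]
thick = thick_client_endpoints(["zk1", "zk2", "zk3"], ["master1"], region_servers)
thin = thin_client_endpoints(["edge1", "edge2", "edge3"])
print(len(thick), "endpoints to open for the thick client")
print(len(thin), "endpoints to open for the thin client")
```

With the thin client, only the edge nodes need to be reachable from outside; PQS itself still needs the full set of endpoints from inside the cluster.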
Created 06-07-2016 04:12 PM
It's generally recommended that you install HBase RegionServers on each datanode. RegionServers can benefit from "short-circuit" reads which can only happen when the RegionServer is co-located with the Datanode.
You can run the Phoenix QueryServer on each RegionServer, but it is likely not as necessary as RegionServers. This decision should be based on the capacity requirements of your Phoenix users. One Phoenix QueryServer can easily service many users' workloads.
Having a local RegionServer is not a concern for the Phoenix QueryServer. One QueryServer will talk to many RegionServers for each query. There isn't a notion of locality.
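The RegionServer/DataNode co-location point above can be checked against a planned layout with a small sketch like the following. It only compares host lists; it does not talk to HDFS or HBase, and the function and host names are hypothetical. Note that short-circuit reads also have to be enabled in HDFS (`dfs.client.read.shortcircuit` and `dfs.domain.socket.path`), which this sketch does not check.

```python
def colocation_report(regionserver_hosts, datanode_hosts):
    """Compare a planned RegionServer layout against the DataNode layout."""
    regionservers = set(regionserver_hosts)
    datanodes = set(datanode_hosts)
    return {
        # RegionServers that share a host with a DataNode: local blocks
        # can be read via short-circuit reads on these hosts.
        "colocated": sorted(regionservers & datanodes),
        # RegionServers with no local DataNode: every HDFS read is remote.
        "regionserver_only": sorted(regionservers - datanodes),
        # DataNodes with no RegionServer: blocks stored here are always
        # read over the network by some RegionServer elsewhere.
        "datanode_only": sorted(datanodes - regionservers),
    }


# Hypothetical layout: 5 DataNodes, RegionServers on 3 of them.
report = colocation_report(
    ["dn1", "dn2", "dn3"],
    ["dn1", "dn2", "dn3", "dn4", "dn5"],
)
print(report)
```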
Created 06-07-2016 04:20 PM
You obviously know this better, Josh, but I had the impression that on a typical cluster that is not too full, most data would end up local anyway, since each compaction would put at least one copy on the local datanode. I suppose that changes a bit once you install RegionServers on only a couple of nodes or HDFS gets really full.
Created 06-07-2016 05:40 PM
Yep, you're right: over time, compactions will bring data local and help with locality concerns. Lots of factors play into this, but I'm more of the mindset that just having homogeneous nodes is helpful in deployments. One less thing to worry about 🙂
Created 06-07-2016 04:25 PM
@Benjamin Leonhardi @Josh Elser
Thanks for the quick response. The Phoenix client will already be installed on all the Region Servers, right? And how do firewall issues affect PQS?
Created 06-07-2016 04:40 PM
Phoenix is a library that is part of the HBase installation; if that is what you mean by client, then the answer is yes.
Phoenix is nothing but a client library that translates your JDBC query into HBase calls, plus a server-side library of functions (for example, HBase coprocessors). So if you use the normal non-PQS client, your client (Java) program will do some aggregations and needs access to all RegionServers. However, it's fast, simple, and elegant.
If you cannot give clients access to all the nodes, you can use PQS. You would put PQS on edge nodes, so that clients only connect to the edge PQS servers using the "thin" JDBC client. The PQS server then connects to the RegionServers. It will be a bit slower, since you have an extra step in the middle. However, you can make sure the potentially heavy client-side aggregations happen not in the client but on a dedicated server machine.
So it's a trade-off.
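As an illustration of the two options, the sketch below builds the JDBC URL a client would use in each case: the thick client points at the ZooKeeper quorum, while the thin client points at a PQS instance over HTTP. The helper names and host names are hypothetical, the ports are the usual defaults, and the znode parent depends on your distribution (for example `/hbase` or `/hbase-unsecure`), so treat the values as placeholders.

```python
def thick_jdbc_url(zk_quorum, zk_port=2181, znode="/hbase"):
    """URL for the normal Phoenix client: ZooKeeper quorum, port, znode parent."""
    return "jdbc:phoenix:%s:%d:%s" % (",".join(zk_quorum), zk_port, znode)


def thin_jdbc_url(pqs_host, pqs_port=8765):
    """URL for the thin client: it talks HTTP to a Phoenix Query Server."""
    return "jdbc:phoenix:thin:url=http://%s:%d" % (pqs_host, pqs_port)


# Hypothetical hosts; substitute your own ZooKeeper quorum and edge node.
print(thick_jdbc_url(["zk1", "zk2", "zk3"]))
print(thin_jdbc_url("edge1"))
```

The thick URL only names ZooKeeper, but the client then discovers and connects to the RegionServers directly, which is why it needs network access to them. The thin URL is the only endpoint the thin client ever contacts.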
Created 06-07-2016 04:21 PM
1. Do I need to install Region Server on all Datanodes?
This depends on a few factors, such as the total number of regions in your cluster. Each region server can host hundreds of regions. Use that to determine the number of region servers you need, which may be lower than the number of DataNodes.
2. If I don't install Region Servers on all the datanodes, what will be the impact?
Some data accesses (see Josh's answer above) would not be able to use short-circuit reads.
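The sizing idea in answer 1 comes down to simple arithmetic. A minimal sketch, assuming you already have an estimate of the total region count and a target number of regions per region server (both numbers in the example are made up, and the function name is hypothetical):

```python
def region_servers_needed(total_regions, regions_per_server):
    """Smallest number of region servers that keeps each at or below the target."""
    if total_regions < 0 or regions_per_server <= 0:
        raise ValueError("total_regions must be >= 0 and regions_per_server > 0")
    # Ceiling division using integers only.
    return -(-total_regions // regions_per_server)


# Example: 3000 regions at roughly 200 regions per server, on a 20-DataNode cluster.
needed = region_servers_needed(3000, 200)
print(needed, "region servers needed out of 20 DataNodes")
```

This only gives a lower bound from region count; memory, request load, and the locality points raised in the other replies may still argue for running a region server on every DataNode.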