Created 04-26-2016 06:54 PM
The installed version of HDP is 2.3.4. How can I load balance Spark Thrift Servers on HWX?
Created 04-28-2016 06:22 PM
Hello Kavita,
I have not found any doc on putting a load balancer in front of STS when the cluster is kerberized (hence the post here 🙂 ).
HiveServer2
Load balancing in front of HiveServer2 in a kerberized environment can be achieved via ZooKeeper-based dynamic service discovery -- see the doc here for how it works: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_hadoop-ha/content/ha-hs2-service-discove...
This worked out of the box on HDP-2.3 (all the necessary configuration was already set in hive-site); the props are:
hive.server2.support.dynamic.service.discovery=true
hive.server2.zookeeper.namespace=sparkhiveserver2
hive.zookeeper.quorum=zk_host1:port1,zk_host2:port2,zk_host3:port3...
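With those props set, a client can let ZooKeeper pick a live HiveServer2 instance instead of naming one host. The sketch below builds such a JDBC URL; the quorum hosts, port 2181 and the REALM are placeholders, while `serviceDiscoveryMode` and `zooKeeperNamespace` are the standard HiveServer2 JDBC URL parameters for this mode:

```shell
# Sketch: a HiveServer2 JDBC URL using ZooKeeper service discovery
# instead of a fixed host (hostnames, port and REALM are placeholders).
ZK_QUORUM="zk_host1:2181,zk_host2:2181,zk_host3:2181"
ZK_NAMESPACE="sparkhiveserver2"   # must match hive.server2.zookeeper.namespace
URL="jdbc:hive2://${ZK_QUORUM}/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=${ZK_NAMESPACE};principal=hive/_HOST@REALM"
echo "$URL"
# then: beeline -u "$URL"
```

Because the client resolves the actual server host from ZooKeeper before connecting, `_HOST` expands to a real HiveServer2 FQDN, which is what makes this approach compatible with Kerberos.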
Spark Thrift Server
I have replicated a similar configuration in my /etc/spark/conf/hive-site.xml but it did not work. It appears this functionality is currently being added to Apache Spark (so we will have to wait a bit longer for it to be included in the HWX distro). See:
So for now... no load balancing for STS if the cluster is kerberized; otherwise haproxy, httpd + mod_jk or any other load balancer will probably do the job.
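For the non-kerberized case, a haproxy setup can be as simple as TCP round-robin over the STS instances. This is only a minimal sketch; the hostnames and port 10001 are assumptions taken from the connection strings earlier in this thread:

```
frontend sts_front
    bind *:10001
    mode tcp
    default_backend sts_back

backend sts_back
    mode tcp
    balance roundrobin
    server sts1 sts_host1:10001 check
    server sts2 sts_host2:10001 check
```

`mode tcp` is used rather than `mode http` because the Thrift binary protocol is not HTTP; session stickiness (e.g. `balance source`) may be needed for clients that send several requests on one logical session.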
Cheers!
Created 04-26-2016 07:29 PM
@kavitha velaga You can use a virtual or physical load balancer with methods such as round robin, ratio, dynamic ratio, least connections, etc. Does that help?
Created 04-27-2016 09:41 PM
Is it possible to implement load balancing in front of multiple Spark Thrift Servers (STS) if the cluster is kerberized? I.e.: how to get around the fact that the host-specific principal has to be mentioned in the connection string? See attempts below (a kinit was done beforehand -- the Linux user has a valid TGT):
#1 Direct connection to STS (no load balancer):
Both of these connection strings work since the keytab for hive/sts_host1_fqdn@REALM is present on sts_host1.
$ beeline -u "jdbc:hive2://sts_host1:10001/default;principal=hive/sts_host1_fqdn@REALM"
or
$ beeline -u "jdbc:hive2://sts_host1:10001/default;principal=hive/_HOST@REALM"
(_HOST will resolve to the sts_host1's fqdn).
#2 Connection via Load Balancer to one of the STSs:
This will only work if the load balancer forwards the request to sts_host1 (since only sts_host1 has the keytab for hive/sts_host1_fqdn@REALM).
$ beeline -u "jdbc:hive2://sts_loadbalancer_host:10001/default;principal=hive/sts_host1_fqdn@REALM"
...
Error: Could not open client transport with JDBC Uri: jdbc:hive2://sts_loadbalancer_host:10001/default;principal=hive/sts_host1_fqdn@REALM: Peer indicated failure: GSS initiate failed (state=08S01,code=0)
This seemed like a good solution but does not work at all, regardless of which STS the request is forwarded to. (It seems _HOST is resolved to the load balancer's FQDN -- there is no keytab for this. We also tried creating a principal lb/lb_fqdn@REALM, placing its keytab on the servers in /etc/security/keytabs and using this principal in the connection string, but this did not solve the issue.)
$ beeline -u "jdbc:hive2://sts_loadbalancer_host:10001/default;principal=hive/_HOST@REALM"
...
16/04/26 15:37:33 [main]: ERROR transport.TSaslTransport: SASL negotiation failure javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - UNKNOWN_SERVER)]
Finally, we tried to specify Spark's principal in the connection string since it is not host-dependent, but this principal is refused as it does not 'contain 3 parts' separated by '/' and '@' (i.e. name/host_fqdn@REALM).
$ beeline -u "jdbc:hive2://sts_loadbalancer_host:10001/default;principal=spark-cluster_id@REALM"
...
Kerberos principal should have 3 parts: spark-cluster_id@REALM
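The "3 parts" rejection above happens client-side, before any Kerberos exchange: the client requires a service principal of the form name/host_fqdn@REALM. A small illustrative (hypothetical) shell function mirroring that check:

```shell
# Illustrative sketch of the client-side principal-format check:
# a service principal must look like name/host_fqdn@REALM.
check_principal() {
  case "$1" in
    */*@*) echo "ok" ;;
    *)     echo "Kerberos principal should have 3 parts: $1" ;;
  esac
}
check_principal "hive/sts_host1_fqdn@REALM"   # ok
check_principal "spark-cluster_id@REALM"      # rejected: no '/' part
```

This is why Spark's host-independent principal cannot be used as a workaround: it simply never reaches the server.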
Thanks for posting a reply if you have mastered the kerberized-loadbalanced-spark-thrift-server dragon in the past!
Then there will be the question of session stickiness for beeline / JDBC connections sending more than one request, but one problem at a time... 🙂
Created 04-28-2016 03:53 PM
Thank you all. Raphael, I didn't find the documentation for this. Can you please send me the link?
Created 04-28-2016 07:41 PM
@Raphael Vannson Great analysis. Is this only true for a kerberized cluster?
Created 04-28-2016 08:12 PM
These are the current STS and HS2 load-balancing capabilities of HDP for a kerberized cluster that I am aware of.
For a non-kerberized cluster: haproxy, httpd + mod_jk or any other soft/hard load balancer will probably do the job.
Created 04-29-2016 02:08 AM
@Raphael Vannson you mean aren't correct?