I am just getting started with enabling Hive on Spark and have my first question.
The documentation says: "For Hive to work on Spark, you must deploy Spark gateway roles on the same machine that hosts HiveServer2. Otherwise, Hive on Spark cannot read from Spark configurations and cannot submit Spark jobs." I don't think I understand this requirement completely.
I have a 1+3 node cluster, where the Spark Gateway role is running on all nodes while HiveServer2 runs only on the master node.
Do I need HiveServer2 running on all four nodes?
If no, does this mean I can only submit jobs through node 1 (the master), since HiveServer2 exists only on node 1 (the master)?
I'll try to shed some light on the mystical art of using HoS (Hive on Spark).
First of all, no, you don't need to install HS2 (HiveServer2) on all nodes in the cluster. Having the Spark Gateway role on all nodes is a fine setup; the docs just want to make sure there is one on the same node where HS2 is running.
As for the other question, that's also a negative. HS2 uses a client-server architecture, so you can run the client (beeline) on any node in the cluster and connect to HS2 to submit a query for execution. The job (query) will then get executed on whichever worker nodes the cluster's resource manager assigns.
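To make the client-server point concrete, here is a minimal sketch of submitting a query from any cluster node via beeline. The hostname `master`, the database name, and the table are placeholder assumptions; `10000` is the default HiveServer2 port:

```shell
# Connect to HiveServer2 from any node in the cluster.
# "master" is a placeholder for the host where HS2 actually runs;
# 10000 is the default HS2 thrift port.
beeline -u "jdbc:hive2://master:10000/default" -n myuser

# Inside the beeline session (illustrative; my_table is a placeholder):
#   SET hive.execution.engine=spark;
#   SELECT COUNT(*) FROM my_table;
# The query is compiled by HS2 and executed as a Spark job on the cluster,
# not on the node where you ran beeline.
```

Note that only the connection goes to the HS2 host; the actual computation is distributed by Spark, so where you launch beeline has no bearing on where the work runs.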