I am trying to run a Hive query with pyspark. I am using Hortonworks so I need to use the Hive WarehouseConnector.
Running one or even multiple queries is easy and works. My problem is that I want to issue set commands before. For instance to set the dag name in tez ui:
set hive.query.name=something relevant
or to set up some memory configuration:
set hive.tez.container.size = 8192
For these statements to take effect, they need to run on the same session than the main query and that's my issue.
I tried 2 ways:
The first one was to generate a new hive session for each query, with a properly setup url eg.:
url='jdbc:hive2://hiveserver:10000/default?hive.query.name=relevant' builder = HiveWarehouseSession.session(self.spark) builder.hs2url(url) hive = builder.build() hive.execute("select * from whatever")
It works well for the first query, but the same url is reused for the next one (even if I try to manually delete builder and hive), so does not work.
The second way is to set
globally in the spark thrift server. his does seem to work, but I find it a shame to limit the global spark thrift server for the benefit of one application only.
Is there a way to achieve what I am looking for? Maybe there could be a way to pin a query to one executor, so hopefully one session?