Run multiple hive query from pyspark on the same session

I am trying to run a Hive query with pyspark. I am using Hortonworks, so I need to use the Hive Warehouse Connector.

Running one or even multiple queries is easy and works. My problem is that I want to issue set commands beforehand, for instance to set the DAG name in the Tez UI:

set hive.query.name=something relevant

or to set up some memory configuration:

set hive.tez.container.size = 8192

For these statements to take effect, they need to run in the same session as the main query, and that is my issue.

I tried 2 ways:

The first one was to generate a new Hive session for each query, with a properly set up URL, e.g.:

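from pyspark_llap import HiveWarehouseSession

# Pass the Hive setting through the configuration part of the JDBC URL
url = 'jdbc:hive2://hiveserver:10000/default?hive.query.name=relevant'
builder = HiveWarehouseSession.session(self.spark)
builder.hs2url(url)
hive = builder.build()
hive.execute("select * from whatever")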

It works well for the first query, but the same URL is reused for the next one (even if I try to manually delete builder and hive), so it does not work for subsequent queries.
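
For reference, here is a minimal sketch of how such a URL could be assembled when more than one setting is needed. It assumes the usual HiveServer2 JDBC URL layout, where Hive configuration entries follow the `?` and are separated by `;`. The helper name `build_hs2_url` is hypothetical and not part of the Hive Warehouse Connector, and it does nothing about the URL reuse problem described above.

```python
def build_hs2_url(host, port, database, hive_conf):
    """Build a HiveServer2 JDBC URL that carries Hive configuration settings.

    Hypothetical helper for illustration only. Values are inserted as-is,
    with no escaping, so they should not contain ';', '?' or '#'.
    """
    base = f"jdbc:hive2://{host}:{port}/{database}"
    if not hive_conf:
        return base
    # Configuration entries go after '?' and are separated by ';'
    conf_list = ";".join(f"{key}={value}" for key, value in hive_conf.items())
    return f"{base}?{conf_list}"


url = build_hs2_url(
    "hiveserver",
    10000,
    "default",
    {"hive.query.name": "relevant", "hive.tez.container.size": 8192},
)
print(url)
```

The resulting string could then be passed to builder.hs2url(url) in the same way as the hand-written URL.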

The second way is to set

spark.sql.hive.thriftServer.singleSession=true

globally in the Spark Thrift Server. This does seem to work, but I find it a shame to limit the global Spark Thrift Server for the benefit of one application only.

Is there a way to achieve what I am looking for? Maybe there could be a way to pin a query to one executor, so hopefully one session?

Re: Run multiple hive query from pyspark on the same session

Sorry that I am not answering your question directly, but I am wondering why you want to run Hive queries through pyspark. Why don't you just use SparkSQL?

Cheers
Eric