Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

How to connect to multiple hosts using pyhive or impala

avatar
New Contributor
I already have a working connection through ODBC using Cloudera ODBC Driver for Apache Hive, where I have my DSN set and all I need is to call pyodbc.connect(f"DSN={mydsn}", autocommit=True)
I'd like to use SQLAlchemy, but I'm struggling how to create a working connection url for multiple hosts.
I was trying to base my idea on this guide where I found connection string jdbc:hive2://datanode1:2181,master1:2181,master2:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
However, this
from sqlalchemy import create_engine
query = """SELECT TOP 10 * from eb.mobile_sa"""
conn_url = f'hive://{UID}@host1:2181,host2:2181,host3:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2'
engine = create_engine(conn_url)
with engine.connect() as conn:
    df = pd.read_sql(query, conn)

throws an error 

invalid literal for int() with base 10: '2181,host2:2181,host3:2181 etc.

 

I've also considered cloudera's impyla since Pyhive seems to be currently unsupported (as per its github page).

I'm also unable to connect to a single host

def conn():
return connect(host=host1,
port=10000,
                         timeout=20,
                         user=user)
engine = sqlalchemy.create_engine('impala://', creator=conn)
with engine.connect() as s:
    query = """SELECT TOP 10 * from eb.mobile_sa""""
    df = pd.read_sql(query,s)

throws an error

thrift.transport.TTransport.TTransportException: Could not connect to any of [('10.151.50.42', 10000)]

and that's okay, since I'm not using the default port 10000. However, using port 2181 throws this error:

sqlalchemy.exc.DBAPIError: (impala.error.HiveServer2Error) Failed after retrying 3 times
[SQL: SELECT TOP 10 * from eb.mobile_sa]

Could you please advise?

1 REPLY 1

avatar
Expert Contributor

Hi @Kropiciel 

To connect ODBC driver with multiple HiveServers, you can configure High Availability for Hive using LB, refer below doc for same. The use the LB hostname and port in the connection string

https://docs.cloudera.com/cdp-private-cloud-base/7.1.9/configuring-apache-hive/topics/hive-ha-loadba...

Let us know if this helps.

 

As for the python connection use below code and check how it goes.

 

CODE:

---------

from sqlalchemy import create_engine

#Input Information

host = hive_hostname

port = 10000

schema = schema_name

table = table_name

 

#Execution

engine = create_engine(f'hive://{host}:{port}/{schema}')

engine.execute(QUERY)