Support Questions

Find answers, ask questions, and share your expertise

How to connect to Hive using Python for pytest database testing

New Contributor

Hello Everyone, 

I am a Data Quality professional at a company, and currently, I am developing automated data testing. However, at this initial stage, I am facing difficulties in connecting Python to Cloudera/Hue/Hive.

In manual scenarios, I usually connect to Cloudera/Hue through a web portal like this: https://demo.gethue.com/hue/editor/?type=hive.

Is there anyone here who can assist me?


Thank you

 

6 REPLIES

Community Manager

@Maulz Welcome to the Cloudera Community!

To help you get the best possible solution, I have tagged our Hive experts @james_jones @ggangadharan  who may be able to assist you further.

Please keep us updated on your post, and we hope you find a satisfactory solution to your query.


Regards,

Diana Torres,
Community Moderator


Was your question answered? Make sure to mark the answer as the accepted solution.
If you find a reply useful, say thanks by clicking on the thumbs up button.
Learn more about the Cloudera Community:

Super Collaborator

@Maulz - Check this document for how to query Hive from Python 3. You don't need Hue for this.  https://docs.cloudera.com/cdsw/1.10.5/import-data/topics/cdsw-accessing-data-from-apache-hive.html

This example uses Kerberos, but you can change or remove the authentication settings depending on your authentication requirements.
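For reference, the pattern in that document looks roughly like this (a minimal sketch; the hostname, keytab, and principal below are placeholders, not values from your cluster):

```python
# Hypothetical sketch: querying Hive from Python 3 with Kerberos via PyHive.

def query_hive_kerberos(host="hiveserver2.example.com", port=10000):
    # Deferred import so the sketch reads standalone:
    # pip install pyhive thrift sasl thrift-sasl
    from pyhive import hive

    # A valid Kerberos ticket must already exist, e.g.:
    #   kinit -kt user.keytab user@EXAMPLE.COM
    conn = hive.Connection(
        host=host,
        port=port,
        auth="KERBEROS",
        kerberos_service_name="hive",  # first component of the Hive service principal
    )
    cursor = conn.cursor()
    cursor.execute("SHOW DATABASES")
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    return rows
```

Drop the `auth` and `kerberos_service_name` arguments (or switch `auth` to another mode) if your cluster is not Kerberized.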

 

New Contributor

@james_jones 

I understand what you mean, but can I access it using Python with the LDAP method like Hue?

My reason: with Kerberos, I run into issues creating a ticket in MIT Kerberos because it conflicts with my Microsoft AD user. So I am looking for an alternative that does not use MIT Kerberos.

Is it possible?

Super Collaborator

@Maulz - You can use Knox as a proxy through the cdp-proxy-api topology to connect to Hive or Impala with basic authentication (username/password), like Hue. Using "cdp-proxy-api" assumes Knox is configured for basic authentication instead of SAML, JWT, etc. If it's not, you can manually create a new topology for basic authentication with Hive. Here's how to enable Knox to expose Hive with cdp-proxy-api if it's not configured yet (you may need to select "all" for the transport mode):

https://docs.cloudera.com/cdp-private-cloud-base/7.3.1/securing-hive/topics/hive_secure_knox.html
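A minimal sketch of connecting through Knox from Python, assuming PyHive and Thrift are installed; the gateway URL and credentials are placeholders. Knox fronts HiveServer2 over HTTP transport, so the client needs an HTTP Thrift transport plus a Basic auth header:

```python
import base64

def knox_hive_connect(user, password,
                      url="https://knox.example.com:8443/gateway/cdp-proxy-api/hive"):
    # Deferred imports: pip install pyhive thrift
    from pyhive import hive
    from thrift.transport import THttpClient

    # Use an HTTP Thrift client and attach a Basic auth header to every request.
    transport = THttpClient.THttpClient(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    transport.setCustomHeaders({"Authorization": f"Basic {token}"})
    return hive.connect(thrift_transport=transport)

# The Basic header value is just base64("user:password"):
token = base64.b64encode(b"alice:secret").decode()
```

The exact gateway path depends on how your Knox topology is named; check it in Cloudera Manager.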

It sounds like you have both AD and an MIT KDC. If the user you want to connect to Hive with is in the MIT KDC realm, not AD, you can use a different krb5.conf file, or you can set up a one-way trust between AD and the MIT KDC.

You can create your own krb5.conf configured for the MIT KDC (or even configured for the one-way trust, but the trust has to be established between the AD and MIT KDCs):
export KRB5_CONFIG=/path/to/your/custom_krb5.conf

You can use the default ticket cache file in krb5.conf:
default_ccache_name = FILE:/tmp/krb5cc_%{uid}
or set it as an environment variable (KRB5CCNAME).
You'll need these set for Python as well.
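In Python, for example, those variables can be set on the process before any Kerberos-aware library loads (the config path is the placeholder from above; the cache path is illustrative):

```python
import os

# Point the Kerberos libraries at a custom config and ticket cache.
# Set these *before* importing pyhive/sasl, which load libkrb5.
os.environ["KRB5_CONFIG"] = "/path/to/your/custom_krb5.conf"
os.environ["KRB5CCNAME"] = "FILE:/tmp/krb5cc_dataqa"  # illustrative cache path
```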

 

Community Manager

@Maulz Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.


Regards,

Diana Torres,
Community Moderator



Master Mentor

@Maulz 
Connecting Python to Cloudera, Hive, and Hue involves using libraries and drivers that interface with HiveServer2, the service that allows remote clients to execute Hive queries. There are several methods to connect Python to Cloudera's ecosystem, particularly to access Hive tables through Hue. I'll detail the most common approaches.

Prerequisites

  • Cloudera/Hadoop Cluster: Ensure HiveServer2 is running on your cluster.
  • Default HiveServer2 port: 10000 (verify via Cloudera Manager).
  • Python Environment: Python 3.6+ installed.
  • Authentication: Know your authentication method:
    • Username/password (non-secure).
    • Kerberos (common in enterprise clusters).
    • LDAP.

Below is a detailed, step-by-step guide:

1. Install Required Python Libraries

Use pip to install:

pip install pyhive # Python interface for Hive
pip install thrift # Thrift protocol support
pip install sasl # SASL authentication (for Kerberos)
pip install thrift-sasl # SASL wrapper for Thrift
pip install pykerberos # Kerberos support (if needed)

For Impala connections (used in the Impala example below):

pip install impyla # Python client for Impala

For JDBC-based connections (alternative method):

pip install JayDeBeApi # JDBC bridge

2. Configure Cloudera/Hive

Via Cloudera Manager:

  1. Enable HiveServer2 and ensure it’s running.
  2. Check HiveServer2 Port (default: 10000).
  3. If using Kerberos:
  • Ensure Kerberos is configured in Cloudera.
  • Obtain a Kerberos ticket with your keytab:
kinit -kt <keytab_file> <principal>

3. Connecting Python to Cloudera/Hue/Hive

1. Using PyHive, a Python library designed specifically to work with Hive:

from pyhive import hive

# Connect to Hive server
conn = hive.Connection(
    host='cloudera_host_name',
    port=10000,              # Default HiveServer2 port
    username='your_username',
    password='your_password',
    database='default',      # Your database name
    auth='LDAP'              # Or 'NONE', 'KERBEROS', 'CUSTOM' depending on your setup
)

# Create a cursor
cursor = conn.cursor()

# Execute a query
cursor.execute('SELECT * FROM your_table LIMIT 10')

# Fetch results
results = cursor.fetchall()
print(results)

# Close connections
cursor.close()
conn.close()

2. Using the Impala Connection

If your Cloudera cluster uses Impala:

from impala.dbapi import connect   # pip install impyla

conn = connect(
    host='cloudera_host_name',
    port=21050,              # Default Impala port
    user='your_username',
    password='your_password',
    database='default'       # Your database name
)

cursor = conn.cursor()
cursor.execute('SELECT * FROM your_table LIMIT 10')
results = cursor.fetchall()
print(results)

cursor.close()
conn.close()

3. Integration with Hue

Hue is a web UI for Hadoop, but you can programmatically interact with Hive via its APIs (limited). For direct Python-Hue integration:

  1. Use Hue’s REST API to execute queries:

import requests

# Hue API endpoint (replace with your Hue server URL)
url = "http://<hue_server>:8888/hue/notebook/api/execute/hive"

headers = {"Content-Type": "application/json"}
data = {
    "script": "SELECT * FROM my_table",
    "dialect": "hive"
}

response = requests.post(
    url,
    auth=('<hue_username>', '<hue_password>'),
    headers=headers,
    json=data
)
print(response.json())

4. Troubleshooting

  • Common Issues:
    • Connection Refused:
      • Verify HiveServer2 is running (netstat -tuln | grep 10000).
      • Check firewall rules.
    • Authentication Failures:
      • For Kerberos: Ensure kinit succeeded.
      • For LDAP: Validate credentials.
    • Thrift Version Mismatch:
      • Use Thrift v0.13.0 with Hive 3.x.
  • Logs:
    • Check HiveServer2 logs in Cloudera Manager (/var/log/hive).
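Before digging into authentication errors, a quick reachability check can rule out network problems. A stdlib-only sketch (host and port are placeholders):

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds,
    e.g. port_open('cloudera_host_name', 10000) for HiveServer2."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False, fix connectivity (service down, wrong port, firewall) before touching Kerberos or LDAP settings.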

5. Best Practices

  • Use connection pooling for high-frequency queries.
  • For Kerberos, automate ticket renewal with kinit cron jobs.
  • Secure credentials using environment variables or Vault.
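For the last point, a minimal sketch of reading credentials from environment variables (the names HIVE_USER/HIVE_PASSWORD are illustrative, not a convention of any library):

```python
import os

def hive_credentials():
    # Read credentials from the environment instead of hard-coding them.
    user = os.environ.get("HIVE_USER")
    password = os.environ.get("HIVE_PASSWORD")
    if not user or not password:
        raise RuntimeError("Set HIVE_USER and HIVE_PASSWORD before connecting")
    return user, password

# Demo only; in practice these are exported in the shell or injected by Vault.
os.environ["HIVE_USER"] = "demo"
os.environ["HIVE_PASSWORD"] = "demo-secret"
user, password = hive_credentials()
```

The returned pair can then be passed to hive.Connection(username=..., password=...) without credentials ever appearing in source control.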

Happy hadooping