How to connect to Hive using Python for pytest database testing
Labels:
- Apache Hive
- Apache Impala
- Cloudera Hue
Created 02-28-2025 06:12 AM
Hello Everyone,
I am a Data Quality professional, and I am currently developing automated data tests. However, at this initial stage, I am having difficulty connecting Python to Cloudera/Hue/Hive.
In manual scenarios, I usually connect to Cloudera/Hue through a web portal like this: https://demo.gethue.com/hue/editor/?type=hive.
Is there anyone here who can assist me?
Thank you
Created 02-28-2025 07:51 AM
@Maulz Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our Hive experts @james_jones and @ggangadharan, who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres, Community Moderator
Created 02-28-2025 11:55 AM
@Maulz - Check this document for how to query Hive from Python 3; you don't need Hue for this: https://docs.cloudera.com/cdsw/1.10.5/import-data/topics/cdsw-accessing-data-from-apache-hive.html
That example uses Kerberos, but you can change or remove the authentication settings, depending on your authentication requirements.
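For reference, here is a minimal sketch of that kind of Kerberos-authenticated connection using PyHive (one common client option; the host, port, and service name below are placeholders, and it assumes you already have a valid ticket from kinit):
from pyhive import hive

# Sketch only: Kerberos-authenticated HiveServer2 connection with PyHive.
conn = hive.Connection(
    host='your_hiveserver2_host',    # placeholder
    port=10000,                      # default HiveServer2 port
    auth='KERBEROS',
    kerberos_service_name='hive'     # the HiveServer2 Kerberos service name
)
cursor = conn.cursor()
cursor.execute('SHOW DATABASES')
print(cursor.fetchall())
cursor.close()
conn.close()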
Created 03-02-2025 06:08 AM
I understand what you mean, but can I access it using Python with the LDAP method, the way Hue does?
My reason: when using Kerberos, I run into issues creating a ticket in MIT Kerberos that conflicts with my Microsoft AD user, so I am looking for an alternative that does not use MIT Kerberos.
Is it possible?
Created 03-03-2025 05:43 AM
@Maulz - You can use Knox as a proxy, through the cdp-proxy-api topology, to connect to Hive or Impala with basic authentication (username/password), just like Hue. Using "cdp-proxy-api" assumes Knox is configured for basic authentication rather than SAML, JWT, etc. If it isn't, you can manually create a new topology with basic authentication for Hive. Here's how to enable Knox to expose Hive through cdp-proxy-api if it isn't configured yet (you may want to select "all" for the transport mode):
https://docs.cloudera.com/cdp-private-cloud-base/7.3.1/securing-hive/topics/hive_secure_knox.html
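To illustrate (a sketch, not taken from the linked doc): with a recent impyla version that supports HTTP transport, a connection through the Knox gateway with basic authentication could look roughly like this. The gateway host, port, and http_path are placeholders and depend on your Knox configuration:
from impala.dbapi import connect

# Sketch only: HiveServer2 via Knox (cdp-proxy-api) with username/password.
conn = connect(
    host='your_knox_gateway_host',           # placeholder
    port=8443,                               # typical Knox gateway port
    use_ssl=True,
    use_http_transport=True,                 # Knox proxies HiveServer2 over HTTP
    http_path='gateway/cdp-proxy-api/hive',  # depends on your Knox topology
    auth_mechanism='LDAP',                   # basic username/password auth
    user='your_username',
    password='your_password'
)
cursor = conn.cursor()
cursor.execute('SELECT 1')
print(cursor.fetchall())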
It sounds like you have both AD and an MIT KDC. If the user you want to connect to Hive with is in the MIT KDC realm rather than in AD, you can point to a different krb5.conf file, or you can set up a one-way trust between AD and the MIT KDC.
You can create your own krb5.conf configured for the MIT KDC (or even for the one-way trust, but the trust has to be established between the AD and MIT KDCs):
export KRB5_CONFIG=/path/to/your/custom_krb5.conf
You can use the default ticket cache file in the krb5.conf
default_ccache_name = FILE:/tmp/krb5cc_%{uid}
or set it as an environment variable (KRB5CCNAME).
Of course, you'll need these set for Python as well.
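For example, a small sketch of pointing a Python process at the custom krb5.conf and ticket cache before opening a Kerberos-authenticated connection (both paths are placeholders):
import os

# Sketch only: make the Kerberos libraries used by Python pick up your custom config.
os.environ['KRB5_CONFIG'] = '/path/to/your/custom_krb5.conf'
os.environ['KRB5CCNAME'] = 'FILE:/tmp/krb5cc_youruid'   # placeholder ticket cache

# Any Kerberos-authenticated connection opened after this point
# (for example, PyHive with auth='KERBEROS') will use these settings.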
Created 03-07-2025 07:46 AM
@Maulz Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
Regards,
Diana Torres, Community Moderator
Created on 03-07-2025 03:24 PM - edited 03-07-2025 03:27 PM
@Maulz
Connecting Python to Cloudera, Hive, and Hue involves using libraries and drivers that interface with HiveServer2, the service that allows remote clients to execute Hive queries. There are several ways to connect Python to Cloudera's ecosystem, particularly for accessing the Hive tables you normally query through Hue. I'll detail the most common approaches below.
1. Prerequisites
- Cloudera/Hadoop Cluster: Ensure HiveServer2 is running on your cluster.
- Default HiveServer2 port: 10000 (verify via Cloudera Manager).
- Python Environment: Python 3.6+ installed.
- Authentication: Know your authentication method:
- Username/password (non-secure).
- Kerberos (common in enterprise clusters).
- LDAP.
Below is a detailed, step-by-step guide:
2. Install Required Python Libraries
Use pip to install:
pip install pyhive       # Hive DBAPI client used in the examples below
pip install impyla       # Impala DBAPI client (for the Impala example below)
pip install thrift       # Thrift protocol support
pip install sasl         # SASL authentication (for Kerberos)
pip install thrift-sasl  # SASL wrapper for Thrift
pip install pykerberos   # Kerberos support (if needed)
For JDBC-based connections (an alternative method), you would also need a JDBC bridge library and the Hive JDBC driver.
3. Configure Cloudera/Hive
Via Cloudera Manager:
- Enable HiveServer2 and ensure it’s running.
- Check HiveServer2 Port (default: 10000).
- If using Kerberos:
- Ensure Kerberos is configured in Cloudera.
- Export your Kerberos keytab (see the kinit example below).
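For example, you might obtain a ticket from the keytab before running your Python code (the keytab path and principal are placeholders):
kinit -kt /path/to/your.keytab your_user@YOUR.REALM   # get a ticket from the keytab
klist                                                 # verify the ticket was granted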
4. Connecting Python to Cloudera/Hue/Hive
1. Using PyHive, a Python library designed specifically to work with Hive:
from pyhive import hive

# Connect to the Hive server
conn = hive.Connection(
    host='cloudera_host_name',
    port=10000,                # Default HiveServer2 port
    username='your_username',
    password='your_password',  # Only used with 'LDAP' or 'CUSTOM' auth; omit otherwise
    database='default',        # Your database name
    auth='LDAP'                # Or 'NONE', 'KERBEROS', 'CUSTOM', depending on your authentication setup
)

# Create a cursor
cursor = conn.cursor()

# Execute a query
cursor.execute('SELECT * FROM your_table LIMIT 10')

# Fetch results
results = cursor.fetchall()
print(results)

# Close connections
cursor.close()
conn.close()
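Since your goal is automated data testing with pytest, one rough sketch (all names here are placeholders, and the auth settings should match whichever method you end up using) is to wrap this connection in a pytest fixture:
import pytest
from pyhive import hive

@pytest.fixture(scope='session')
def hive_conn():
    # Placeholder connection details; adjust auth to your environment.
    conn = hive.Connection(
        host='cloudera_host_name',
        port=10000,
        username='your_username',
        password='your_password',
        database='default',
        auth='LDAP'
    )
    yield conn
    conn.close()

def test_your_table_is_not_empty(hive_conn):
    cursor = hive_conn.cursor()
    cursor.execute('SELECT COUNT(*) FROM your_table')   # placeholder table
    (row_count,) = cursor.fetchone()
    assert row_count > 0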
2. Using the Impala Connection
If your Cloudera cluster uses Impala:
from impala.dbapi import connect

conn = connect(
    host='cloudera_host_name',
    port=21050,              # Default Impala port
    user='your_username',
    password='your_password',
    auth_mechanism='LDAP',   # Needed for username/password auth (impyla defaults to 'NOSASL')
    database='default'       # Your database name
)

cursor = conn.cursor()
cursor.execute('SELECT * FROM your_table LIMIT 10')
results = cursor.fetchall()
print(results)

cursor.close()
conn.close()
3. Integration with Hue
Hue is a web UI for Hadoop, but you can programmatically interact with Hive through its APIs (with limitations). For direct Python-to-Hue integration, use Hue's REST API to execute queries:
import requests

# Hue API endpoint (replace with your Hue server URL)
url = "http://<hue_server>:8888/hue/notebook/api/execute/hive"
headers = {"Content-Type": "application/json"}
data = {
    "script": "SELECT * FROM my_table",
    "dialect": "hive"
}

response = requests.post(
    url,
    auth=('<hue_username>', '<hue_password>'),
    headers=headers,
    json=data
)
print(response.json())
5. Troubleshooting
Common Issues:
- Connection Refused:
- Verify HiveServer2 is running (netstat -tuln | grep 10000).
- Check firewall rules.
- Authentication Failures:
- For Kerberos: Ensure kinit succeeded.
- For LDAP: Validate credentials.
- Thrift Version Mismatch:
- Use Thrift v0.13.0 with Hive 3.x.
Logs:
- Check HiveServer2 logs in Cloudera Manager (/var/log/hive).
6. Best Practices
- Use connection pooling for high-frequency queries.
- For Kerberos, automate ticket renewal with kinit cron jobs.
- Secure credentials using environment variables or Vault (see the example below).
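For instance, a small sketch of reading credentials from environment variables instead of hard-coding them (the variable names here are just examples):
import os
from pyhive import hive

# Example environment variable names; set them in your shell or CI/CD system.
conn = hive.Connection(
    host=os.environ['HIVE_HOST'],
    port=int(os.environ.get('HIVE_PORT', '10000')),
    username=os.environ['HIVE_USER'],
    password=os.environ['HIVE_PASSWORD'],
    auth='LDAP'
)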
Happy hadooping
