Created 02-28-2025 06:12 AM
Hello Everyone,
I am a Data Quality professional at a company, and currently, I am developing automated data testing. However, at this initial stage, I am facing difficulties in connecting Python to Cloudera/Hue/Hive.
In manual scenarios, I usually connect to Cloudera/Hue through a web portal like this: https://demo.gethue.com/hue/editor/?type=hive.
Is there anyone here who can assist me?
Thanks you
Created 02-28-2025 07:51 AM
@Maulz Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our Hive experts @james_jones @ggangadharan who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres,Created 02-28-2025 11:55 AM
@Maulz - Check this document for how to query hive from python3. You don't need Hue for this. https://docs.cloudera.com/cdsw/1.10.5/import-data/topics/cdsw-accessing-data-from-apache-hive.html
This example is for using Kerberos, but you can change or remove the authentication settings, depending on your authentication requiremnts.
Created 03-02-2025 06:08 AM
I understand what you mean, but can I access it using Python with the LDAP method like Hue?
My reason Since using Kerberos, I encounter issues when creating a ticket in MIT Kerberos that conflicts with my Microsoft AD user. So, I am looking for an alternative option without using Kerberos MIT.
Is it possible?
Created 03-03-2025 05:43 AM
@Maulz - You can use Knox as a proxy through the cdp-proxy-api topology to connect to Hive or Impala with basic authentication (username/password) like Hue. Using "cdp-proxy-api" assumes Knox is configured for basic authentication instead of SAML or JWT, etc. If it's not, you can manually create a new topology for basic authentication with hive. Here's how to enable Knox to expose Hive with the cdp-proxy-api if it's not configured yet (perhaps select "all" for transport mode:
https://docs.cloudera.com/cdp-private-cloud-base/7.3.1/securing-hive/topics/hive_secure_knox.html
It sounds like you have both AD and MIT KDC. If your user you want to use hive with is in the MIT KDC realm, not AD, you can use a different krb5.conf file, or you can set up a on-way trust between AD and MIT KDC.
You can create your own krb5.conf configured for MIT KDC (or even configured for the one-way-trust, but the trust has to be established between the AD and MIT KDCs).
export KRB5_CONFIG=/path/to/your/custom_krb5.conf
You can use the default ticket cache file in the krb5.conf
default_ccache_name = FILE:/tmp/krb5cc_%{uid}
or set it as an environment variable (KRB5CCNAME).
Of course you'll need these set for python as well.
Created 03-07-2025 07:46 AM
@Maulz Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future. Thanks.
Regards,
Diana Torres,Created on 03-07-2025 03:24 PM - edited 03-07-2025 03:27 PM
@Maulz
Connecting Python to Cloudera, Hive, and Hue involves using libraries and drivers that interface with HiveServer2 the service that allows remote clients to execute Hive queries.There are several methods to connect Python to Cloudera's ecosystem, particularly to access Hive tables through Hue. I'll detail the most common approaches.
Below is a detailed, step-by-step guide:
Use pip to install:
For JDBC-based connections (alternative method):
Connecting Python to Cloudera/Hue/Hive
1.Using PyHive it's a Python library specifically designed to work with Hive
from pyhive import hive
# Connect to Hive server
conn = hive.Connection(
host='cloudera_host_name',
port=10000, # Default HiveServer2 port
username='your_username',
password='your_password',
database='default', # Your database name
auth='LDAP' # Or 'NONE', 'KERBEROS', 'CUSTOM' depending on your authentication setup
)
# Create a cursor
cursor = conn.cursor()
# Execute a query
cursor.execute('SELECT * FROM your_table LIMIT 10')
# Fetch results
results = cursor.fetchall()
print(results)
# Close connections
cursor.close()
conn.close()
If your Cloudera cluster uses Impala:
from impala.dbapi import connect
conn = connect(
host='cloudera_host_name',
port=21050, # Default Impala port
user='your_username',
password='your_password',
database='default' # Your database name
)
cursor = conn.cursor()
cursor.execute('SELECT * FROM your_table LIMIT 10')
results = cursor.fetchall()
print(results)
cursor.close()
conn.close()
3. Integration with Hue
Hue is a web UI for Hadoop, but you can programmatically interact with Hive via its APIs (limited). For direct Python-Hue integration:
Use Hue’s REST API to execute queries:
import requests
# Hue API endpoint (replace with your Hue server URL)
url = "http://<hue_server>:8888/hue/notebook/api/execute/hive"
headers = {"Content-Type": "application/json"}
data = {
"script": "SELECT * FROM my_table",
"dialect": "hive"
}
response = requests.post(
url,
auth=('<hue_username>', '<hue_password>'),
headers=headers,
json=data
)
print(response.json())
Troubleshooting
Logs:
Happy hadooping