Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar
Expert Contributor

This article outlines the steps needed to setup ODBC access to Hive via Apache Knox from a Linux workstation. While there seems to be some "reasonable" documentation other there on setting up this access from Windows, it took quite some time to figure out how to do this from Linux.

For this example, I am installing the Hortonworks Linux ODBC driver on CentOS Linux release 7.2.1511 then verifying it works using both the isql ODBC command line tool and PyODBC.

Installation Steps

1) Install the UnixODBC Dependencies

sudo yum install unixODBC 

2) Download and Install Hortonworks ODBC Linux Driver

I developed this guide using the below version of the driver. Note that the following URL is for the CentOS 7 version of the driver. If you are running an older version of the OS, then visit https://hortonworks.com/downloads/ and use the CentOS 6 driver. Also note that the below URL will likely change as new versions of the driver are released so it is a good idea to check the Hortonworks download page for the latest version when you run this procedure. I can tell you that a college used this procedure on RHEL 6.4 and used the CentOS 6 driver and it worked fine.

wget http://public-repo-1.hortonworks.com/HDP/hive-odbc/2.1.5.1006/centos7/hive-odbc-native-2.1.5.1006-1....
sudo yum install hive-odbc-native-2.1.5.1006-1.el7.x86_64.rpm 

3) Create $HOME/.odbc.ini

You don't necessarily need this file. For example, you could build up a DSN string in code in a Python program that you use to connect to Hive using PyODBC. But, you can also just specify a DSN name that PyODBC then maps to an entry in your $HOME/.odbc.ini file. So, either way, here are the settings that I used to build up a minimal $HOME/.odbc.ini used in this guide.

Simple copy the following values into $HOME/.odbc.ini. If you want to know the details of what each field means and its optional values, you can find it all at the end of the ODBC installation document.

Note that the following sample entries have some fields that you MUST change that are environment and user specific. Namely the host, trustStorePassword, and ClientCert fields. All other fields should be left as they are.

[ODBC Data Sources]
# The below key 'Knox' can be anything but it MUST MATCH the key name used
# in the section below. The right side is just descriptive and can be anything
Knox=DSN for access to the production cluster using Beeline via Knox
[Knox]
# Optional description of this DSN
Description=Hortonworks Knox QA DSN
# Where the actual ODBC driver library is loaded from. This is the default
# location where Hortonworks installs the driver
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
# The hostname of the server running the Knox gateway
Host=api01.quasar.local
# The port used to connect to the Knox gateway
port=8443
# The same httppath value used in a typical beeline connection string that
# connects via Knox. For our cluster, it is always "quasar/jupstats/hive"
HttpPath=quasar/jupstats/hive
# The database you want to be on once beeline connects to Hive. This is basically
# like executing the HQL "use my_database" command after the connection is made.
 schema=default
# Must be 0 because we want to connect directly to Hiveserver2 inside the cluster.
ServiceDiscoveryMode=0
# Must be 2 because we are using hiveserver2 inside the cluster.
HiveServerType=2
# This indicates we need to provide a username and password in the connect string
AuthMech=3
# This indicates the transport mode for Hive is HTTP, which is required for Knox
ThriftTransport=2
# Indicates the connection to Knox uses SSL. Note that our Hive configuration has
# hive.server2.use.SSL=false but, the connection to hiverserver2 is via Knox and,
# to connect to Knox, you must use SSL. So, the ODBC driver is first connecting
# to Knox hence we need SSL=1 here. If you were using ODBC to directly connect
# to hiveserver2 (not via Knox), then, for the configuration we have in our cluster,
# this value would need to be 0
SSL=1
# But NOT two-way SSL meaning no client side authentication - only server-side
TwoWaySSL=0
# These two are the same values used in the beeline connect string. The password is
# the password you used to add the Knox certificate to your local user key store.
# The client certifificate path is just the path to that key store
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
# Must be 1 to support using self-signed certificates. Since we generated and
# self-signed certificate in our cluster, we need this.
AllowSelfSignedServerCert=1
# This prevents ODBC from doing query preprocessing which is not required since we are
# running against a Hive-aware end point
UseNativeQuery=1 

Regarding the use of the UseNativeQuery=1 option, the ODBC docs say:

This option specifies whether the driver uses native HiveQL queries, or converts them into an equivalent form in HiveQL.

  • Enabled (1): The driver does not transform the queries emitted by an application, and executes HiveQL queries directly.
  • Disabled (0): The driver transforms the queries emitted by an application and converts them into an equivalent form in HiveQL.

Note: If the application is Hive-aware and already emits HiveQL, then enable this option to avoid the extra overhead of query transformation.

Hive logs the queries that are run in various places like log files and in the Tez View. Admins need to be able to read these to help debug query execution problems. When UseNativeQuery=1 is set, the logged queries look like what you would normally expect - this:

select count(*) from my_db.some_table where year=2016 and month=10 and day=2 

But, if UseNativeQuery=0 is used, then the same query will end up looking like this:

SELECT COUNT(1) AS `C_455850525f31`  FROM `my_db`.`some_table`  `C_6a7570315f73746174735f63726f5f6361706163697479` WHERE
((`C_6a7570315f73746174735f63726f5f6361706163697479`.`day` = 2) AND (`C_6a7570315f73746174735f63726f5f6361706163697479`.`year` = 2016)
AND (`C_6a7570315f73746174735f63726f5f6361706163697479`.`month` = 10)) 

If you had to debug queries in logs, which would you prefer? Right! ALWAYS set UseNativeQuery=1. Plus, per the docs, this also improves performance by reducing the overhead of unnecessary query transformation.

4) Test your Connection via Command Line

When you installed the unixODBC packages, it came with "isql", an interactive SQL CLI. Run the following command to see if you can connect to Hive via ODBC via Knox using the DSN you just created above. You should get some valid results from your query:

[mpetronic@mpws Downloads]$ echo "show databases" | isql -v -d, Knox <your-Linux-username> <your-Linux-password>
+---------------------------------------+
| Connected!                            |
|                                       |
| sql-statement                         |
| help [tablename]                      |
| quit                                  |
|                                       |
+---------------------------------------+
SQL> show databases
default
test_db
another_db

5) Test your Connection via PyODBC

The following is a simple Python app that uses PyODBC to access the same database using the same DSN. Copy the below sample code to your workstation, read the comments in the script and do what they say to do to be able to run the program. Then run it to verify you can successfully connect and execute a query via Knox. It will print the query results to stdout.

#!/usr/bin/env python
import pyodbc
import argparse
'''
Demonstrates the bare bones code needed to access Hive via Beeline via Knox.
For this sample to work, you must provide either the full ODBC configuration
via a connection string or reference an odbc.ini file that already has a DSN
defined for the connection to Knox you are trying to make.

To go the odbc.ini route, do the following:

1. Create .odbc.ini in your $HOME directory. Note this is a hidden file and that
   is what the ODBC driver expects by default

2. Populate $HOME/.odbc.ini with the following values. These are the minimum values
   needed for the file to work with Knox. I am not going to explain all the values
   in detail. If you want to know what each one means, look them up in this doc:
   https://hortonworks.com/wp-content/uploads/2016/03/Hortonworks-Hive-ODBC-Driver-User-Guide.pdf.
   Also, change the host, trustStorePassword, and ClientCert values as needed for your setup.

[ODBC Data Sources]
# This key 'Knox' can be anything but it MUST MATCH the key name used in the
# section below. The right side is just descriptive and can be anything
Knox=DNS for access to Hive via Knox
[Knox]
# Optional description of this DSN
Description=Hortonworks Knox QA DSN
# Where the actually ODBC driver library is loaded from. This is the default
# location where Hortonworks installs the driver
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
# The hostname of the sever running the Knox gateway
Host=api01.qa
# The port used to connect to the Knox gateway
port=8443
# The same Http path value used in a typical beeline connect string that
# connects to Knox. For Quasar, it is always "quasar/jupstats/hive"
HttpPath=quasar/jupstats/hive
# The default database you want to be one once beeline connects to Hive
schema=jup1_stats
# Must be zero as we want to connect directly to Hiveserver2 inside the cluster.
ServiceDiscoveryMode=0
# Must be 2 because we are using hiveserver2 which is what beeline uses as well
HiveServerType=2
# This indicates we need to provide a username and passwork in the connect string
AuthMech=3
# This indicates the transport mode for Hive is HTTP, which is required for Knox
ThriftTransport=2
# Indicates the connection to Knox uses SSL
SSL=1
# But NOT two-way SSL meaning no client side authentication - only server-side
TwoWaySSL=0
# These two are the same values used in the beeline connect string. This is the
# password you used when you added the Knox certificate to your own trust store
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
# Must be 1 to support the self-signed certificate mode we use for Knox
AllowSelfSignedServerCert=1

3. Install dependencies for this script to run. First, install the unixODBC-devel
   package (via yum, if you can or download the RPM and install that way). This
   is needed for header files that are used build pyodbc. Then install PyODBC using
   "pip install pyodbc==3.0.10". Note that the pyodbc installation needs to build
   some source code so you will also need the gcc tools chain installed.

4. Run this script and provide the require command line options to test your connection
To instead, NOT require the $HOME/.odbc.ini file, you can build a DSN string directly
in code. Below is an example of how you can build a connection string to directly pass
into the pyodbc.connect() method. Using this approach, you DO NOT need to have a .odbc.ini
file present. This gives you full programmatic control of the process if you wish to
use it this way. This simply builds a string like "A=1;B=2;C=3" containing all the
exact same parameters that could be defined in the .odbc.ini file.
'''
CONNECTION_STRING_EXAMPLE = ';'.join('''
Description=Hortonworks Knox DSN
Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so
Host=api01
port=8443
HttpPath=quasar/jupstats/hive
schema=default
ServiceDiscoveryMode=0
HiveServerType=2
AuthMech=3
ThriftTransport=2
SSL=1
TwoWaySSL=0
trustStorePassword=<your-trust-store-password>
ClientCert=/home/mpetronic/keystore/keystore.jks
AllowSelfSignedServerCert=1
uid=<your-linux-username>
pwd=<your-linux-password>
'''.splitlines())
#===============================================================================
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--user', '-u', type=str, required=True,
        help='User name used to connect to Hive with beeline via Knox.')
    parser.add_argument('--password', '-p', type=str, required=True,
        help='Password used to connect to Hive with beeline via Knox.')
    parser.add_argument('--dsn', '-d', type=str, required=True,
        help='The DSN name from $HOME/.odbc.ini to use for this connection.')
    return parser.parse_args()
#===============================================================================
def execute_query(sql, dsn, user, password):
    try:
        connect_string = 'DSN=%(dsn)s;uid=%(user)s;pwd=%(password)s' % locals()
        conn = pyodbc.connect(connect_string, autocommit=True)
        cursor = conn.cursor()
        cursor.execute(sql)
        for row in cursor:
            print row
    except Exception as e:
        print e
    finally:
        cursor.close()
        conn.close()
###########################################################
# MAIN
###########################################################
if __name__ == '__main__':
    args = parse_args()
    execute_query('show databases', args.dsn, args.user, args.password)
15,256 Views
Comments
avatar
New Contributor

Hi,

i was trying to make connection string from hive database via.I created connections string like connect_string = 'Driver=/usr/lib/hive/lib/native/Linux-amd64-64/libhortonworkshiveodbc64.so;Host=xxx;port=8443;HiveServerType=2;AuthMech=3;uid=xxx;pwd=xxx;SSL=1;AllowSelfSignedServerCert=1;CAIssuedCertNamesMismatch=1;TrustedCerts=/usr/lib/hive/lib/native/Linux-amd64-64/cacerts.pem;HttpPath=gateway/default/hive;ThriftTransport=2;ServiceDiscoveryMode=0'

But when i running python code which is mention above i am getting following error:

('IM002', '[IM002] [unixODBC][Driver Manager]Data source name not found, and no default driver specified (0) (SQLDriverConnect)') Traceback (most recent call last):

I am successfully execute code by creating dsn and specify that code but i want to use connection string because i don;t have permission to create ~/.odbc.ini file on machine.

please provide your input on that if i missing something.

Mandar Vaidya

avatar
Expert Contributor

You should be able to export an environment variable and place the odbc.ini file anywhere you want - where you do have write access. Can you try this and then run your test? I never tried making it work completely with just a connect string.

export ODBCINI=/path/to/your/odbc.ini
avatar
New Contributor

Hi, I have specified the absolute path for the keystore at clientcert. While connecting, it prompts [unixODBC][Cloudera][DriverSupport] (1100) SSL certificate verification failed because the certificate is missing or incorrect. 

Is there anyway to test if the keystore JKS file is valid. Please let me know if you encounter the same or is there any solution to fix this. Appreciate your response. 

-Subhani

avatar
New Contributor

https://community.cloudera.com/t5/Support-Questions/Hortonworks-Hive-ODBC-SSL-certificate-verificati...

This helped me! I had to add the entire host certificate chain to /opt/cloudera/hiveodbc/lib/64/cacert.pem file