Cloudera Machine Learning (and Cloudera Data Science Workbench) is built on a very robust and flexible framework to ease integration with third parties. In this article, I decided to explore the integration of Teradata with CML via ODBC.
A few notes before I dive in:
To get a fully rounded tutorial, I will go through these 3 steps:
If you want to do development testing of Teradata, you can use Teradata Vantage Developer on the AWS Market Place. I am not going to dive too much into this as this configuration may change as Teradata evolves.
The Teradata Vantage Developer is an AWS CloudFormation template, that requires the following assets to be pre-created:
$ ssh ec2-user@[your_public_hostname]
ec2-user@SMP001-01:~> bteq
Teradata BTEQ 16.20.00.01 for LINUX. PID: 17205
Copyright 1984-2017, Teradata Corporation. ALL RIGHTS RESERVED.
Enter your logon or BTEQ command:
.LOGON
UserId: dbc
Password: [password_set_in_cf_template]
CREATE user test AS password=test perm = 200000000, SPOOL = 100000000;
CREATE TABLE "test".TEST_TABLE (
COLUMN1 VARCHAR(100),
COLUMN2 VARCHAR(100)
) ;
#Dockerfile
FROM docker.repository.cloudera.com/cdsw/engine:11-cml-2020.04-1
WORKDIR /tmp
#The RUN commands that install an editor
#For example: RUN apt-get install myeditor
RUN apt-get autoremove unixodbc -y
RUN apt-get update -y
RUN apt-get install lib32stdc++6 -y
RUN apt-get install wget -y
RUN wget [LOCATION_OF_DRIVERS]
RUN tar -xzvf tdodbc1620__ubuntu_indep.16.20.00.87-1.tar.gz
RUN dpkg -i tdodbc1620/tdodbc1620-16.20.00.87-1.noarch.deb
RUN apt-get install -y python-pyodbc
pip3 install pyodbc
vi /home/cdsw/.odbc.ini
[ODBC Data Sources]
Teradata ODBC DSN = Teradata ODBC Driver 16.20
TEST = My DSN Description
[TEST]
Driver = /opt/teradata/client/16.20/odbc_64/lib/tdataodbc_sb64.so
Description = Teradata database
DBCName = [your_teradata_hostname]
UID = test
PWD = test
import pyodbc
import pandas as pd
conn = pyodbc.connect('DSN=TEST')
# Define Cursor
cus=conn.cursor()
query = "select * from test.TEST_TABLE;"
# Execute SQL statement to get current datetime and store result in cursor
cus.execute(query)
# Display the content of cursor
row = cus.fetchone()
print(row)
# Use Pandas to execute and retrieve results
df = pd.read_sql(query, conn)
print(df)
The output in the workbench should look something like the following: