Support Questions
Find answers, ask questions, and share your expertise
Check out our newest addition to the community, the Cloudera Innovation Accelerator group hub.

How do I access MySQL/ Postgres (external) databases and tables directly using Apache Spark?


I have installed Cloudera 5.16.2 on Ubuntu 16.04. 


Most of the data I need to work on are in MySQL or Postgresql databases and I want to use Apache Spark to work on this data directly. I prefer working with PySpark and have installed and configured Anaconda through Cloudera Manager. 


How would I connect to MySQL / Postgres using Apache Spark on a Jupyter Notebook? Could you give me a step by step guide to achieve this?


Thank you 


Super Mentor


Hi @jsensharma. Thanks for the reply. I realised that I had not worded my query properly and am actually looking to be able to connect to MySQL / Postgres through Apache spark using a Jupyter Workbook. I am more comfortable working with Python and am not at all familiar with Scala /Java (can't stand them either). Apologies in advance for the vague request earlier




I think the below link should help you achieve that, usually to be able to connect to a remote database the DNS should be able to resolve the  FQDN/IP  in the example in this link it could be like this?


import urllib

params = 'DRIVER={ODBC Driver 13 for SQL Server};' \
         'SERVER=;' \ or ';'
         'PORT=1433;' \
         'DATABASE=nb-database;' \
         'UID=nb-user;' \
params = urllib.parse.quote_plus(params)

Here is a snippet for connecting to Mysql

import MySQLdb

Next, enter your database credentials.

#Enter the values for you database connection
dsn_database = "Enter Database name"  # e.g. "MySQLdbtest"
dsn_hostname = "<Enter Hostname" # e.g.: ""
dsn_port = 3306               # e.g. 3306
dsn_uid = "<Enter UserID>"       # e.g. "user1"
dsn_pwd = "<Enter Password"     # e.g. "Password123"

Now to establish the database connection.

conn = MySQLdb.connect(host=dsn_hostname, port=dsn_port, user=dsn_uid, passwd=dsn_pwd, db=dsn_database)














Thanks @Shelton. Just to check: When Python directly connects with the MySQL database, wouldn't that infer that only Python is working on the tables, and not Spark? This is more so since an ODBC is used and not a JDBC.


Is there a way for Spark to directly access the MySQL database similar to what @jsensharma pointed out above?