I have installed Cloudera 5.16.2 on Ubuntu 16.04.
Most of the data I need to work on are in MySQL or Postgresql databases and I want to use Apache Spark to work on this data directly. I prefer working with PySpark and have installed and configured Anaconda through Cloudera Manager.
How would I connect to MySQL / Postgres using Apache Spark on a Jupyter Notebook? Could you give me a step by step guide to achieve this?
Thank you
Created 11-10-2019 06:56 PM
@Tipo33
You might want to refer to the following Articles / Community discussions.
https://community.cloudera.com/t5/Community-Articles/Connecting-to-an-external-RDBMS-MySQL-using-sca...
https://community.cloudera.com/t5/Support-Questions/Spark-Streaming-save-output-to-mysql-DB/td-p/256...
Created 11-10-2019 07:28 PM
Hi @jsensharma. Thanks for the reply. I realised that I had not worded my query properly and am actually looking to be able to connect to MySQL / Postgres through Apache spark using a Jupyter Workbook. I am more comfortable working with Python and am not at all familiar with Scala /Java (can't stand them either). Apologies in advance for the vague request earlier
Created 11-11-2019 12:35 PM
I think the below link should help you achieve that, usually to be able to connect to a remote database the DNS should be able to resolve the FQDN/IP in the example in this link it could be like this?
https://pjryan126.github.io/jupyter-db-connections/
import urllib params = 'DRIVER={ODBC Driver 13 for SQL Server};' \ 'SERVER=192.168.1.200;' \ or 'SERVER=zephyr.tesco.uk;' 'PORT=1433;' \ 'DATABASE=nb-database;' \ 'UID=nb-user;' \ 'PWD=nb-password;' params = urllib.parse.quote_plus(params)
Here is a snippet for connecting to Mysql
import MySQLdb
Next, enter your database credentials.
#Enter the values for you database connection
dsn_database = "Enter Database name" # e.g. "MySQLdbtest"
dsn_hostname = "<Enter Hostname" # e.g.: "mydbinstance.xyz.us-east-1.rds.amazonaws.com"
dsn_port = 3306 # e.g. 3306
dsn_uid = "<Enter UserID>" # e.g. "user1"
dsn_pwd = "<Enter Password" # e.g. "Password123"
Now to establish the database connection.
conn = MySQLdb.connect(host=dsn_hostname, port=dsn_port, user=dsn_uid, passwd=dsn_pwd, db=dsn_database)
HTH
Created 11-13-2019 12:54 AM
Thanks @Shelton. Just to check: When Python directly connects with the MySQL database, wouldn't that infer that only Python is working on the tables, and not Spark? This is more so since an ODBC is used and not a JDBC.
Is there a way for Spark to directly access the MySQL database similar to what @jsensharma pointed out above?