Created on 11-24-2015 12:49 AM
Spark SQL comes with a nice feature called "JDBC To Other Databases"; in practice, it is a JDBC federation feature.
It can be used to create data frames from JDBC databases using Scala or Python, but it also works directly with the Spark SQL Thrift Server, allowing us to query external JDBC tables seamlessly, just like other Hive/Spark tables.
The examples below use the HDP 2.3.2 sandbox and the Spark 1.5.1 Technical Preview (https://hortonworks.com/hadoop-tutorial/apache-spark-1-5-1-technical-preview-with-hdp-2-3/).
This feature works with spark-submit, spark-shell, Zeppelin, the spark-sql client, and the Spark SQL Thrift Server. This post shows two examples: #1 using the Spark SQL Thrift Server, #2 using spark-shell.
Example #1 using Spark SQL Thrift Server
1- Run the Spark SQL Thrift Server with the MySQL JDBC driver:
sudo -u spark /usr/hdp/2.3.2.1-12/spark/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10010 --jars "/usr/share/java/mysql-connector-java.jar"
2- Open beeline and connect to the Spark SQL Thrift Server:
beeline -u "jdbc:hive2://localhost:10010/default" -n admin
3- Create a JDBC federated table pointing to an existing MySQL database, using beeline:
CREATE TABLE mysql_federated_sample
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "com.mysql.jdbc.Driver",
  url "jdbc:mysql://localhost/hive?user=hive&password=hive",
  dbtable "TBLS"
);

describe mysql_federated_sample;

select * from mysql_federated_sample;

select count(1) from mysql_federated_sample;
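The same syntax also accepts the Spark JDBC partitioning options (partitionColumn, lowerBound, upperBound, numPartitions), which split the read across multiple connections instead of pulling everything through one. Below is a sketch only: the table name mysql_federated_partitioned is made up, and it assumes TBLS has a numeric TBL_ID column with illustrative bounds; adjust the column and bounds to your own table.

-- partitioned read: Spark issues one query per partition over the TBL_ID range
CREATE TABLE mysql_federated_partitioned
USING org.apache.spark.sql.jdbc
OPTIONS (
  driver "com.mysql.jdbc.Driver",
  url "jdbc:mysql://localhost/hive?user=hive&password=hive",
  dbtable "TBLS",
  partitionColumn "TBL_ID",
  lowerBound "1",
  upperBound "100000",
  numPartitions "4"
);

select count(1) from mysql_federated_partitioned;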
Example #2 using spark-shell, Scala code, and data frames:
1- Open spark-shell with the MySQL JDBC driver:
spark-shell --jars "/usr/share/java/mysql-connector-java.jar"
2- Create a data frame pointing to the MySQL table:
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map(
    "driver"  -> "com.mysql.jdbc.Driver",
    "url"     -> "jdbc:mysql://localhost/hive?user=hive&password=hive",
    "dbtable" -> "TBLS"
  )
).load()

jdbcDF.show
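Once loaded, the data frame can also be registered as a temporary table and queried with SQL from the same spark-shell session. A minimal sketch continuing the session above; the column names (TBL_NAME, TBL_TYPE, TBL_ID) assume the Hive metastore TBLS schema and should be adjusted for other source tables:

// register the JDBC-backed data frame as a temporary table
jdbcDF.registerTempTable("mysql_tbls")

// query it with Spark SQL like any other table
sqlContext.sql("select TBL_NAME, TBL_TYPE from mysql_tbls where TBL_TYPE = 'MANAGED_TABLE'").show()

// simple comparison filters like this one can be pushed down to MySQL by the JDBC data source
jdbcDF.filter("TBL_ID > 10").count()

Note that registerTempTable only registers the table for the current SQLContext; no data is copied out of MySQL until a query runs.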
See other Spark JDBC examples and troubleshooting here:
https://community.hortonworks.com/questions/1942/spark-to-phoenix.html
Created on 08-03-2016 07:20 PM
@Guilherme Braccialli But the Spark Thrift Server listens on port 10015, and I think if you connect through port 10000, then it's the HiveServer2 Thrift Server. Maybe I'm missing something; it may be helpful if we can discuss this.
Created on 08-23-2016 07:35 AM
@kishore sanchina In this example I started the Spark Thrift Server on port 10010 and connected with beeline to the same port. You can use the default port 10015 instead.