
How does Spark work with a JDBC connection?

I am new to Spark and I am trying to write a Spark-JDBC program to find the count of rows in a database table. I have come up with this code:


import java.io.FileInputStream
import java.util.Properties

import org.apache.log4j.{Level, LogManager, Logger}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object PartitionRetrieval {
    val conf = new SparkConf().setAppName("Spark-JDBC")
    val log  = LogManager.getLogger("Spark-JDBC Program")
    Logger.getLogger("org").setLevel(Level.ERROR)

    // JDBC connection details are read from a properties file
    val conFile       = "/home/hmusr/ReconTest/inputdir/testconnection.properties"
    val properties    = new Properties()
    properties.load(new FileInputStream(conFile))
    val connectionUrl = properties.getProperty("gpDevUrl")
    val devUserName   = properties.getProperty("devUserName")
    val devPassword   = properties.getProperty("devPassword")
    val driverClass   = properties.getProperty("gpDriverClass")
    val tableName     = "source.bank_accounts"

    // Fail fast if the JDBC driver class is not on the classpath
    try {
        Class.forName(driverClass).newInstance()
    } catch {
        case cnf: ClassNotFoundException =>
            log.error("Driver class: " + driverClass + " not found")
            System.exit(1)
        case e: Exception =>
            log.error("Exception while loading driver class", e)
            System.exit(1)
    }

    def main(args: Array[String]): Unit = {
        val spark   = SparkSession.builder().config(conf).master("yarn").enableHiveSupport().getOrCreate()
        // Register the table as a DataFrame over JDBC
        val gpTable = spark.read.format("jdbc")
            .option("url", connectionUrl)
            .option("dbtable", tableName)
            .option("user", devUserName)
            .option("password", devPassword)
            .load()
        // Count only the rows matching the filter
        val rc = gpTable.filter(gpTable("source_system_name") === "ORACLE").count()
        println("gpTable Count: " + rc)
    }
}


So far, this code is working. But I have some conceptual doubts about this.

  1. In Java, we create a connection and use it to query multiple tables, then close it once our requirement is met. But this appears to work in a different way. If I have to query 10 tables in a database, should I use this line 10 times with a different table name in it each time (see the first sketch after this list):
    val gpTable = spark.read.format("jdbc").option("url", connectionUrl)
                                                    .option("dbtable",tableName)
                                                    .option("user",devUserName)
                                                    .option("password",devPassword).load()
  2. The table used here has about 2,000 rows in total, so I can use the filter/select/aggregate functions on it comfortably. But in production there are tables with millions of rows, and if I put one of those huge tables in the above statement, even though our requirement filters it later, wouldn't it create a huge DataFrame first (see the second sketch below)?
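To make question 1 concrete, here is a minimal sketch of what I imagine the repeated reads would look like if I just wrap the shared JDBC options in a helper; the helper name readGpTable and the second table name are hypothetical, only for illustration:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical helper that reuses the same JDBC options for every table.
    def readGpTable(spark: SparkSession, table: String): DataFrame =
        spark.read.format("jdbc")
            .option("url", connectionUrl)
            .option("user", devUserName)
            .option("password", devPassword)
            .option("dbtable", table)
            .load()

    // Hypothetical table names, for illustration only.
    val tables = Seq("source.bank_accounts", "source.bank_prospects")
    val frames = tables.map(t => t -> readGpTable(spark, t)).toMap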
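For question 2, this is the pattern I am worried about, followed by an alternative I have seen mentioned where the filter is pushed into the database by passing a subquery as dbtable; the table name source.big_table and the subquery are only an example:

    // Pattern I am worried about: load the full table, then filter in Spark.
    val bigTable = spark.read.format("jdbc")
        .option("url", connectionUrl)
        .option("dbtable", "source.big_table")
        .option("user", devUserName)
        .option("password", devPassword)
        .load()
    val oracleCount = bigTable.filter(bigTable("source_system_name") === "ORACLE").count()

    // Alternative I have seen mentioned: let the database do the filtering by
    // passing a subquery as dbtable (subquery and alias are only an example).
    val filtered = spark.read.format("jdbc")
        .option("url", connectionUrl)
        .option("dbtable", "(SELECT * FROM source.big_table WHERE source_system_name = 'ORACLE') src")
        .option("user", devUserName)
        .option("password", devPassword)
        .load()
    println("Filtered count: " + filtered.count())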


Could anyone give me some insight into the doubts I mentioned above?
