How to execute a Spark program that loads a Hive table?

Contributor

I am new to Spark. I learnt how to load a Hive table from spark-shell. I tried to do the same from Eclipse, and here is the program I have written.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SaveMode

object SuperSpark {
  case class partclass(id: Int, name: String, salary: Int, dept: String, location: String)

  def main(args: Array[String]) {
    val warehouseLocation = s"file:${System.getProperty("user.dir")}/spark-warehouse"
    val sparkSession = SparkSession.builder.master("local[2]")
      .appName("Saving data into HiveTable using Spark")
      .enableHiveSupport()
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .config("hive.metastore.warehouse.dir", "/user/hive/warehouse")
      .config("spark.sql.warehouse.dir", warehouseLocation)
      .getOrCreate()
    import sparkSession.implicits._

    // Read the comma-separated input file, map each line to the case class,
    // and append the resulting DataFrame into the Hive table "parttab".
    val partfile = sparkSession.read.textFile("partfile")
    val partdata = partfile.map(p => p.split(","))
    val partRDD  = partdata.map(line => partclass(line(0).toInt, line(1), line(2).toInt, line(3), line(4)))
    val partDF   = partRDD.toDF()
    partDF.write.mode(SaveMode.Append).insertInto("parttab")
  }
}

What I don't understand now is how to execute this program. I'm stuck on these points:

  1. How do I add the connection and db details of the Hive tables in the program? Could anyone tell me how to add those details programmatically?
  2. Should I use 'spark-submit' or just 'run as Scala application' from Eclipse to run the above program?

Champion

Please refer to this link; it should give you a kick start. If you need any more details, let me know.

https://github.com/mahmoudparsian/data-algorithms-book/blob/master/misc/how-to-submit-spark-job-to-y...

Champion
  1. How do I add the connection and db details of the Hive tables in the program? Could anyone tell me how to add those details programmatically?

The code you posted already inserts the data from the file into the table parttab. To change the database, you could use sparkSession.sql("use newdb").
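For instance, a minimal sketch building on your own variables (newdb is only a placeholder name; the database and table have to exist already):

// Switch the current database before writing (newdb is a placeholder)
sparkSession.sql("USE newdb")
partDF.write.mode(SaveMode.Append).insertInto("parttab")

// Or qualify the table name directly
partDF.write.mode(SaveMode.Append).insertInto("newdb.parttab")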

 

  2. Should I use 'spark-submit' or just 'run as Scala application' from Eclipse to run the above program?

Yes, you should configure a Spark runtime and run it in Eclipse first. Once it runs without errors, build it, upload it to the cluster, and use spark-submit to run it on the cluster.
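One thing to watch when you move to spark-submit: a master set in code takes precedence over the --master flag, so a sketch of the builder for the cluster version (reusing the imports and app name from your program) would leave it out:

// Leave the master out of the code and pass it to spark-submit instead (e.g. --master yarn).
// Hardcoding .master("local[2]") would override whatever spark-submit is given.
val sparkSession = SparkSession.builder
  .appName("Saving data into HiveTable using Spark")
  .enableHiveSupport()
  .getOrCreate()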

Contributor

@mbigelow

Regarding the connection details:

  1. I meant, where do we give the 'localhost' or 'ip address' and port number of the database?
  2. I know that the insert statement would insert the data into the Hive table, but where do we give the details from point 1? Could you tell me where I can specify them?

Champion
The line:

enableHiveSupport

I believe this is the 2.0 version of the Hive context. It will read the Hive configs from /etc/hive/conf and get the Hive details. You can change them using the config method of the session, or set them there if you don't have the config files. You can also specify a different location for the config files.

There should also be a getConf method, but I am not positive; you can use it to verify the Hive configs.
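If the config files are not available where the job runs, one option (a sketch only; the host is a placeholder and 9083 is the usual metastore Thrift port) is to point the session at the metastore explicitly and then check what it picked up:

// Point Spark at a remote Hive metastore instead of relying on hive-site.xml.
// "metastore-host" is a placeholder for the actual metastore machine.
val sparkSession = SparkSession.builder
  .appName("Saving data into HiveTable using Spark")
  .config("hive.metastore.uris", "thrift://metastore-host:9083")
  .enableHiveSupport()
  .getOrCreate()

// Inspect what the session actually picked up, e.g. the warehouse settings:
sparkSession.conf.getAll.filter(_._1.contains("warehouse")).foreach(println)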