03-05-2018 01:36 PM
I tried to create a function that reads data from a relational database and inserts it into a Hive table. Since I use Spark 1.6, I need to register a temporary table, because writing a DataFrame directly as a Hive table is not compatible with Hive:
```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

spark_conf = SparkConf()
sc = SparkContext(conf=spark_conf)
sqlContext = HiveContext(sc)

# The JDBC source expects a parenthesized subquery with an alias in 'dbtable'
query = "(select * from employees where emp_no < 10008) as emp_alias"

# url, user and pswd are defined elsewhere in my script
df = sqlContext.read.format("jdbc") \
    .option("url", url) \
    .option("dbtable", query) \
    .option("user", user) \
    .option("password", pswd) \
    .load()

df.registerTempTable('tempEmp')
sqlContext.sql('insert into table employment_db.hive_employees select * from tempEmp')
```
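As a side note, the only non-obvious part of the read is the `dbtable` value: it must be a parenthesized subquery followed by an alias, not a bare SELECT. A tiny plain-Python sketch of that rule (the `jdbc_subquery` helper name is my own, not part of any Spark API):

```python
def jdbc_subquery(sql, alias):
    # The JDBC 'dbtable' option accepts either a table name or a
    # parenthesized subquery with an alias; build the latter form.
    return "({}) as {}".format(sql, alias)

query = jdbc_subquery("select * from employees where emp_no < 10008", "emp_alias")
print(query)  # (select * from employees where emp_no < 10008) as emp_alias
```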
The employees table in the relational database contains a few thousand records. After running my program I can see that two Parquet files are created.
But when I select from the Hive table after the job completes, some records are missing.
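To quantify how many records are lost, one sanity check is to compare row counts on both sides; a sketch using the table names from my code above (run the first statement in the source database and the second in Hive):

```sql
-- In the source RDBMS: how many rows should have been copied
select count(*) from employees where emp_no < 10008;

-- In Hive: how many rows actually arrived
select count(*) from employment_db.hive_employees;
```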
I have several ideas about what could be causing the problem: