Migrating from one Hive table to another Hive table using Spark, with different column names and database, on the same cluster

hadoopsmi — Wed, 02 Nov 2016 20:29:46 GMT

Hive Table:

Original table

Database Name : Student

Table name : Student_detail

id	name	dept
1	siva	cse

Required output :

Database Name : CSE

Table name : New_tudent_detail

s_id	s_name	s_dept
1	siva	cse

I want to migrate the Student_detail Hive table into New_tudent_detail using Spark, without data loss. The target has:

Different column names

Different database

Different table name

Re: Migrating from one Hive table to another Hive table using Spark, with different column names and database, on the same cluster

mlamairesse — Tue, 08 Nov 2016 03:41:37 GMT

Hi @Sivasaravanakumar K

Here's one way of going about this.

Note that the example below is based on the sample data available on the Hortonworks sandbox. Just change the database, table and column names to suit your needs.

0. Get database and table info

//show databases in Hive
sqlContext.sql("show databases").show

//show table in a database
sqlContext.sql("show tables in default").show

//read the table headers
sqlContext.sql("select * from default.sample_07").printSchema

result

+--------+
|  result|
+--------+
| default|
|foodmart|
|  xademo|
+--------+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|sample_07|      false|
|sample_08|      false|
+---------+-----------+

root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- total_emp: integer (nullable = true)
 |-- salary: integer (nullable = true)

1. Read table data into a DataFrame :

// read data from Hive
val df = sqlContext.sql("select * from default.sample_07")
//Show Table Schema 
df.printSchema

result

root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- total_emp: integer (nullable = true)
 |-- salary: integer (nullable = true)

2. Change column names

Change a single column name with the withColumnRenamed function

val df_renamed = df.withColumnRenamed("salary", "money") 
df_renamed.printSchema

Or rename all columns at once using a list of headers

val newNames = Seq("code_1", "description_1", "total_emp_1", "money_1") 
val df_renamed = df.toDF(newNames: _*) 
df_renamed.printSchema

Note that you can combine the read and the rename in a single statement, so you do not keep two DataFrames around

val newNames = Seq("code_1", "description_1", "total_emp_1", "money_1") 
val df = sqlContext.sql("select * from default.sample_07").toDF(newNames: _*)

Or all at once using SQL aliases (preferred)

val df = sqlContext.sql("select code as code_1, description as description_1, total_emp as total_emp_1, salary as money from default.sample_07") 

df.printSchema

result (using SQL alias)

df: org.apache.spark.sql.DataFrame = [code_1: string, description_1: string, total_emp_1: int, money: int]
root
 |-- code_1: string (nullable = true)
 |-- description_1: string (nullable = true)
 |-- total_emp_1: integer (nullable = true)
 |-- money: integer (nullable = true)
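For the tables in the original question, the alias query could be generated from a list of (old, new) column pairs instead of being typed by hand. This is only a sketch: ColumnAliases and selectWithAliases are made-up names, and it assumes plain column identifiers that need no quoting.

```scala
object ColumnAliases {
  // Builds "select old1 as new1, old2 as new2, ... from db.table"
  // from (oldName, newName) pairs. Hypothetical helper, not a Spark API.
  def selectWithAliases(db: String, table: String, renames: Seq[(String, String)]): String = {
    val cols = renames.map { case (from, to) => s"$from as $to" }.mkString(", ")
    s"select $cols from $db.$table"
  }
}

// Usage in spark-shell, with the names from the original question:
//   val query = ColumnAliases.selectWithAliases(
//     "Student", "Student_detail",
//     Seq("id" -> "s_id", "name" -> "s_name", "dept" -> "s_dept"))
//   // query: select id as s_id, name as s_name, dept as s_dept from Student.Student_detail
//   val df = sqlContext.sql(query)
```

Keeping the mapping in one Seq makes it easy to check that every source column is listed, which matters if no data should be lost.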

3. Save back to Hive

//write to Hive (in ORC format) 
df.write.format("orc").saveAsTable("default.sample_07_new_schema") 

//read back and check new_schema
sqlContext.sql("select * from default.sample_07_new_schema").printSchema

result

root
 |-- code_1: string (nullable = true)
 |-- description_1: string (nullable = true)
 |-- total_emp_1: integer (nullable = true)
 |-- money: integer (nullable = true)
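The example above writes into the default database. If the target lives in another database, as in the original question, one possible approach is to let Hive do the copy with a CREATE TABLE ... AS SELECT statement. The sketch below only builds the two statements; HiveMigration and migrationStatements are made-up names, it assumes sqlContext is a HiveContext (as in the sandbox spark-shell), and the new table gets whatever storage format Hive uses by default rather than ORC.

```scala
object HiveMigration {
  // Returns the HiveQL statements to run, in order:
  //   1. create the target database if it is missing
  //   2. create the target table from the given select (CTAS)
  // Hypothetical helper, not part of Spark or Hive.
  def migrationStatements(targetDb: String, targetTable: String, selectSql: String): Seq[String] =
    Seq(
      s"CREATE DATABASE IF NOT EXISTS $targetDb",
      s"CREATE TABLE $targetDb.$targetTable AS $selectSql"
    )
}

// Usage in spark-shell, with the names from the original question:
//   val select = "select id as s_id, name as s_name, dept as s_dept from Student.Student_detail"
//   HiveMigration.migrationStatements("CSE", "New_tudent_detail", select)
//     .foreach(stmt => sqlContext.sql(stmt))
```

Comparing the row counts of the source and target tables afterwards is a cheap check that nothing was dropped.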

Re: Migrating from one Hive table to another Hive table using Spark, with different column names and database, on the same cluster

hadoopsmi — Tue, 08 Nov 2016 15:29:50 GMT

Hi @Matthieu Lamairesse

Error :

scala> df.write.format("orc").saveAsTable("default.sample_07_new_schema")
<console>:33: error: value write is not a member of org.apache.spark.sql.DataFrame
              df.write.format("orc").saveAsTable("default.sample_07_new_schema")

Re: Migrating from one Hive table to another Hive table using Spark, with different column names and database, on the same cluster

mlamairesse — Tue, 08 Nov 2016 19:37:23 GMT

Hi @Sivasaravanakumar K

I've simplified my answer a bit. Which version of Spark are you using? This was tested on Spark 1.6.2 on an HDP 2.5 sandbox.

Note: when using spark-shell, did you import the following?

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

Re: Migrating from one Hive table to another Hive table using Spark, with different column names and database, on the same cluster

hadoopsmi — Wed, 09 Nov 2016 14:11:41 GMT

I have already imported:

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

I still have the same issue. I am using HDP 2.3.

Re: Migrating from one Hive table to another Hive table using Spark, with different column names and database, on the same cluster

mlamairesse — Thu, 10 Nov 2016 01:23:08 GMT

Hmm, which version of Spark?

1.3.1 => HDP 2.3.0

1.4.1 => HDP 2.3.2

1.5.2 => HDP 2.3.4

I have a feeling it's Spark 1.3. Major improvements to the Spark <=> Hive integration arrived starting with Spark 1.4.1.

Re: Migrating from one Hive table to another Hive table using Spark, with different column names and database, on the same cluster

mlamairesse — Thu, 10 Nov 2016 02:19:06 GMT

Hi @Sivasaravanakumar K

The write function was introduced in Spark 1.4 (HDP 2.3.2 ships 1.4.1), so it is not available on Spark 1.3.

Try simply :

df.saveAsTable("default.sample_07_new_schema")

It will be saved as Parquet (Spark's default data source format).
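If you are not sure which of the two calls your cluster supports, sc.version in spark-shell prints the Spark version. The sketch below turns that string into a yes/no answer; SparkVersionCheck and hasWriteApi are made-up names, and the 1.4 cutoff is the one discussed in this thread. It is only a diagnostic: on Spark 1.3 a line containing df.write will not compile, so both calls cannot sit in the same piece of code.

```scala
object SparkVersionCheck {
  // Parses a "major.minor[.patch]" version string and reports whether
  // DataFrame.write is expected to exist (Spark 1.4 or later).
  // Returns false when the string cannot be parsed.
  def hasWriteApi(version: String): Boolean = {
    val parts = version.split("\\.").take(2).map(_.takeWhile(_.isDigit))
    parts match {
      case Array(major, minor) if major.nonEmpty && minor.nonEmpty =>
        val (ma, mi) = (major.toInt, minor.toInt)
        ma > 1 || (ma == 1 && mi >= 4)
      case _ => false
    }
  }
}

// Usage in spark-shell:
//   SparkVersionCheck.hasWriteApi(sc.version)
//   true  -> df.write.format("orc").saveAsTable(...)
//   false -> df.saveAsTable(...)
```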