Migrating from one Hive table to another Hive table using Spark, with a different column name and database, on the same cluster
Labels:
- Apache Hive
- Apache Spark
Created ‎11-02-2016 01:29 PM
Hive tables:

Original table
Database name: Student
Table name: Student_detail

| id | name | dept |
| 1  | siva | cse  |

Needed output
Database name: CSE
Table name: New_tudent_detail

| s_id | s_name | s_dept |
| 1    | siva   | cse    |

I want to migrate the Student_detail Hive table into New_tudent_detail, without data loss, using Spark, with:
- a different column name
- a different database
- a different table
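For these specific tables, a minimal sketch of the whole migration could look like the following (assuming a HiveContext-backed sqlContext, Spark 1.4.1 or later for the DataFrame writer API, and that the target database CSE already exists):

```scala
// Assumes spark-shell with a Hive-enabled sqlContext and an existing CSE database.
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Read the source table, renaming the columns with SQL aliases
val df = sqlContext.sql(
  "select id as s_id, name as s_name, dept as s_dept from Student.Student_detail")

// Write into the new database/table in ORC format
df.write.format("orc").saveAsTable("CSE.New_tudent_detail")
```

This copies all rows, so nothing is lost; the original Student.Student_detail table is left untouched.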
Created ‎11-07-2016 07:41 PM
Here's one way of going about this.
Note: the example below is based on the sample data available on the Hortonworks sandbox. Just change the database, table, and column names to suit your needs.
0. Get database and table info
//show databases in Hive
sqlContext.sql("show databases").show
//show tables in a database
sqlContext.sql("show tables in default").show
//read the table schema
sqlContext.sql("select * from default.sample_07").printSchema
result
+--------+
|  result|
+--------+
| default|
|foodmart|
|  xademo|
+--------+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|sample_07|      false|
|sample_08|      false|
+---------+-----------+

root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- total_emp: integer (nullable = true)
 |-- salary: integer (nullable = true)
1. Read table data into a DataFrame:
// read data from Hive
val df = sqlContext.sql("select * from default.sample_07")
//show table schema
df.printSchema
result
root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- total_emp: integer (nullable = true)
 |-- salary: integer (nullable = true)
2. Change column names
Change a single column name with the withColumnRenamed function:
val df_renamed = df.withColumnRenamed("salary", "money")
df_renamed.printSchema
Or all at once, using a list of headers:
val newNames = Seq("code_1", "description_1", "total_emp_1", "money_1")
val df_renamed = df.toDF(newNames: _*)
df_renamed.printSchema
Note: you can chain the read and the rename together so as not to create two sets of data in memory:
val newNames = Seq("code_1", "description_1", "total_emp_1", "money_1")
val df = sqlContext.sql("select * from default.sample_07").toDF(newNames: _*)
Or all at once, using SQL aliases (** preferred):
val df = sqlContext.sql("select code as code_1, description as description_1, total_emp as total_emp_1, salary as money from default.sample_07")
df.printSchema
result (using SQL alias)
df: org.apache.spark.sql.DataFrame = [code_1: string, description_1: string, total_emp_1: int, money: int]

root
 |-- code_1: string (nullable = true)
 |-- description_1: string (nullable = true)
 |-- total_emp_1: integer (nullable = true)
 |-- money: integer (nullable = true)
3. Save back to Hive
//write to Hive (in ORC format)
df.write.format("orc").saveAsTable("default.sample_07_new_schema")
//read back and check the new schema
sqlContext.sql("select * from default.sample_07_new_schema").printSchema
result
root
 |-- code_1: string (nullable = true)
 |-- description_1: string (nullable = true)
 |-- total_emp_1: integer (nullable = true)
 |-- money: integer (nullable = true)
Created ‎11-08-2016 07:29 AM
Hi @Matthieu Lamairesse
Error:

scala> df.write.format("orc").saveAsTable("default.sample_07_new_schema")
<console>:33: error: value write is not a member of org.apache.spark.sql.DataFrame
       df.write.format("orc").saveAsTable("default.sample_07_new_schema")
          ^
Created ‎11-08-2016 11:37 AM
I've simplified my answer a bit. What version of Spark are you using? This was tested on Spark 1.6.2 on an HDP 2.5 sandbox.
Note: when using spark-shell, did you import the following?

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
Created ‎11-09-2016 06:11 AM
I already imported:

import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._

I still have the same issue. I am using HDP 2.3.
Created ‎11-09-2016 05:23 PM
Hmm, which version of Spark?

1.3.1 => HDP 2.3.0
1.4.1 => HDP 2.3.2
1.5.2 => HDP 2.3.4

I have a feeling it's Spark 1.3; major improvements were made to the Spark <=> Hive integration starting with Spark 1.4.1.
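You can confirm the exact Spark version directly from the shell, which will settle which of the HDP-bundled releases above you're on:

```scala
// In spark-shell, sc is the predefined SparkContext; this prints the running Spark version
println(sc.version)
```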
Created ‎11-09-2016 06:19 PM
The write function was implemented in 1.4.1...
Try simply:

df.saveAsTable("default.sample_07_new_schema")

It will be saved as Parquet (the default format for Spark).
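If the DataFrame writer API isn't available on your Spark version, one possible workaround (a sketch, assuming the target database already exists and your HiveContext accepts CREATE TABLE ... AS SELECT) is to do the rename-and-copy entirely in HiveQL:

```scala
// CREATE TABLE ... AS SELECT is executed by Hive itself, so it does not
// depend on the DataFrame writer API introduced in Spark 1.4.1.
sqlContext.sql(
  """CREATE TABLE CSE.New_tudent_detail AS
     SELECT id AS s_id, name AS s_name, dept AS s_dept
     FROM Student.Student_detail""")
```

The column renaming happens in the SELECT aliases, and the data never has to round-trip through a DataFrame save.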
