Created 11-02-2016 01:29 PM
Hive Table:
Original table
Database Name : Student
Table name : Student_detail
| id | name | dept |
| 1 | siva | cse |
Need Output :
Database Name : CSE
Table name : New_tudent_detail
| s_id | s_name | s_dept |
| 1 | siva | cse |
I want to migrate the Student_detail Hive table into New_tudent_detail without data loss, using Spark:
Different column names
Different database
Different table
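Putting the question in code form, a minimal sketch of the whole migration, assuming sqlContext is a HiveContext, the target database CSE already exists, and Spark 1.4+ (the df.write API is not available earlier); table and column names are taken from the question:

```scala
// Read the source table, renaming columns with SQL aliases in one pass
val df = sqlContext.sql(
  """select id   as s_id,
    |       name as s_name,
    |       dept as s_dept
    |  from Student.Student_detail""".stripMargin)

// Write into the target database/table (ORC format); needs Spark 1.4+
df.write.format("orc").saveAsTable("CSE.New_tudent_detail")
```

Because the rename happens inside the select, no intermediate copy of the data is materialized before the write.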
Created 11-07-2016 07:41 PM
Here's one way of going about this.
Note the example below is based on the sample data available on the Hortonworks sandbox. Just change the database, table and column names to suit your needs.
0. Get database and table info
//show databases in Hive
sqlContext.sql("show databases").show
//show table in a database
sqlContext.sql("show tables in default").show
//read the table headers
sqlContext.sql("select * from default.sample_07").printSchema
result
+--------+
|  result|
+--------+
| default|
|foodmart|
|  xademo|
+--------+

+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|sample_07|      false|
|sample_08|      false|
+---------+-----------+

root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- total_emp: integer (nullable = true)
 |-- salary: integer (nullable = true)
1. Read table data into a DataFrame :
// read data from Hive
val df = sqlContext.sql("select * from default.sample_07")
//Show Table Schema
df.printSchema
result
root
 |-- code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- total_emp: integer (nullable = true)
 |-- salary: integer (nullable = true)
2. Change column names
Change a single column name with the withColumnRenamed function
val df_renamed = df.withColumnRenamed("salary", "money")
df_renamed.printSchema
Or all at once using a list of headers:
val newNames = Seq("code_1", "description_1", "total_emp_1", "money_1")
val df_renamed = df.toDF(newNames: _*)
df_renamed.printSchema
Note you can combine the read and the rename together so as not to create 2 sets of data in memory:
val newNames = Seq("code_1", "description_1", "total_emp_1", "money_1")
val df = sqlContext.sql("select * from default.sample_07").toDF(newNames: _*)
Or all at once using SQL aliases (preferred):
val df = sqlContext.sql("select code as code_1, description as description_1, total_emp as total_emp_1, salary as money from default.sample_07")
df.printSchema
result (using SQL aliases)
df: org.apache.spark.sql.DataFrame = [code_1: string, description_1: string, total_emp_1: int, money: int]

root
 |-- code_1: string (nullable = true)
 |-- description_1: string (nullable = true)
 |-- total_emp_1: integer (nullable = true)
 |-- money: integer (nullable = true)
3. Save back to hive
//write to Hive (in ORC format)
df.write.format("orc").saveAsTable("default.sample_07_new_schema")
//read back and check new_schema
sqlContext.sql("select * from default.sample_07_new_schema").printSchema
result
root
 |-- code_1: string (nullable = true)
 |-- description_1: string (nullable = true)
 |-- total_emp_1: integer (nullable = true)
 |-- money: integer (nullable = true)
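The question also asks for the new table to live in a different database. The same write can target a database-qualified table name; a short sketch, assuming a HiveContext and Spark 1.4+ (the database name cse_new is illustrative, not from the sandbox):

```scala
// Create the target database if it does not exist yet,
// then write the renamed DataFrame into it (ORC format)
sqlContext.sql("create database if not exists cse_new")
df.write.format("orc").saveAsTable("cse_new.sample_07_new_schema")

// Verify the table landed in the new database
sqlContext.sql("show tables in cse_new").show
```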
Created 11-08-2016 07:29 AM
Hi @Matthieu Lamairesse
Error :
scala> df.write.format("orc").saveAsTable("default.sample_07_new_schema")
<console>:33: error: value write is not a member of org.apache.spark.sql.DataFrame
              df.write.format("orc").saveAsTable("default.sample_07_new_schema")
                 ^
Created 11-08-2016 11:37 AM
I've simplified my answer a bit. Which version of Spark are you using? This was tested on Spark 1.6.2 on an HDP 2.5 sandbox.
Note : When using spark-shell did you import :
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
Created 11-09-2016 06:11 AM
I already imported:
import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql._
but I still have the same issue. I am using HDP 2.3.
Created 11-09-2016 05:23 PM
Hmm, which version of Spark?
1.3.1 => HDP 2.3.0
1.4.1 => HDP 2.3.2
1.5.2 => HDP 2.3.4
I have a feeling it's Spark 1.3; they made some major improvements in Spark <=> Hive integration starting with Spark 1.4.1.
Created 11-09-2016 06:19 PM
The write function was implemented in 1.4.1...
Try simply :
df.saveAsTable("default.sample_07_new_schema")
It will be saved as Parquet (the default format for Spark).
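If ORC output is still wanted on Spark 1.3, the pre-1.4 DataFrame API offered saveAsTable overloads that take the source format directly; a hedged sketch (verify the overload against your exact 1.3.x build, and note that the "orc" data source on Spark 1.3 typically relies on the separate Hortonworks ORC package that the import above provides; if it is unavailable, "parquet" is a safe source):

```scala
import org.apache.spark.sql.SaveMode

// Spark 1.3-era API: saveAsTable(tableName, source, mode)
// (the DataFrameWriter-based df.write API only arrived in 1.4)
df.saveAsTable("default.sample_07_new_schema", "orc", SaveMode.ErrorIfExists)
```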