Created 07-20-2016 03:24 AM
I Sqooped the data from SQL Server and stored it in Hive as an ORC table, testtable, in a data warehouse database. I read the data into a dataframe using Spark, added a column with withColumn, and issued an ALTER to add the column:
alter table testtable add columns (PatMD5 VARCHAR(50)) using hiveContext.sql, which does change the table. I then saved the dataframe using the following:
dataframe.write.format("orc").mode(SaveMode.Overwrite).save("testtable")
I am able to save the file as ORC, but when I try to query it using Hue or Beeline, I get the following error:
ORC does not support type conversion from STRING to VARCHAR
I tried with
alter table testtable add columns (PatMD5 STRING)
Again I am able to save the file as ORC but not able to query it from Hive. Can anyone help?
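For reference, the whole sequence looks roughly like this (a sketch only; hiveContext is an existing HiveContext, the current database is assumed to be the warehouse database, and the expression used to populate PatMD5 is just a placeholder, not my real logic):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._

// Read the Sqoop-imported ORC table into a dataframe
val df = hiveContext.table("testtable")

// Add the new column (placeholder expression for illustration only)
val dfWithMd5 = df.withColumn("PatMD5", md5(concat_ws("", col("pat_last"), col("pat_first"))))

// Add the column to the Hive table definition
hiveContext.sql("alter table testtable add columns (PatMD5 VARCHAR(50))")

// Overwrite the data as ORC
dfWithMd5.write.format("orc").mode(SaveMode.Overwrite).save("testtable")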
thanks in advance
Ram
Created 07-20-2016 05:01 AM
It seems ORC doesn't support schema evolution.
Created 07-20-2016 12:36 PM
ORC schema evolution was added in ORC 1.1, which is not available with HDP 2.4 or older.
Created 07-20-2016 07:49 PM
Can you please run SHOW CREATE TABLE testtable? That will show us exactly how your columns are defined and the ORC format.
Created 07-22-2016 04:29 PM
I executed the above statement and identified that we created the table with
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'last_modified_by'='hdfs',
  'last_modified_time'='1469026541',
  'numFiles'='1',
  'numRows'='-1',
  'orc.compress'='SNAPPY',
  'rawDataSize'='-1',
  'totalSize'='11144909',
  'transient_lastDdlTime'='1469026541')
I noticed that when I originally stored the ORC file I did not provide a compression option, so I used option("compression", "snappy") while saving the file, but it appears the compression is not being applied. Can you please help?
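The write call I am trying looks roughly like this (a sketch only; which option key the ORC writer actually honors seems to depend on the Spark version, so the keys below are assumptions to experiment with rather than a confirmed fix):

import org.apache.spark.sql.SaveMode

// Attempt to write ORC with Snappy compression.
// "compression" is the key documented for newer Spark releases;
// "orc.compress" is the Hive-style property name and may be what an older
// release expects instead (assumption, not verified here).
dataframe.write
  .format("orc")
  .option("compression", "snappy")
  // .option("orc.compress", "SNAPPY")   // alternative key to try
  .mode(SaveMode.Overwrite)
  .save("testtable")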
thanks
Ram
Created 07-22-2016 06:03 PM
Here are the details:
a) The following is the SHOW CREATE TABLE result (this table was created with Spark SQL):
CREATE TABLE `testtabletmp1`(
  `person_key` bigint,
  `pat_last` string,
  `pat_first` string,
  `pat_dob` timestamp,
  `pat_zip` string,
  `pat_gender` string,
  `pat_chksum1` bigint,
  `pat_chksum2` bigint,
  `dimcreatedgmt` timestamp,
  `pat_mi` string,
  `h_keychksum` string,
  `patmd5` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://hdp-cent7-01:8020/apps/hive/warehouse/datawarehouse.db/testtabledimtmp1'
TBLPROPERTIES (
  'orc.compress'='SNAPPY',
  'transient_lastDdlTime'='1469207216')
b) The original table created when we Sqooped the data from SQL Server using Sqoop import:
CREATE TABLE `testtabledim`(
  `person_key` bigint,
  `pat_last` varchar(35),
  `pat_first` varchar(35),
  `pat_dob` timestamp,
  `pat_zip` char(5),
  `pat_gender` char(1),
  `pat_chksum1` bigint,
  `pat_chksum2` bigint,
  `dimcreatedgmt` timestamp,
  `pat_mi` char(1),
  `h_keychksum` string,
  `patmd5` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://hdp-cent7-01:8020/apps/hive/warehouse/datawarehouse.db/testtabledim'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'last_modified_by'='hdfs',
  'last_modified_time'='1469026541',
  'numFiles'='1',
  'numRows'='-1',
  'orc.compress'='SNAPPY',
  'rawDataSize'='-1',
  'totalSize'='11144909',
  'transient_lastDdlTime'='1469026541')
If I use the first script with Spark SQL and store the file as ORC with Snappy compression, it works. If I store the ORC file with Snappy compression and use Hive to create the table with script a), it also works fine. But if I alter an existing table to add a new column using the Spark HiveContext and save as ORC with Snappy compression, I get the error "ORC does not support type conversion from STRING to VARCHAR". If I use the same ORC files but have Hive create the table with the second query (b), I still get the same error.
I noticed some columns are defined as VARCHAR(35) and I think those columns may be the issue.
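A quick way to see the mismatch is to print the schema Spark actually loads: Spark reads Hive VARCHAR/CHAR columns back as plain strings, so the ORC files it writes out carry STRING while the table still declares VARCHAR(35)/CHAR(5). Roughly (a sketch, assuming the warehouse database is the current one):

// Load the original Sqoop-imported table and inspect the schema Spark sees
val df = hiveContext.table("testtabledim")
df.printSchema()
// Illustrative (abridged) output:
//  |-- person_key: long (nullable = true)
//  |-- pat_last: string (nullable = true)   // declared varchar(35) in Hive
//  |-- pat_zip: string (nullable = true)    // declared char(5) in Hive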
After I changed VARCHAR to STRING and CHAR to STRING, it worked fine. I am still investigating the best way to handle VARCHAR/CHAR types through a Spark dataframe.
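In case it helps anyone else, the working definition is essentially the original DDL with the VARCHAR/CHAR columns declared as STRING (a sketch; the table name testtabledim_str is only illustrative, and I am issuing the DDL through hiveContext.sql just to keep everything in one place):

// Recreate the table with STRING in place of VARCHAR(35)/CHAR(n) so the
// declared schema matches the string columns in the Spark-written ORC files
hiveContext.sql("""
  CREATE TABLE testtabledim_str (
    person_key BIGINT,
    pat_last STRING,
    pat_first STRING,
    pat_dob TIMESTAMP,
    pat_zip STRING,
    pat_gender STRING,
    pat_chksum1 BIGINT,
    pat_chksum2 BIGINT,
    dimcreatedgmt TIMESTAMP,
    pat_mi STRING,
    h_keychksum STRING,
    patmd5 STRING)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='SNAPPY')
""")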
Please let me know if you need more information.
Thank you for your help.
Created 07-21-2016 09:06 PM
Hi, thank you for your reply. I will post the results. However, here are the steps I followed:
a) Loaded the data from the existing table testtable into a dataframe using the HiveContext
b) Added a column to the dataframe using withColumn
c) Created the new table (testtabletmp), including the new column, using Spark SQL, stored as ORC
d) Saved the dataframe as ORC: dataframe.write.format("orc").save("testtabletmp") (sketched below)
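Roughly, in code (a sketch only; the PatMD5 expression is just a placeholder and the column list is illustrative):

import org.apache.spark.sql.functions._

// a) Load the existing table through the HiveContext
val df = hiveContext.table("testtable")

// b) Add the new column (placeholder expression)
val dfNew = df.withColumn("PatMD5", md5(concat_ws("", col("pat_last"), col("pat_first"))))

// c) Create the new table, including the new column, stored as ORC
hiveContext.sql("""
  CREATE TABLE testtabletmp (
    person_key BIGINT, pat_last STRING, pat_first STRING, pat_dob TIMESTAMP,
    pat_zip STRING, pat_gender STRING, pat_chksum1 BIGINT, pat_chksum2 BIGINT,
    dimcreatedgmt TIMESTAMP, pat_mi STRING, h_keychksum STRING, patmd5 STRING)
  STORED AS ORC
""")

// d) Save the dataframe as ORC
dfNew.write.format("orc").save("testtabletmp")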
With the above steps, I am able to access the table from Hive. I will post the results related to SHOW CREATE TABLE testtable tomorrow.
thanks
Ram
Created 07-24-2017 12:06 PM
I suggest you read this topic. It might be helpful.