
Spark, Hive: ORC does not support type conversion from STRING to VARCHAR

Rising Star

I Sqooped data from SQL Server and stored it in Hive as an ORC table named testtable in a data warehouse database. I read the data into a Spark DataFrame, added a column using withColumn, and then issued an ALTER via hiveContext.sql to add the matching column to the table:

alter table testtable add columns (PatMD5 VARCHAR(50))

This changed the table definition, and I then saved the DataFrame using the following:

dataframe.write.format("orc").mode(SaveMode.Overwrite).save("testtable")
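
Put together, the sequence looks roughly like this (a sketch, assuming Spark 1.x with a HiveContext; the lit(...) value is only a placeholder for however PatMD5 is actually computed):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit

// Read the existing Hive table into a DataFrame
val df = hiveContext.table("testtable")

// Add the new column (placeholder value; the real MD5 is computed elsewhere)
val withMd5 = df.withColumn("PatMD5", lit("placeholder"))

// Add the matching column to the Hive table definition
hiveContext.sql("alter table testtable add columns (PatMD5 VARCHAR(50))")

// Overwrite the table data as ORC
withMd5.write.format("orc").mode(SaveMode.Overwrite).save("testtable")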

The file saves to ORC without any problem, but when I query the table from Hue or Beeline I get the following error:

ORC does not support type conversion from STRING to VARCHAR

I tried with

alter table testtable add columns (PatMD5 STRING)

Again the ORC file saves fine, but I still cannot query the table from Hive. Can anyone help?

thanks in advance

Ram

7 REPLIES

Super Guru

It seems ORC doesn't support schema evolution.

Master Mentor

ORC schema evolution was added in ORC 1.1, which is not available with HDP 2.4 or older.

Contributor

Can you please run SHOW CREATE TABLE testtable; and post the output? That will show us exactly how your columns are defined and how the ORC format is configured.

Rising Star

I executed the above statement and identified that we had created the table with

TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'last_modified_by'='hdfs',
  'last_modified_time'='1469026541',
  'numFiles'='1',
  'numRows'='-1',
  'orc.compress'='SNAPPY',
  'rawDataSize'='-1',
  'totalSize'='11144909',
  'transient_lastDdlTime'='1469026541')

I noticed that I had not provided a compression option while storing the ORC file. I then used option("compression", "snappy") when saving the file, but it appears the compression is not being applied. Can you please help?
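
For reference, this is roughly how I passed the option (a sketch; as far as I know, the "compression" option on the ORC writer is only honored by newer Spark versions, while older ones take the codec from Hive settings such as the table's 'orc.compress' property, so it may be silently ignored here):

import org.apache.spark.sql.SaveMode

// Request snappy compression on the ORC writer; on older Spark versions
// this option may be ignored and 'orc.compress'='SNAPPY' has to come
// from the Hive table properties instead
dataframe.write
  .format("orc")
  .option("compression", "snappy")
  .mode(SaveMode.Overwrite)
  .save("testtable")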

thanks

Ram

Rising Star

Here are the details:

a) The following is the SHOW CREATE TABLE output for the table created with Spark SQL:

CREATE TABLE `testtabletmp1`(
  `person_key` bigint,
  `pat_last` string,
  `pat_first` string,
  `pat_dob` timestamp,
  `pat_zip` string,
  `pat_gender` string,
  `pat_chksum1` bigint,
  `pat_chksum2` bigint,
  `dimcreatedgmt` timestamp,
  `pat_mi` string,
  `h_keychksum` string,
  `patmd5` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://hdp-cent7-01:8020/apps/hive/warehouse/datawarehouse.db/testtabledimtmp1'
TBLPROPERTIES (
  'orc.compress'='SNAPPY',
  'transient_lastDdlTime'='1469207216')

b) The original table, created when we Sqooped the data from SQL Server using Sqoop import:

CREATE TABLE `testtabledim`(
  `person_key` bigint,
  `pat_last` varchar(35),
  `pat_first` varchar(35),
  `pat_dob` timestamp,
  `pat_zip` char(5),
  `pat_gender` char(1),
  `pat_chksum1` bigint,
  `pat_chksum2` bigint,
  `dimcreatedgmt` timestamp,
  `pat_mi` char(1),
  `h_keychksum` string,
  `patmd5` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION 'hdfs://hdp-cent7-01:8020/apps/hive/warehouse/datawarehouse.db/testtabledim'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='false',
  'last_modified_by'='hdfs',
  'last_modified_time'='1469026541',
  'numFiles'='1',
  'numRows'='-1',
  'orc.compress'='SNAPPY',
  'rawDataSize'='-1',
  'totalSize'='11144909',
  'transient_lastDdlTime'='1469026541')

If I create the table with script a) through Spark SQL and store the file as ORC with Snappy compression, it works. It also works if I store the ORC file with Snappy compression and create the table in Hive using script a). But if I take the existing table, alter it to add a new column through the Spark HiveContext, and save as ORC with Snappy compression, I get the error: ORC does not support type conversion from STRING to VARCHAR. And if I use the same ORC file but create the table in Hive with script b), I get the same error.

I noticed some columns are defined as VARCHAR(35) and I think those columns may be the issue.

After I changed the VARCHAR and CHAR columns to STRING, it worked fine. I am still investigating the best way to handle VARCHAR/CHAR types through Spark DataFrames.
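
For anyone hitting the same thing, this is roughly the workaround (a sketch: Spark maps VARCHAR/CHAR to its StringType, so I cast the affected columns to plain strings on the DataFrame side before writing, and declared them as STRING in the Hive DDL; the column names are the ones from the DDL above):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Columns declared as VARCHAR/CHAR in the original Sqoop-created table
val charCols = Seq("pat_last", "pat_first", "pat_zip", "pat_gender", "pat_mi")

// Cast each of them to plain STRING so the ORC file schema matches a
// table declared with string columns
val stringified = charCols.foldLeft(dataframe) { (df, c) =>
  df.withColumn(c, col(c).cast("string"))
}

stringified.write.format("orc").mode(SaveMode.Overwrite).save("testtabletmp1")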

Please let me know if you need more information.

Thank you for your help.

Rising Star

Hi, thank you for your reply. I will post the results; in the meantime, these are the steps I followed:

a) Loaded the data from the existing table testtable into a DataFrame using the HiveContext

b) Added a column to the DataFrame using withColumn

c) Created a new table (testtabletmp) through Spark SQL that includes the new column and is stored as ORC

d) Saved the DataFrame as ORC: dataframe.write.format("orc").save("testtabletmp") (see the sketch below)
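
Roughly, in code (a sketch assuming Spark 1.x with a HiveContext; the md5 expression is only a placeholder for how PatMD5 is really derived, and the column list is copied from the DDL posted in this thread):

import org.apache.spark.sql.functions.{col, md5}

// a) Load the existing table through the HiveContext
val df = hiveContext.table("testtable")

// b) Add the new column (placeholder md5 expression)
val df2 = df.withColumn("patmd5", md5(col("pat_last")))

// c) Create the new table, including the new column, stored as ORC
hiveContext.sql("""
  CREATE TABLE testtabletmp (
    person_key BIGINT, pat_last STRING, pat_first STRING, pat_dob TIMESTAMP,
    pat_zip STRING, pat_gender STRING, pat_chksum1 BIGINT, pat_chksum2 BIGINT,
    dimcreatedgmt TIMESTAMP, pat_mi STRING, h_keychksum STRING, patmd5 STRING
  ) STORED AS ORC
""")

// d) Save the DataFrame as ORC
df2.write.format("orc").save("testtabletmp")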

With the above steps, I am able to access the table from Hive. I will post the results related to SHOW CREATE TABLE testtable tomorrow.

thanks

Ram


I suggest you read this topic; it might be helpful.