question Re: SPARK, Hive : ORC does not support type conversion from STRING to VARCHAR in Support Questions

SPARK, Hive : ORC does not support type conversion from STRING to VARCHAR

ram_pratapa — Wed, 20 Jul 2016 10:24:46 GMT

I Sqooped the data from SQL server and stored the data in Hive in ORC file in a data warehouse as table testtable. I read the data using spark into a dataframe. Added a column using withColumn to dataframe and issued an alter to add the column

alter table testtable add columns (PatMD5 VARCHAR(50) using hiveContext.sql and it is changing the table and I saved the dataframe using the following

dataframe.write.format("orc").mode(SaveMode.Overwrite).save("testtable")

I am able to save the file into ORC. But when I tried to query using Hue or Beeline, I am getting the following error

ORC does not support type conversion from STRING to VARCHAR

I tried with

alter table testtable add columns (PatMD5 STRING)

I am able to save the file in ORC but not able to query from hive. Can any one help.

thanks in advance

Ram

Re: SPARK, Hive : ORC does not support type conversion from STRING to VARCHAR

rajkumar_singh — Wed, 20 Jul 2016 12:01:52 GMT

it seems ORC dont support schema evolution

Re: SPARK, Hive : ORC does not support type conversion from STRING to VARCHAR

aervits — Wed, 20 Jul 2016 19:36:52 GMT

ORC schema evolution is added in V. 1.1 which is not available with HDP 2.4 or older.

Re: SPARK, Hive : ORC does not support type conversion from STRING to VARCHAR

don_jernigan — Thu, 21 Jul 2016 02:49:43 GMT

Can you please do a SHOW CREATE TABLE testtable: ? That will show us exactly how your columns are defined and the ORC format.

Re: SPARK, Hive : ORC does not support type conversion from STRING to VARCHAR

ram_pratapa — Fri, 22 Jul 2016 04:06:29 GMT

Hi, thank you for your reply. I will post the results. However I followed these steps.

a) Loaded the data from existing table testtable into dataframe using HiveContext

b) Added a column using withColumn to dataframe

c) Created the new table (testtabletmp) using Spark SQL with new column that saves as ORC

d) Save the data frame as ORC dataframe.write.format("orc").save("testtabletmp")

With the above steps, I am able to access the table from Hive. I will post the results related to SHOW CREATE TABLE testtable tomorrow.

thanks

Ram

Re: SPARK, Hive : ORC does not support type conversion from STRING to VARCHAR

ram_pratapa — Fri, 22 Jul 2016 23:29:43 GMT

I executed the above statement and I indentified that we created a table with

I noticed that while storing ORC file I did not provide compress option and I used option("compression", "snappy") while saving the file and it appears the compression is not working. can you please help.

thanks

Ram

Re: SPARK, Hive : ORC does not support type conversion from STRING to VARCHAR

ram_pratapa — Sat, 23 Jul 2016 01:03:44 GMT

Here are the details :

a) The following is the show create table testtable results ( this table is created with Spark SQL

CREATE TABLE `testtabletmp1`( `person_key` bigint, `pat_last` string, `pat_first` string, `pat_dob` timestamp, `pat_zip` string, `pat_gender` string, `pat_chksum1` bigint, `pat_chksum2` bigint, `dimcreatedgmt` timestamp, `pat_mi` string, `h_keychksum` string, `patmd5` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 'hdfs://hdp-cent7-01:8020/apps/hive/warehouse/datawarehouse.db/testtabledimtmp1' | TBLPROPERTIES ( 'orc.compress'='SNAPPY', 'transient_lastDdlTime'='1469207216')

2. The original table create when we scooped the data from SQL server using SQOOP import

CREATE TABLE `testtabledim`( `person_key` bigint, `pat_last` varchar(35), `pat_first` varchar(35), `pat_dob` timestamp, `pat_zip` char(5), `pat_gender` char(1), `pat_chksum1` bigint, `pat_chksum2` bigint, `dimcreatedgmt` timestamp, `pat_mi` char(1), `h_keychksum` string, `patmd5` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' LOCATION 'hdfs://hdp-cent7-01:8020/apps/hive/warehouse/datawarehouse.db/testtabledim' TBLPROPERTIES ( 'COLUMN_STATS_ACCURATE'='false', 'last_modified_by'='hdfs', 'last_modified_time'='1469026541', 'numFiles'='1', 'numRows'='-1', 'orc.compress'='SNAPPY', 'rawDataSize'='-1', 'totalSize'='11144909', 'transient_lastDdlTime'='1469026541')

If use the first script using spark sql and store the file as ORC with snappy compression it is working. if I store ORC file with snappy compression and use hive to create table using script 1 then it is working fine. But I use an existing table alter table with a new coulmn using the Spark Hive context and save as ORC with snappy compression, I am getting the following error ORC does not support type conversion from STRING to VARCHAR. if use the same ORC but use hive to create a table using second query even then I am getting the same error.

I noticed some columns are defined as VARCHAR(35) and I think those columns may be the issue.

After I made the change from VARCHAR to String and CHAR to String, it worked fine. I am still investigating what is the best way to handle VARCHAR/CHAR types through Spark dataframe.

Please let me know if you need more information.

Thank you for your help.

Re: SPARK, Hive : ORC does not support type conversion from STRING to VARCHAR

alisa_houskova — Mon, 24 Jul 2017 19:06:23 GMT

I suggest to read you this topic. Might be helpful.