Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Case and accent insensitive ?

avatar
New Contributor

Hi there,

 

Mi issue is very basic, I'm looking for the equivalent function in Hive of this SQL function => 

COLLATE SQL_Latin1_General_CP1_CI_AI

It's very important for me because I work on towns like Sao Paulo which have different writing versions. I have an old topic regarding this subject on stackoverflow, which unfortunately didn't get any interesting answer.... 

 

Could you provide me help please ?

 

Best regards,

Nicolas.

2 REPLIES 2

avatar
Expert Contributor

You may set the character set used by Hive, for a given table, with the Table SerDe property "serialization.encoding".  Take a look at the follow JIRA for an example on how to use it:

 

https://issues.apache.org/jira/browse/HIVE-12653

 

If you would like to use the MultiDelimitSerDe class, referenced in HIVE-12653, this serialization feature is available starting in CDH 5.10.

 

https://www.cloudera.com/documentation/enterprise/release-notes/topics/cdh_rn_fixed_in_510.html

 

 

The valid Character Sets are discussed in the following link.  In particular, take a look at the "Standard charsets."

 

https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html

 

avatar
New Contributor

HI Nicolas,

Did you get any resolution for this ?
I also have a similar requirement where I need to do case-insensitive joins across the system and I don't wish to apply upper/lower functions. 
I tried setting TBLPROPERTIES('serialization.encoding'='utf8mb4_unicode_ci') at Table level but still the comparison is happening considering case sensitivity.  PSB -

drop table test.caseI;
create table test.caseI
(name string, id int)
TBLPROPERTIES('serialization.encoding'='utf8mb4_unicode_ci');
insert into test.caseI values ('hj',1);

drop table test.caseI_2;
create table test.caseI_2
(name string, id int)
TBLPROPERTIES('serialization.encoding'='utf8mb4_unicode_ci');
insert into test.caseI_2 values ('HJ',1);


select * from test.caseI i
inner join test.caseI_2 i2 on i.name=i2.name;
--No Result
Tried with encoding 'SQL_Latin1_General_CP1_CI_AI' but got same result as above.

Any help would be appreciated, thanks!