Support Questions
Find answers, ask questions, and share your expertise

UTF-8 hive

Contributor

Does Hive support UTF-8 encoding by default? If not, how do I make the entire Hive database support UTF-8 encoding? I am running into an issue while transferring SQL Server tables to Hive: I am seeing corrupted strings. I know that I can alter a Hive table to set serialization.encoding to UTF-8, but is there a way to set the entire Hive database to UTF-8? Any help would be appreciated.

Thanks

5 REPLIES

Contributor

@Michael Young Any thoughts?

@Praneender Vuppala

Did you try the following format while creating the Hive table?

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8');
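For context, here is a minimal sketch of how that clause fits into a full CREATE TABLE statement. The table and column names are illustrative placeholders, not from the original question:

```sql
-- Hypothetical table; names are illustrative only.
-- The SERDEPROPERTIES clause tells LazySimpleSerDe to decode
-- the underlying text files as UTF-8.
CREATE TABLE customers_utf8 (
  id INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8')
STORED AS TEXTFILE;
```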

Also please see: https://community.hortonworks.com/questions/54162/why-hive-is-not-able-to-store-special-characters-l...

@Praneender Vuppala

Hive does support UTF-8 encoding of data. As @jk has shown, you can create the table using the LazySimpleSerDe. You can read more about Hive's UTF-8 support here:

Hive User FAQ

You can use Unicode string on data/comments, but cannot use for database/table/column name.

You can use UTF-8 encoding for Hive data. However, other encodings are not fully supported (HIVE-7142 introduced an encoding option for LazySimpleSerDe, but the implementation is incomplete and does not address all cases).

Hive's default encoding is UTF-8, so setting serialization.encoding to UTF-8 on a file that is already UTF-8 is unnecessary. If you are seeing corrupted strings, there is a high probability that your input file uses another character set. In that case, set 'serialization.encoding' to the encoding of the input file. A quick search shows that the default charset of SQL Server is ISO-8859-1 (alias latin1), so you can try 'serialization.encoding'='ISO-8859-1'. For examples see my recent article on Hive charsets.
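If the table already exists, one way to apply this is with ALTER TABLE. This is a sketch; the table name `staging_table` is a placeholder, and it assumes the table was declared with LazySimpleSerDe:

```sql
-- Tell the SerDe that the underlying text files are ISO-8859-1 (latin1),
-- so Hive transcodes them to UTF-8 when reading.
ALTER TABLE staging_table
  SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1');
```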

New Contributor

Not working fine for