Does Hive support UTF-8 encoding by default? If it does not, how do I make the entire Hive database support UTF-8? I am seeing corrupted strings while transferring SQL Server tables to Hive. I know I can alter a Hive table and set serialization.encoding to UTF-8, but is there a way to set UTF-8 for the entire Hive database? Any help would be appreciated.
Did you try the following format while creating the Hive table?
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8');
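For context, that clause goes inside a full CREATE TABLE statement, and an existing table can be changed with ALTER TABLE. A sketch (table and column names here are placeholders, not from the question):

```sql
-- Hypothetical table; the SerDe class and property name are the standard Hive ones.
CREATE TABLE example_utf8 (
  id INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8')
STORED AS TEXTFILE;

-- For a table that already exists:
ALTER TABLE example_utf8
SET SERDEPROPERTIES ('serialization.encoding'='UTF-8');
```

Note that the property applies per table; there is no database-wide encoding setting, so each table has to declare it.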
Hive does support UTF-8 encoding of data. As @jk has shown, you can create the table using the LazySimpleSerDe. You can read more about Hive's UTF-8 support here:
You can use Unicode strings in data and comments, but not in database/table/column names. You can use UTF-8 encoding for Hive data; however, other encodings are not supported (HIVE-7142 introduced encoding support for LazySimpleSerDe, but the implementation is incomplete and does not address all cases).
Hive's default encoding is UTF-8, so setting serialization.encoding to UTF-8 on a file that is already UTF-8 is unnecessary. However, if you are running into trouble, there is a high probability that your input file uses a different character set. In that case, set 'serialization.encoding' to the encoding of the input file. A quick search shows that the default charset of SQL Server is ISO-8859-1 (alias latin1), so you can try 'serialization.encoding'='ISO-8859-1'. For examples, see my recent article on Hive charsets.
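The corruption described above is easy to reproduce outside Hive: bytes written in latin1 but decoded as UTF-8 come out garbled. A minimal Python sketch (the sample string is only illustrative):

```python
# A string with a non-ASCII character, as SQL Server might export it in latin1.
text = "café"
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'

# Decoding those bytes as UTF-8 (Hive's default) garbles the text,
# because 0xE9 alone is not a valid UTF-8 byte sequence.
garbled = latin1_bytes.decode("utf-8", errors="replace")
print(garbled)    # caf� (U+FFFD replacement character)

# Telling the reader the true encoding, as 'serialization.encoding'='ISO-8859-1'
# does for Hive, recovers the original string.
recovered = latin1_bytes.decode("latin-1")
print(recovered)  # café
```

This is the same mismatch the table-level SerDe property fixes: the data is fine on disk, it is just being decoded with the wrong charset.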