Created on 09-27-2016 06:02 AM
[Attachment: hcc-58548.zip]
Until recently, Hive could only read and write UTF-8 text files; no other character sets were supported, forcing people to convert their possibly huge and/or numerous input files to UTF-8 using "iconv" or another such utility, which can be cumbersome (for example, iconv supports only files smaller than 16 GB) and time-consuming. However, since Hive 0.14, conversion from and to other character sets is supported by way of LazySimpleSerDe. The SerDe can be specified, for example, like this:
hive> CREATE TABLE mytable(...) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.encoding'='SJIS');
or by declaring the table in the usual way, omitting the SerDe and specifying the field delimiter, and then running:
hive> ALTER TABLE mytable SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
where, instead of "SJIS", one can insert the encoding of the input file (note that character-set self-detection is not supported), for example one of the ISO-8859-x encodings for Indo-European languages, SJIS for Japanese, GBK for Chinese, and so on.
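For instance, to read a GBK-encoded file, the same two-step pattern would apply; here is a minimal sketch (the table name, columns, and HDFS path are illustrative, mirroring the examples below):
hive> CREATE EXTERNAL TABLE city_gbk(id int, city_org string, city_en string, cnt string, ppl float) row format delimited fields terminated by ',' location '/user/it1/hive/gbk';
hive> ALTER TABLE city_gbk SET SERDEPROPERTIES ('serialization.encoding'='GBK');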
SJIS has been widely used in Japan, especially on feature phones, because it results in smaller text files. The ISO-8859-x encodings have been used in Europe, and they also result in smaller text files. [The files used in the examples are available in the attachment.] Let's create an external table on an SJIS file, with fields: id, city name in the native language, city name in English, country, and population (in millions).
$ hdfs dfs -mkdir -p hive/sjis/cj
$ hdfs dfs -put cjcities.sjis hive/sjis/cj
hive> CREATE EXTERNAL TABLE city_sjis(id int, city_org string, city_en string, cnt string, ppl float) row format delimited fields terminated by ',' location '/user/it1/hive/sjis/cj';
Hive attempts to interpret the SJIS bytes as UTF-8, which results in a garbled printout:
hive> select * from city_sjis limit 1;
1 ���� Tokyo Japan 33.8
After setting the character set, the printout is correct. Note that SJIS also supports the Greek alphabet and Russian Cyrillic.
hive> ALTER TABLE city_sjis SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
hive> select * from city_sjis;
1 東京 Tokyo Japan 33.8
2 大阪 Osaka Japan 16.7
11 北京 Beijing China 13.2
12 廣州 Guangzhou China 15.3
21 Αθηνα Athens Greece 3.7
31 Якутск Yakutsk Russia 0.6
Conditions expressed in non-ASCII characters also work:
hive> select * from city_sjis where city_org like '%京';
1 東京 Tokyo Japan 33.8
11 北京 Beijing China 13.2
Similarly, for ISO-8859-1:
$ hdfs dfs -mkdir -p hive/sjis/eu
$ hdfs dfs -put eucities.iso-8859-1 hive/sjis/eu
hive> CREATE EXTERNAL TABLE city_isolatin(id int, city_org string, city_en string, cnt string, ppl float) row format delimited fields terminated by ',' location '/user/it1/hive/sjis/eu';
hive> ALTER TABLE city_isolatin SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1');
hive> select * from city_isolatin;
110 La Coruña Corunna Spain 0.37
112 Cádiz Cadiz Spain 0.4
120 Köln Cologne Germany 0.97
121 München Munich Germany 1.2
130 Tårnby Tarnby Danmark 0.04
140 Tønsberg Tonsberg Norway 0.05
150 Besançon Bisanz France 0.12
hive> select * from city_isolatin where city_org like '%å%';
130 Tårnby Tarnby Danmark 0.04
Note that we can also use 'latin1', a well-known alias of 'ISO-8859-1' (as are 'IBM819' and 'CP819'). We can now consolidate both tables into a single UTF-8 one:
hive> create table city as select * from city_sjis;
hive> insert into city select * from city_isolatin;
hive> select * from city where id=1 or id=110;
1 東京 Tokyo Japan 33.8
110 La Coruña Corunna Spain 0.37
The above examples show how to read files in other encodings; however, one can also write files in other encodings, by setting 'serialization.encoding' on the newly created table and doing an "INSERT INTO" that table.
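As a sketch of that write path (the table name 'city_latin1' is illustrative; it reuses the schema from above), something like this writes ISO-8859-1 text files into the table's location:
hive> CREATE TABLE city_latin1(id int, city_org string, city_en string, cnt string, ppl float) row format delimited fields terminated by ',';
hive> ALTER TABLE city_latin1 SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1');
hive> INSERT INTO city_latin1 SELECT * FROM city_isolatin;
Keep in mind that characters which cannot be represented in the target encoding (for example, the Japanese city names, if selected from the consolidated "city" table) will not survive such a round trip.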
Conversion from and to UTF-16 works as well; however, the behavior can be somewhat unexpected (UTF-16 comes in little- and big-endian variants and may or may not carry a byte-order mark). Keep this in mind when attempting to read a UTF-16 file in Hive. If your file isn't read correctly, consider converting it to UTF-8 first, for example using iconv with the "-f UTF-16" or "-f UTF-16LE" option, or convert/create the input files programmatically.
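A minimal conversion sketch (the file names are hypothetical):
$ iconv -f UTF-16LE -t UTF-8 cities.utf16 > cities.utf8
$ hdfs dfs -put cities.utf8 hive/utf8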
Related articles: Hive table with UTF-16 data
Note: All tests were done on a HDP-2.4.0 cluster running Hive-1.2.1.
Created on 12-09-2016 02:15 PM
This is a very helpful post; thank you for creating it.
I have some more questions, though.
First, I could not find this information in the documentation. Specifically, I'd like a definitive list of the encodings that are supported. We have a lot of legacy data, some of which is in the Windows-1252 (alias CP-1252) character encoding; it would be great if we could see it as it is meant to be.
Secondly, this shows how to read data that is encoded in a particular character set. We would ultimately like to get everything converted to UTF-8. Can we do this with "create ... as select ...", with the source being a table whose encoding is set to Windows-1252 and the target set to UTF-8, or will the data still look mangled in the new UTF-8 table?
Created on 12-10-2016 06:23 AM
Hi @Ron Ballard, the list of encodings depends on what's installed on your machines. You can list them using, for example, "iconv -l". CP1252 is installed on the Hortonworks Sandbox and has two other aliases:
$ iconv -l | grep 1252
CP1252 MS-ANSI WINDOWS-1252
Regarding your second question, the answer is Yes, and that's exactly what I'm doing above. "city" is using the default UTF-8 encoding and city_sjis is in SJIS:
create table city as select * from city_sjis;
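To verify the conversion on disk, you can also inspect the files of the new table directly; assuming the default HDP warehouse location (your path and file name may differ), one would expect something like:
$ hdfs dfs -cat /apps/hive/warehouse/city/000000_0 | file -
/dev/stdin: UTF-8 Unicode text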
Created on 01-05-2017 01:54 PM
Hi Predrag,
We are using MultiDelimitSerDe, which, as far as I understand, is built on top of LazySimpleSerDe, but it looks like the serialization.encoding parameter does not have any effect.
File encoding is: ISO-8859 text
Whatever encoding I place into SERDEPROPERTIES has no effect... do you know what might be the issue?
We are using Hortonworks HDP 2.5.0.0 and the table DDL is as follows:
CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES_1 (
  list of columns
)
PARTITIONED BY (columns)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES (
  "field.delim"="||",
  "serialization.encoding"="ISO8859_1"
)
LOCATION 'file/location'
TBLPROPERTIES ("skip.header.line.count"="1");
Regards,
dalibor