Community Articles

Find and share helpful community-sourced technical articles.
Labels (1)
avatar
Master Guru

hcc-58548.zipUntil recently, Hive could only read and write UTF-8 text files, and no other character sets were supported forcing people to convert their possibly huge and/or multiple input files to UTF-8 using "iconv" or other such utility which can be cumbersome (for example, iconv supports only files smaller than 16G), and time-consuming. However, since Hive-0.14, by way of LazySimpleSerDe, conversion from and to other character sets is supported. The SerDe can be specified by, for example

hive> CREATE TABLE mytable(...) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES("serialization.encoding"='SJIS');

or by declaring the table in a usual way, omitting the SerDe, specifying the field delimiter, and then doing:

hive> ALTER TABLE mytable SET SERDEPROPERTIES ('serialization.encoding'='SJIS');

where, instead of "SJIS" one can insert the encoding of the input file (note that char-set self-detection is not supported), like for example one of ISO-8859-x encodings for Indo-Europian languages, SJIS for Japanese, GBK for Chinese and so on.

Using SJIS and ISO-8859-1 Character Sets

SJIS has been widely used in Japan, specially on feature mobile phones, because it results in smaller text files. ISO-8859-x have been used in Europe, and they also result in smaller text files. [Files used in examples are available in the attachment.] Let's create an external table on an SJIS file, with fields: id, city name in the native language, city name in English, country, and population (in millions).

$ hdfs dfs -mkdir -p hive/sjis/cj
$ hdfs dfs -put cjcities.sjis hive/sjis/cj
hive> CREATE EXTERNAL TABLE city_sjis(id int, city_org string, city_en string, cnt string, ppl float) row format delimited fields terminated by ',' location '/user/it1/hive/sjis/cj'; 

Hive attempts to interpret SJIS as UTF-8, and that results in a garbled print-out:

hive> select * from city_sjis limit 1;
1����TokyoJapan33.8 

After setting the character set, the print-out is correct. Note that SJIS also supports Greek alphabet and Russian cyrillic.

hive> ALTER TABLE city_sjis SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
hive> select * from city_sjis;
1	東京	Tokyo	Japan	33.8
2	大阪	Osaka	Japan	16.7
11	北京	Beijing	China	13.2
12	廣州	Guangzhou	China	15.3
21	Αθηνα	Athens	Greece	3.7
31	Якутск	Yakutsk	Russia	0.6 

Conditions expressed in non-ascii characters also work:

hive> select * from city_sjis where city_org like '%京';
1	東京	Tokyo	Japan	33.8
11	北京	Beijing	China	13.2 

Similarly, for ISO-8859-1:

$ hdfs dfs -mkdir -p hive/sjis/eu
$ hdfs dfs -put eucities.iso-8859-1 hive/sjis/eu
hive> CREATE EXTERNAL TABLE city_isolatin(id int, city_org string, city_en string, cnt string, ppl float) row format delimited fields terminated by ',' location '/user/it1/hive/sjis/eu';
hive> ALTER TABLE city_isolatin SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1');
hive> select * from city_isolatin;
110	La Coruña	Corunna	Spain	0.37
112	Cádiz	Cadiz	Spain	0.4
120	Köln	Cologne	Germany	0.97
121	München	Munich	Germany	1.2
130	Tårnby	Tarnby	Danmark	0.04
140	Tønsberg	Tonsberg	Norway	0.05
150	Besançon	Bisanz	France	0.12
hive> select * from city_isolatin where city_org like '%å%';
130TårnbyTarnbyDanmark0.04 

Note that we can also use 'latin1', a well-known alias of 'ISO-8859-1' (or 'IBM819', 'CP819'). We can now consolidate both tables into a single, UTF-8 one:

hive> create table city as select * from city_sjis;
hive> insert into city select * from city_isolatin;
hive> select * from city where id=1 or id=110;
1	東京	Tokyo	Japan	33.8
110	La Coruña	Corunna	Spain	0.37 

The above examples show how to read files from other encodings, however one can also write files to other encodings, by setting the 'serialization.encoding' on the newly created table and doing "INSERT INTO" that table.

About UTF-16

Conversion from/to UTF-16 works as well, however, the behavior might be somewhat unexpected.

  • Writing to a table with encoding set to 'UTF-16' creates a text file so that its every line begins with the so-called BOM (byte order mark) character U+FEFF (2 bytes). This is a non-standard behavior since BOM is usually placed only once, at the very beginning of the file.
  • Writing to a table with encoding set to 'UTF-16LE' or 'UTF-16BE' (Low/Big Endian) creates a text file with no BOM's, this is a standard behavior.
  • In either case, delimiters such as Hive default Ctr-A, or comma or tab are represented in 2 bytes, however the new line LF character is represented in only 1 byte (0x0A), instead of UTF-16 2-byte equivalent 0x000A.

Keep this in mind when attempting to read a UTF-16 file in Hive. If your file doesn't satisfy above conditions you can consider converting your file to UTF-8 using for example iconv with the "-f UTF-16" or "-f UTF-16LE" options, or convert/create input files programmatically to satisfy above conditions.

Related articles: Hive table with UTF-16 data

Note: All tests were done on a HDP-2.4.0 cluster running Hive-1.2.1.

50,540 Views
Comments
avatar
New Contributor

This is a very helpful post; thank you for creating it.

I have some more questions, though.

Firstly I could not find this information in the documentation. Specifically I'd like a valid list of the encodings that are supported. We have a lot of legacy data, some of which is in the Windows-1252 (alias CP-1252) character encoding. It would be great if we could see that as it is meant to be.

Secondly, this shows how to see data that is encoded in a particular character set. We would like to get everything converted to UTF-8 ultimately. Can we do this by create ... as select ..., with the source being a table with encoding set to WIndows-1252 and the target set to UTF-8, or will the data still look mangled in the new UTF-8 table?

avatar
Master Guru

Hi @Ron Ballard, the list of encodings depends on what's installed on your machines. You can list them using for example "iconv -l". CP1252 is installed on Hortonworks Sandbox and has another 2 aliases:

$ iconv -l | grep 1252
CP1252 MS-ANSI WINDOWS-1252

Regarding your second question, the answer is Yes, and that's exactly what I'm doing above. "city" is using the default UTF-8 encoding and city_sjis is in SJIS:

create table city as select * from city_sjis;
avatar
New Contributor

Hi Predrag,

We are using MultiDelimitSerDe that as far as i understand is built on top of LazySimpleSerDe it looks like the serialization.encoding param does not have any effect.

File encoding is: ISO-8859 text

what ever encoding i place into SERDEPROPERTIES does not have any effect...do you know maybe what might be the issue?

we are using hortonworks hdp 2.5.0.0 and the table ddl is as follows:

CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES_1 ( list of columns) PARTITIONED BY ( columns) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe' WITH SERDEPROPERTIES ( "field.delim"="||", "serialization.encoding"="ISO8859_1") LOCATION 'file/location' tblproperties("skip.header.line.count"="1");

Regards,

dalibor