Created 03-23-2017 06:15 PM
I have a problem with Cyrillic symbols in Hive tables. Installed versions:
ambari-server 2.4.2.0-136
hive-2-5-3-0-37 1.2.1000.2.5.3.0-37
Ubuntu 14.04
Here's the problem:
spark@hadoop:~$ locale
LANG=ru_RU.UTF-8
LANGUAGE=ru_RU:ru
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=ru_RU.UTF-8
spark@hadoop:~$ beeline -n spark -u jdbc:hive2://spark@hadoop.domain.com:10000/
Connecting to jdbc:hive2://spark@hadoop.domain.com:10000/
Connected to: Apache Hive (version 1.2.1000.2.5.3.0-37)
Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.3.0-37 by Apache Hive
0: jdbc:hive2://spark@hadoop.domain.com> CREATE TABLE `test`(`name` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8');
No rows affected (0,127 seconds)
0: jdbc:hive2://spark@hadoop.domain.com> insert into test values('привет');
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: insert into test values('привет')(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1490211406894_2481)
INFO : Map 1: -/-
INFO : Map 1: 0/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 1/1
INFO : Loading data to table default.test from hdfs://hadoop.domain.com:8020/apps/hive/warehouse/test/.hive-staging_hive_2017-03-23_13-41-46_215_3133047104896717605-116/-ext-10000
INFO : Table default.test stats: [numFiles=1, numRows=1, totalSize=7, rawDataSize=6]
No rows affected (6,652 seconds)
0: jdbc:hive2://spark@hadoop.domain.com> select * from test;
+------------+--+
| test.name  |
+------------+--+
| ?@825B     |
+------------+--+
1 row selected (0,162 seconds)
I've read through a lot of Apache Hive bug reports and tested Unicode, UTF-8, UTF-16, and some ISO encodings, with no luck.
Can somebody help me with that?
Thanks!
Created 03-23-2017 11:30 PM
Inserting non-ASCII strings from the command line might not work for a number of reasons, and even when it works it's impractical for large data. Instead, put all your strings (any script supported by UTF-8, including Cyrillic) in a file, upload the file to HDFS, create an external table based on that file, and then explore the table using, for example, "... WHERE name='привет'"; that should work. Note that for a new table there is no need to declare serialization.encoding='UTF-8'; it is UTF-8 by default. Ditto for external tables already using UTF-8 data. You need the property only if the input file is in a non-UTF-8 character set like KOI8-R, and in that case it would be serialization.encoding='KOI8-R'. For more details on using native charsets in Hive, see my article.
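A minimal sketch of that approach (the file name, HDFS path, and table name below are placeholders, not from the thread):

spark@hadoop:~$ hdfs dfs -mkdir -p /user/spark/names
spark@hadoop:~$ hdfs dfs -put names.csv /user/spark/names/

-- in beeline; no serialization.encoding needed, UTF-8 is the default
CREATE EXTERNAL TABLE names(`name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/spark/names';

SELECT * FROM names WHERE name='привет';

-- only if the input file were in a non-UTF-8 charset, e.g. KOI8-R, would you add:
-- ALTER TABLE names SET SERDEPROPERTIES ('serialization.encoding'='KOI8-R');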
Created 03-24-2017 10:45 AM
Hello!
Thanks for your reply.
I've tested with an external file on HDFS and it works fine. Thanks.
But could somebody explain to me why the insert fails with Cyrillic symbols? I wrote a small Python script that reads file.csv and runs inserts against a Hive table. I also uploaded the same file.csv to HDFS and created an external table from it. The results differ: for the external table Hive shows the Cyrillic symbols correctly,
but the values that the Python script inserted from the same file are incorrect.
So is writing files to HDFS and querying them through external tables the only way to use Cyrillic? Why doesn't insert work?
Thanks!
Created 03-25-2017 05:52 AM
Well, it seems to be a bug, reported but unattended: HIVE-13983. A workaround is to use INSERT INTO ... SELECT, like:
insert into test select 'привет' from test limit 1;
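Note that the SELECT needs a source with at least one row to project the literal from; here test already contains the garbled row from the earlier insert. On an empty table, select from any non-empty table instead (any_nonempty_table below is a placeholder):

insert into test select 'привет' from any_nonempty_table limit 1;
select * from test;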