Created 03-23-2017 06:15 PM
I have a problem with Cyrillic symbols in Hive tables. Installed versions:
ambari-server 2.4.2.0-136
hive-2-5-3-0-37 1.2.1000.2.5.3.0-37
Ubuntu 14.04
Here's the problem:
spark@hadoop:~$ locale
LANG=ru_RU.UTF-8
LANGUAGE=ru_RU:ru
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=ru_RU.UTF-8
spark@hadoop:~$ beeline -n spark -u jdbc:hive2://spark@hadoop.domain.com:10000/
Connecting to jdbc:hive2://spark@hadoop.domain.com:10000/
Connected to: Apache Hive (version 1.2.1000.2.5.3.0-37)
Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.3.0-37 by Apache Hive
0: jdbc:hive2://spark@hadoop.domain.com> CREATE TABLE `test`(`name` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8');
No rows affected (0,127 seconds)
0: jdbc:hive2://spark@hadoop.domain.com> insert into test values('привет');
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: insert into test values('привет')(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1490211406894_2481)
INFO : Map 1: -/-
INFO : Map 1: 0/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 1/1
INFO : Loading data to table default.test from hdfs://hadoop.domain.com:8020/apps/hive/warehouse/test/.hive-staging_hive_2017-03-23_13-41-46_215_3133047104896717605-116/-ext-10000
INFO : Table default.test stats: [numFiles=1, numRows=1, totalSize=7, rawDataSize=6]
No rows affected (6,652 seconds)
0: jdbc:hive2://spark@hadoop.domain.com> select * from test;
+------------+--+
| test.name  |
+------------+--+
| ?@825B     |
+------------+--+
1 row selected (0,162 seconds)
I've read through a lot of Apache Hive bug reports and tested Unicode, UTF-8, UTF-16, and some ISO encodings, with no luck.
Can somebody help me with that?
Thanks!
Created 03-23-2017 11:30 PM
Inserting non-ASCII strings from the command line might not work for a number of reasons, and even when it works it's impractical for large data. Instead, put all your strings (any script supported by UTF-8, including Cyrillic) in a file, upload the file to HDFS, create an external table based on that file, and then explore the table using, for example, "... WHERE name='привет'"; that should work. Note that for a new table there is no need to declare serialization.encoding='UTF-8'; it is UTF-8 by default. Ditto for external tables already using UTF-8 data. You need the property only if the input file is in a non-UTF-8 character set like KOI8-R, and in that case it would be serialization.encoding='KOI8-R'. For more details on using native charsets in Hive, see my article.
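A minimal sketch of that approach (the file name, HDFS path, and table name below are placeholders, not from the thread):

spark@hadoop:~$ hdfs dfs -mkdir -p /user/spark/names
spark@hadoop:~$ hdfs dfs -put names.csv /user/spark/names/

-- in beeline; no serialization.encoding needed, UTF-8 is the default
CREATE EXTERNAL TABLE names(`name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/spark/names';

SELECT * FROM names WHERE name='привет';

-- only if the input file were in a non-UTF-8 charset, e.g. KOI8-R, would you add:
-- ALTER TABLE names SET SERDEPROPERTIES ('serialization.encoding'='KOI8-R');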
Created 03-24-2017 10:45 AM
Hello!
Thanks for your reply.
I've tested with an external file on HDFS and it works fine. Thanks.
But could somebody explain to me why the insert fails with Cyrillic symbols? I wrote a small Python script that reads file.csv and runs inserts against a Hive table. I also uploaded the same file.csv to HDFS and created an external table from it. The results differ: for the external table Hive shows the Cyrillic symbols correctly,
but the values that the Python script inserted from the same file are incorrect.
So is writing files to HDFS and querying them through external tables the only way to use Cyrillic? Why doesn't insert work?
Thanks!
Created 03-25-2017 05:52 AM
Well, it seems to be a bug, reported but unattended: HIVE-13983. A workaround is to use INSERT INTO ... SELECT, like:
insert into test select 'привет' from test limit 1;
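Note that the SELECT needs a source with at least one row to project the literal from; here test already contains the garbled row from the earlier insert. On an empty table, select from any non-empty table instead (any_nonempty_table below is a placeholder):

insert into test select 'привет' from any_nonempty_table limit 1;
select * from test;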