Hive UTF-8 problems
Labels: Apache Ambari
Created 03-23-2017 06:15 PM
I have a problem with Cyrillic symbols in Hive tables. Installed versions:

ambari-server 2.4.2.0-136
hive-2-5-3-0-37 1.2.1000.2.5.3.0-37
Ubuntu 14.04

What's the problem:
- Set locale to ru_RU.UTF-8:
spark@hadoop:~$ locale
LANG=ru_RU.UTF-8
LANGUAGE=ru_RU:ru
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=ru_RU.UTF-8
- Connect to Hive and create a test table:

spark@hadoop:~$ beeline -n spark -u jdbc:hive2://spark@hadoop.domain.com:10000/
Connecting to jdbc:hive2://spark@hadoop.domain.com:10000/
Connected to: Apache Hive (version 1.2.1000.2.5.3.0-37)
Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.3.0-37 by Apache Hive
0: jdbc:hive2://spark@hadoop.domain.com> CREATE TABLE `test`(`name` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8');
No rows affected (0,127 seconds)
- Insert Cyrillic symbols:

0: jdbc:hive2://spark@hadoop.domain.com> insert into test values('привет');
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: insert into test values('привет')(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1490211406894_2481)
INFO : Map 1: -/-
INFO : Map 1: 0/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 1/1
INFO : Loading data to table default.test from hdfs://hadoop.domain.com:8020/apps/hive/warehouse/test/.hive-staging_hive_2017-03-23_13-41-46_215_3133047104896717605-116/-ext-10000
INFO : Table default.test stats: [numFiles=1, numRows=1, totalSize=7, rawDataSize=6]
No rows affected (6,652 seconds)
- Select from table:
0: jdbc:hive2://spark@hadoop.domain.com> select * from test;
+------------+--+
| test.name  |
+------------+--+
| ?@825B     |
+------------+--+
1 row selected (0,162 seconds)
I've read a lot of bug reports for Apache Hive, and tested Unicode, UTF-8, UTF-16, and some ISO encodings with no luck.
Can somebody help me with that?
Thanks!
Created 03-23-2017 11:30 PM
Inserting non-ASCII strings from the command line might not work for a number of reasons, and even when it works it's impractical for large data. Instead, put all your strings (in any script supported by UTF-8, including Cyrillic) in a file, upload the file to HDFS, create an external table based on that file, and try to query the table using, for example, "... WHERE name='привет'"; that should work. Note that for a new table there is no need to declare serialization.encoding='UTF-8': it is UTF-8 by default. Ditto for external tables already using UTF-8 data. You need it only if the input file is in a non-UTF-8 character set like KOI8-R, in which case it will be serialization.encoding='KOI8-R'. For more details on using native charsets in Hive, see my article.
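A minimal sketch of that workflow (names.txt, /user/spark/test_utf8 and test_ext are made-up names; the connection URL copies the one used above):

# put the UTF-8 strings in a local file and upload it to HDFS
spark@hadoop:~$ printf 'привет\n' > names.txt
spark@hadoop:~$ hdfs dfs -mkdir -p /user/spark/test_utf8
spark@hadoop:~$ hdfs dfs -put names.txt /user/spark/test_utf8/
# create an external table over that directory; no serialization.encoding
# property is needed here, because UTF-8 is the default
spark@hadoop:~$ beeline -n spark -u jdbc:hive2://spark@hadoop.domain.com:10000/
0: jdbc:hive2://spark@hadoop.domain.com> CREATE EXTERNAL TABLE `test_ext`(`name` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/spark/test_utf8';
0: jdbc:hive2://spark@hadoop.domain.com> select * from test_ext where name='привет';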
Created 03-24-2017 10:45 AM
Hello!
Thanks for your reply.
I've tested with an external file on HDFS and it works fine. Thanks.
But could somebody explain to me why insert fails with Cyrillic symbols? I've created a small Python script which reads file.csv and runs inserts against a Hive table. I've also uploaded that same file.csv to HDFS and created an external table from it. The results differ: the external table works fine and Hive shows the correct Cyrillic symbols, but the values inserted by the Python script from the same file are incorrect.
So, is that the only way to use Cyrillic? Should I just write files to HDFS and use external tables? Why doesn't insert work?
Thanks!
Created 03-25-2017 05:52 AM
Well, it seems to be a bug, reported but unattended: HIVE-13983. A workaround is to use INSERT INTO ... SELECT, like:
insert into test select 'привет' from test limit 1;
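Note that this trick needs an existing non-empty table to select from; here it works because test already contains the (mangled) row inserted earlier. Any table with at least one row can serve as the source, e.g. (some_table is a placeholder for any non-empty table):

insert into test select 'привет' from some_table limit 1;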
