Created 05-05-2017 12:49 PM
Hi All,
I have some UTF-8 string text in Hive tables. Querying it with HiveQL from the terminal, or with Spark SQL from spark-shell, displays the text correctly, with no encoding errors:
>> select address from customerinfo limit 2;
address
----------------
Brahegatan 50
Envägen 12
However, when I query the data using Hive View in Ambari, the results are not displayed correctly:
select address from customerinfo limit 2;
address
----------------
Brahegatan 50
Env?gen 12
A similar issue occurs when querying with Zeppelin using Spark SQL.
Is this a bug in Hive View or Ambari? Do you know the cause of the problem and how to solve it?
BR,
/Nhan
Created 05-05-2017 12:58 PM
Which version of Ambari are you using?
Some older versions of Ambari (prior to Ambari 2.4) have issues with multibyte characters that will cause such problems:
Created 05-05-2017 01:23 PM
I am using HDP 2.6 with Ambari 2.5.
We have this problem in both Zeppelin and Hive View.
Created 05-05-2017 03:39 PM
@Nhan Nguyen I can reproduce the issue in HDP 2.6 with Ambari 2.5.
It looks like the issue comes from the Hive side: https://issues.apache.org/jira/browse/HIVE-15927
Example:
[root@sandbox hive-next-view]# su - hive
[hive@sandbox ~]$ beeline
Beeline version 1.2.1000.2.6.0.3-8 by Apache Hive
beeline> !connect jdbc:hive2://sandbox.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Connecting to jdbc:hive2://sandbox.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Enter username for jdbc:hive2://sandbox.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2: hive
Enter password for jdbc:hive2://sandbox.hortonworks.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2: ****
Connected to: Apache Hive (version 1.2.1000.2.6.0.3-8)
Driver: Hive JDBC (version 1.2.1000.2.6.0.3-8)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://sandbox.hortonworks.com:2181/> SELECT * FROM customerinfo;
+-----------------------+--+
| customerinfo.address  |
+-----------------------+--+
| Env�gen               |
+-----------------------+--+
1 row selected (0.255 seconds)
Created 05-05-2017 08:57 PM
This is not a Hive issue but rather a file system or file encoding issue. SELECT * in Hive does nothing except read the file from the file system, so if you run hadoop fs -cat on the underlying file, you should see the same behavior.
You can check the file encoding in bash with: $ file -i filename
You can change the encoding with iconv, converting the file to UTF-8, which is a printable encoding:
iconv -f current_encoding -t new_encoding input.file -o out.file
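The same check-then-convert idea can be sketched in Python for cases where iconv is not at hand. This is only an illustration: the helper name to_utf8 and the ISO-8859-1 fallback are assumptions, and the real source encoding has to be known, because bytes alone cannot prove which legacy encoding produced them.

```python
def to_utf8(raw: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode it.

    'fallback' is an assumed source encoding (hypothetical default);
    it must match how the file was actually written.
    """
    try:
        raw.decode("utf-8")          # strict decode: raises on invalid bytes
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback).encode("utf-8")


# Sample bytes standing in for a line of the underlying data file.
latin1_bytes = "Envägen 12".encode("iso-8859-1")
utf8_bytes = "Envägen 12".encode("utf-8")

print(to_utf8(latin1_bytes) == utf8_bytes)  # transcoded from the fallback
print(to_utf8(utf8_bytes) == utf8_bytes)    # already UTF-8, left untouched
```

A strict UTF-8 decode is used as the test because a lone ISO-8859-1 byte such as 0xE4 followed by an ASCII letter is not a valid UTF-8 sequence.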
Created 05-08-2017 09:05 AM
We have checked, and the file encoding is UTF-8. Please note that if we use the Hive command line interface to run SELECT *, the text displays correctly in the terminal. But in Hive View and Zeppelin, the text displays incorrectly.
So it might not be an issue with Hive itself but with Hive View/Zeppelin.
Created 05-06-2017 05:32 PM
This seems more like an issue with the encoding of the source/input file. As @Jay SenSharma mentioned:
0: jdbc:hive2://xlnode-2.h.c:2181,xlnode-3.h.> select * from abc_orc;
+---------------+--+
| abc_orc.col1  |
+---------------+--+
| Env�gen       |
+---------------+--+
We can check the encoding of this file:
-bash-4.1$ hdfs dfs -get /apps/hive/warehouse/abc/000000_0 .
-bash-4.1$ file 000000_0
000000_0: ISO-8859 text
and, as @Umair Khan stated, if we convert the encoding, the text displays correctly:
-bash-4.1$ iconv -f ISO-8859-1 -t UTF-8//TRANSLIT 000000_0 -o 000000_1
-bash-4.1$ file 000000_1
000000_1: UTF-8 Unicode text
-bash-4.1$ hdfs dfs -put 000000_1 /apps/hive/warehouse/abc/
-bash-4.1$ beeline -u "jdbc:hive2://xlnode-2.h.c:2181,xlnode-3.h.c:2181,xlnode-1.h.c:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n hive -p ''
Connecting to jdbc:hive2://xlnode-2.h.c:2181,xlnode-3.h.c:2181,xlnode-1.h.c:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2
Connected to: Apache Hive (version 1.2.1000.2.6.0.3-8)
Driver: Hive JDBC (version 1.2.1000.2.6.0.3-8)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.6.0.3-8 by Apache Hive
0: jdbc:hive2://xlnode-2.h.c:2181,xlnode-3.h.> select * from abc;
+-----------+--+
| abc.col1  |
+-----------+--+
| Env�gen   |
| Envägen   |
+-----------+--+
2 rows selected (0.26 seconds)
0: jdbc:hive2://xlnode-2.h.c:2181,xlnode-3.h.>
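The two rows in that beeline result can be reproduced outside Hive. The sketch below assumes the reader decodes every file as UTF-8 and substitutes the replacement character for invalid bytes; the helper name show_as_utf8 is made up for the illustration.

```python
# The same word stored in two encodings, as in the two files under the table.
latin1_row = "Envägen".encode("iso-8859-1")   # b'Env\xe4gen'
utf8_row = "Envägen".encode("utf-8")          # b'Env\xc3\xa4gen'


def show_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


print(show_as_utf8(latin1_row))  # Env�gen
print(show_as_utf8(utf8_row))    # Envägen
```

The byte 0xE4 on its own is not valid UTF-8, so it comes out as U+FFFD, while the converted file decodes cleanly.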
Can you try a different browser? Or, if you are using Chrome, can you enable support for all encodings and see if that works?
Created 05-08-2017 09:09 AM
We have checked, and the file encoding is UTF-8. Please note that if we use the Hive command line interface to run SELECT *, the text displays correctly in the terminal. But in Hive View and Zeppelin, the text displays incorrectly.
I have also tested with different browsers and different encodings, but Hive View cannot render the text unless Unicode UTF-8 is used.
Created 05-08-2017 09:35 AM
I ran a test. I created a text file with content in Unicode UTF-8:
> cat test.csv
björn,alvägen
> file test.csv
test.csv: UTF-8 Unicode text
And created a table reading from that CSV file:
hive> CREATE EXTERNAL TABLE test (
    > column1 String,
    > column2 String
    > )
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    > LOCATION '/user/myuser/test/';
hive> select * from test;
OK
björn alvägen
Time taken: 0.058 seconds, Fetched: 1 row(s)
But in Hive View:
test.column1 test.column2
bj?rn alv?gen
A similar issue occurs in Zeppelin.
So it is not a problem with the encoding of the source file, @Jay SenSharma, @Umair Khan, @Shyam Sunder Rai: my source file is in UTF-8, and Hive can query the data and display it correctly in UTF-8. It must be an issue with Hive View and Zeppelin.
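One way a plain "?" (rather than "�") can appear is when text that was decoded correctly is later re-encoded into a charset that has no slot for the character. The sketch below only illustrates that mechanism: the narrow helper and the choice of ASCII as the target charset are assumptions, and whether Hive View or Zeppelin actually does this is not established here.

```python
def narrow(text: str, charset: str = "ascii") -> str:
    """Round-trip text through a narrower charset, as a lossy layer might.

    Characters the charset cannot represent are replaced with '?'.
    The default charset is an assumption chosen for the illustration.
    """
    return text.encode(charset, errors="replace").decode(charset)


stored = "björn,alvägen".encode("utf-8")   # bytes as they sit in test.csv
decoded = stored.decode("utf-8")           # correct decode, as the Hive CLI shows

print(decoded)          # björn,alvägen
print(narrow(decoded))  # bj?rn,alv?gen
```

If this is the mechanism, the data file itself is fine and the loss happens somewhere between the query result and the browser, which matches the observation that the Hive CLI displays the text correctly.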