I have created a table as following but after inserting data in the table it shows junk characters upon retrial when it has special characters:
create database TestDB_C; use TestDB_C; CREATE TABLE IF NOT EXISTS customerA ( custId int, custName String, city String) COMMENT 'Customer details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE; insert into customerA values (1000, "CustáOne", "BLR"); select * from customerA;
Notice the data "CustáOne" contains special character. I also tried using the following approach with no success:
ALTER TABLE customera SET serdeproperties ('serialization.encoding'='UTF-8');
Try specifying the full Serde definition (although what you tried should work):
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’ WITH SERDEPROPERTIES(“serialization.encoding”=’UTF-8′);
I am not sure if specifying "ROW FORMAT SERDE" helps as once we create the table, the formatted description for this table shows the following:
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
It is already using the LazySimpleSerDe format.
@jk, Can you query the same data via a different tool like Excel or something else and check if you still see the same junk character ?
@Michael Young the "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe" approach did not work.
@srai I tried beeline, hive shell, hive view as well but no change in the result.
What i want to know is a simple step to insert a simple data with special character like "CustáOne" ... Can this data be inserted to a very simple hive table? What all step do i need to follow to do that.
Yes, you can store UTF-8 characters in Hive tables and retrieve them. I have tested this with both Left-to-Right and Right-to-Left languages without any problems before. You typically don't have to do anything special to get it to work.
1. Create a delimited text file with a couple of rows of data (including UTF-8 characters). You can use \t as the delimiter. 2. Make sure you save the file as an UTF-8 text file and push it to HDFS. 3. Create an external table in Hive that points to the directory where you placed that file. 4. Run the same query as before to see if the data is displayed correctly.
I'm wondering if there is something happening in the environment when you do the insert. Did you do the insert from the command line or did you use the Hive view to do it. From the command line, try setting your environment using before running hive or beeline.
For command line input, as Michael said, check your LANG settings, it should end in "UTF-8" but it doesn't have to be "en_US", for example for Japanese it's "ja_JP.UTF-8". As for the Hive view, CVJK languages don't work and I'm afraid Europian umlauts don't work either. The fix is coming in Ambari-2.4.