Created 08-31-2016 08:14 AM
I have created a table as following but after inserting data in the table it shows junk characters upon retrial when it has special characters:
Example:
create database TestDB_C; use TestDB_C; CREATE TABLE IF NOT EXISTS customerA ( custId int, custName String, city String) COMMENT 'Customer details' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE; insert into customerA values (1000, "CustáOne", "BLR"); select * from customerA;
Notice the data "CustáOne" contains special character. I also tried using the following approach with no success:
ALTER TABLE customera SET serdeproperties ('serialization.encoding'='UTF-8');
Created 08-31-2016 03:59 PM
Try specifying the full Serde definition (although what you tried should work):
ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe’ WITH SERDEPROPERTIES(“serialization.encoding”=’UTF-8′);
Created 08-31-2016 04:51 PM
I am not sure if specifying "ROW FORMAT SERDE" helps as once we create the table, the formatted description for this table shows the following:
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
It is already using the LazySimpleSerDe format.
@jk, Can you query the same data via a different tool like Excel or something else and check if you still see the same junk character ?
Created 09-01-2016 05:46 AM
@Michael Young the "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe" approach did not work.
@srai I tried beeline, hive shell, hive view as well but no change in the result.
What i want to know is a simple step to insert a simple data with special character like "CustáOne" ... Can this data be inserted to a very simple hive table? What all step do i need to follow to do that.
Created 09-01-2016 04:39 PM
Can we store the simple data with special character "CustáOne" in hive database? And retrieve the data in the same format (without junk characters)?
Created 09-01-2016 05:43 PM
Yes, you can store UTF-8 characters in Hive tables and retrieve them. I have tested this with both Left-to-Right and Right-to-Left languages without any problems before. You typically don't have to do anything special to get it to work.
Try this:
1. Create a delimited text file with a couple of rows of data (including UTF-8 characters). You can use \t as the delimiter. 2. Make sure you save the file as an UTF-8 text file and push it to HDFS. 3. Create an external table in Hive that points to the directory where you placed that file. 4. Run the same query as before to see if the data is displayed correctly.
I'm wondering if there is something happening in the environment when you do the insert. Did you do the insert from the command line or did you use the Hive view to do it. From the command line, try setting your environment using before running hive or beeline.
export LANG=en_US.UTF-8
Created 09-27-2021 09:58 PM
I am reading a csv file with special characters
I have tried the option but the special characters still show up in hive as ? in a diamond shape.
CREATE EXTERNAL TABLE
)
.......
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
'quoteChar'='"',
'separatorChar'=',',
'serialization.encoding'='windows-1252'
)
STORED AS TEXTFILE
LOCATION '..................';
TBLPROPERTIES('serialization.encoding'='windows-1252');
Not sure what else to do
Created 09-27-2021 11:15 PM
@Venk123 as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.
Regards,
Vidya Sargur,Created 09-02-2016 04:27 AM
For command line input, as Michael said, check your LANG settings, it should end in "UTF-8" but it doesn't have to be "en_US", for example for Japanese it's "ja_JP.UTF-8". As for the Hive view, CVJK languages don't work and I'm afraid Europian umlauts don't work either. The fix is coming in Ambari-2.4.