
Why is Hive not able to store special characters like 'á'?


I have created a table as follows, but after inserting data that contains special characters, it shows junk characters upon retrieval:

Example:

create database TestDB_C;

use TestDB_C;

CREATE TABLE IF NOT EXISTS customerA ( custId int, custName String, city String)
COMMENT 'Customer details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

insert into customerA values (1000, "CustáOne", "BLR");

select * from customerA;

Notice that the data "CustáOne" contains a special character. I also tried the following approach, with no success:

ALTER TABLE customerA SET SERDEPROPERTIES ('serialization.encoding'='UTF-8');
8 REPLIES

Super Guru

@jk

Try specifying the full SerDe definition (although what you tried should work):

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8');
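If the table already exists, as in the question, the same SerDe and encoding can be applied with an ALTER statement; a minimal sketch against the customerA table above:

-- point the existing table at LazySimpleSerDe and declare the file encoding
ALTER TABLE customerA
  SET SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8');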

Guru

@Michael Young

I am not sure specifying "ROW FORMAT SERDE" will help, because once we create the table, the formatted description for it already shows the following:

SerDe Library:      	org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe	 
InputFormat:        	org.apache.hadoop.mapred.TextInputFormat	 
OutputFormat:       	org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

It is already using the LazySimpleSerDe format.
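For reference, the description above is what DESCRIBE FORMATTED reports; against the table from the question:

DESCRIBE FORMATTED customerA;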

@jk, can you query the same data via a different tool, like Excel or something else, and check whether you still see the same junk characters?


@Michael Young the "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe" approach did not work.

@srai I tried Beeline, the Hive shell, and the Hive view as well, but there was no change in the result.

What I want to know is the simple steps to insert data with a special character like "CustáOne" ... Can this data be inserted into a very simple Hive table? What steps do I need to follow to do that?


Can we store simple data with a special character, such as "CustáOne", in a Hive database, and retrieve it in the same format (without junk characters)?

Super Guru

Yes, you can store UTF-8 characters in Hive tables and retrieve them. I have tested this with both Left-to-Right and Right-to-Left languages without any problems before. You typically don't have to do anything special to get it to work.

Try this:

1. Create a delimited text file with a couple of rows of data (including UTF-8 characters). You can use \t as the delimiter.
2. Make sure you save the file as a UTF-8 text file and push it to HDFS.
3. Create an external table in Hive that points to the directory where you placed that file.
4. Run the same query as before to see if the data is displayed correctly (see the sketch below).
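A minimal sketch of steps 3 and 4, assuming the file was pushed to a hypothetical HDFS directory /user/hive/utf8_demo and has the same three columns as the table in the question:

-- step 3: external table over the UTF-8 text file (directory name is hypothetical)
CREATE EXTERNAL TABLE IF NOT EXISTS customerA_ext (
  custId int,
  custName String,
  city String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/utf8_demo';

-- step 4: run the same kind of query as before
SELECT * FROM customerA_ext;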

I'm wondering if there is something happening in the environment when you do the insert. Did you do the insert from the command line, or did you use the Hive view? From the command line, try setting your environment as follows before running hive or beeline:

export LANG=en_US.UTF-8

New Contributor

I am reading a CSV file with special characters.

I have tried the option above, but the special characters still show up in Hive as a ? in a diamond shape.

 

CREATE EXTERNAL TABLE
(
.......
)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
'quoteChar'='"',
'separatorChar'=',',
'serialization.encoding'='windows-1252'
)
STORED AS TEXTFILE
LOCATION '..................'
TBLPROPERTIES ('serialization.encoding'='windows-1252');

 

[attached screenshot of the query output]

 

Not sure what else to do.

Community Manager

@Venk123 as this is an older post, you would have a better chance of receiving a resolution by starting a new thread. This will also be an opportunity to provide details specific to your environment that could aid others in assisting you with a more accurate answer to your question. You can link this thread as a reference in your new post.



Regards,

Vidya Sargur,
Community Manager


Master Guru

For command-line input, as Michael said, check your LANG setting; it should end in "UTF-8", but it doesn't have to be "en_US": for Japanese, for example, it's "ja_JP.UTF-8". As for the Hive view, CJK languages don't work, and I'm afraid European umlauts don't work either. The fix is coming in Ambari-2.4.