hadoop where to provide connection encoding


I am trying to export data from HDFS to Netezza, and a few French characters are giving me trouble. The only related post I found on the internet is the following.

http://grokbase.com/t/sqoop/user/137gtanzx8/sqoop-utf-8-data-load-issue

However, I am not sure which configuration file he is talking about. Would someone please let me know in which configuration file I need to provide the connection encoding?


Re: hadoop where to provide connection encoding


Hi @Gaurang Shah,

There are three main things to note about this character-encoding issue:

1. What type of data we have in HDFS/Hive

If the data is originally UTF-8 encoded and stored as UTF-8, there should not be an issue. However, in some cases data in a language-specific encoding is loaded into Hive (which it supports) and then read with a different encoding, and you will then see garbled characters. In such cases we must give the deserializer the proper configuration so that it can extract the data accurately (without any translation).

For that, specify the encoding at the Hive table level with SerDe properties:

ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe" WITH SERDEPROPERTIES("serialization.encoding"='UTF-8');

UTF-8 can be replaced by any other charset that the SerDe library supports.
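To illustrate why the reader and the writer must agree on the encoding, here is a minimal Python sketch. It does not use Hive or Sqoop; the sample string and the `decode_as` helper are hypothetical and chosen only for illustration.

```python
# Illustrative only: the same bytes read back with two different charsets.
SAMPLE = "Café déjà vu"  # hypothetical French sample text


def decode_as(raw: bytes, charset: str) -> str:
    """Decode raw bytes the way a reader configured for `charset` would."""
    return raw.decode(charset)


# Bytes as they would be stored in a UTF-8 encoded file.
stored = SAMPLE.encode("utf-8")

good = decode_as(stored, "utf-8")    # reader agrees with the writer
bad = decode_as(stored, "latin-1")   # reader assumes a different charset

print(good == SAMPLE)  # the round trip is lossless
print(bad == SAMPLE)   # the accented characters no longer match
print(bad)             # shows the garbled text
```

The second decode does not raise an error, it silently produces wrong characters, which is why this kind of mismatch is easy to miss until the data is inspected.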

2. Letting Sqoop know your character set

This ensures that the data is encoded and decoded with the same character set.

On the Sqoop import/export, the following property ensures that you are not translating from one character set to another and producing untranslatable or other junk characters (described here).

--default-character-set=utf8

3. Target Character Set

Ensure that your target table (in Netezza/Teradata/Oracle) has the same character set defined for its columns, so that it won't reject rows while loading the data. In most cases this is the root cause of the failures.

On another note: even if you did not check the first and second points above, you may still be able to load the data into the target as long as the target (Netezza) supports a rich character set. That does not mean the data was loaded as is, though (instead it was truncated and then loaded).

While exporting the data, you may use hcat/query so that the SerDe properties are enforced while extracting the data.

Hope this helps !!


Re: hadoop where to provide connection encoding

@bkosaraju

The HDFS file has UTF-8 encoding and the Netezza table also has UTF-8 encoding. The problem is with NFC (normalization).
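For readers unfamiliar with NFC, the following Python sketch shows how two valid UTF-8 strings can look identical yet differ in normalization form. It only illustrates the concept; the thread does not say how the issue was resolved, and normalizing the data before export is an assumption, not a confirmed fix. The helper names `to_nfc` and `is_nfc` are hypothetical.

```python
import unicodedata

# Two spellings of the same visible character "é":
composed = "\u00e9"     # single precomposed code point (NFC form)
decomposed = "e\u0301"  # "e" followed by a combining acute accent (NFD form)


def to_nfc(text: str) -> str:
    """Return the NFC-normalized form of the text."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text: str) -> bool:
    """Report whether the text is already in NFC form."""
    return to_nfc(text) == text


# Both are valid UTF-8, but the byte sequences differ.
print(composed.encode("utf-8"))    # 2 bytes
print(decomposed.encode("utf-8"))  # 3 bytes

print(composed == decomposed)          # the strings differ before normalization
print(to_nfc(decomposed) == composed)  # they match after normalization
print(is_nfc(composed), is_nfc(decomposed))
```

A check like `is_nfc` could be run over a sample of the HDFS data to see whether it contains decomposed characters before the export is attempted.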
