Created 04-10-2018 12:45 AM
I am trying to export data from HDFS to Netezza, and a few French characters are giving me trouble. The only related post I found on the internet is the following:
http://grokbase.com/t/sqoop/user/137gtanzx8/sqoop-utf-8-data-load-issue
However, I am not sure which configuration file he is talking about. Would someone please let me know in which configuration file I need to provide the connection encoding?
Created 04-11-2018 12:17 AM
Hi @Gaurang Shah,
We need to note three main things about this character-encoding issue:
1. What encoding the data in HDFS/Hive has
In this context, if the data was originally UTF-8 encoded and stored as UTF-8, there should not be an issue. However, in some cases we load data in a different linguistic encoding into Hive (which it supports) and then try to read it with a different encoding; in such cases you will see some weird characters. We must therefore ensure that the proper configuration is given to the deserializer so that it can extract the data accurately (without making any translations).
For that, you must specify the encoding at the Hive table level with SerDe properties:
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe" WITH SERDEPROPERTIES("serialization.encoding"='UTF-8');
UTF-8 can be replaced with any other character set supported by the SerDe library.
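For example, a minimal sketch of applying this to an existing table (the table name my_table is hypothetical):
-- Hypothetical table; ALTER TABLE ... SET SERDEPROPERTIES applies the
-- encoding to an already-created Hive table.
ALTER TABLE my_table SET SERDEPROPERTIES ('serialization.encoding' = 'UTF-8');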
2. Letting Sqoop know your character set
This will ensure that the character set is encoded and decoded with the same encoding module.
On the Sqoop import/export, the following property will ensure that you are not translating from one character set to another and producing untranslatable or otherwise garbled characters (as described above); a sketch of a full command follows below.
--default-character-set=utf8
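As a rough sketch only: where this flag is accepted depends on the connector (in Sqoop, arguments after a bare -- are passed through to the underlying tool, and not every connector honors this one), and the host, database, table, and paths below are all hypothetical:
# Sketch of a Sqoop export; connection details and paths are made up.
# The trailing -- passes --default-character-set through to the
# underlying connector tool, for connectors that accept it.
sqoop export \
  --connect jdbc:netezza://nz-host:5480/MYDB \
  --username nzuser -P \
  --table FRENCH_DATA \
  --export-dir /user/hive/warehouse/french_data \
  --input-fields-terminated-by ',' \
  -- --default-character-set=utf8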
3. Target Character Set
Ensure that your target table (in Netezza/Teradata/Oracle) has the same character set defined in its column properties, so that rows are not rejected while loading the data. In most cases this is the root cause of the failures.
On another note: even if you did not check the first and second points mentioned above, you may still be able to load the data into the target by making sure that the target (Netezza) supports a rich character set, but that does not mean the data was loaded as-is (it may have been translated or truncated during the load).
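As a sketch of what that means on the Netezza side (the table and column names are hypothetical): Netezza generally stores plain CHAR/VARCHAR data as Latin-9, while NCHAR/NVARCHAR columns store UTF-8, so accented French characters usually need the N-typed columns.
-- Hypothetical target table; NVARCHAR holds UTF-8 data in Netezza.
CREATE TABLE FRENCH_DATA (
  ID   INTEGER,
  NAME NVARCHAR(200)
);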
While exporting the data you may use HCatalog (or a free-form query) so that the SerDe properties are enforced while extracting the data; a sketch follows below.
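For example, the same export driven through HCatalog instead of --export-dir (names are again hypothetical):
# Sketch: HCatalog reads the Hive table through its SerDe, so the
# serialization.encoding property from point 1 is honored.
sqoop export \
  --connect jdbc:netezza://nz-host:5480/MYDB \
  --username nzuser -P \
  --table FRENCH_DATA \
  --hcatalog-database default \
  --hcatalog-table french_data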
Hope this helps !!
Created 04-11-2018 01:06 PM
The HDFS file has UTF-8 encoding and the Netezza table also has UTF-8 encoding. The problem is with NFC (Unicode normalization): the same accented character can be encoded precomposed (NFC, e.g. é as U+00E9) or decomposed (NFD, e.g. e followed by the combining accent U+0301), and the two byte sequences do not compare equal.
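If the source data is in NFD, one way to normalize it to NFC before the export might be ICU's uconv tool, assuming it is available on the cluster; the file names here are hypothetical:
# Sketch: rewrite a UTF-8 file as NFC-normalized UTF-8 with ICU uconv.
uconv -f UTF-8 -t UTF-8 -x NFC part-00000 -o part-00000.nfc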