<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: hadoop where to provide connection encoding in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/hadoop-where-to-provide-connection-encoding/m-p/215002#M176914</link>
<description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/62318/gaurangnshah.html" nodeid="62318"&gt;@Gaurang Shah&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;There are three main things to check for this character-encoding issue:&lt;/P&gt;&lt;P&gt;1. The encoding of the data in HDFS/Hive&lt;/P&gt;&lt;P style="margin-left: 40px;"&gt;If the data was originally UTF-8 encoded and is stored as UTF-8, there should be no issue. However, if the data was loaded into Hive in some other supported encoding and is then read back assuming a different one, you will see garbled characters. In that case you must give the deserializer the correct character set so it can extract the data accurately, without any unintended translation.&lt;/P&gt;&lt;P style="margin-left: 40px;"&gt;You can specify this at the Hive table level with SerDe properties:&lt;/P&gt;&lt;PRE&gt;ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding'='UTF-8');&lt;/PRE&gt;&lt;P style="margin-left: 40px;"&gt;For an existing table the same property can be set in place (the table name here is just an example):&lt;/P&gt;&lt;PRE&gt;ALTER TABLE my_table SET SERDEPROPERTIES ('serialization.encoding'='UTF-8');&lt;/PRE&gt;&lt;P&gt;UTF-8 can be replaced with any other character set supported by the SerDe library.&lt;/P&gt;&lt;P&gt;2. Letting Sqoop know your character set&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;This ensures that the data is encoded and decoded with the same character set, so the import/export does not translate from one character set to another and produce untranslatable or otherwise junk characters (described &lt;A href="https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_controlling_the_import_process" target="_blank"&gt;here&lt;/A&gt;). For a direct-mode MySQL import, the option is passed through to the underlying tool after the "--" separator (the connection details below are placeholders):&lt;/P&gt;&lt;PRE&gt;sqoop import --connect jdbc:mysql://server.example.com/db --table foo --direct -- --default-character-set=utf8&lt;/PRE&gt;&lt;P&gt;3. Target character set&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;Ensure that your target table (in Netezza/Teradata/Oracle) has the same character set defined in its column properties, so that rows are not rejected while loading the data; in most cases this is the root cause of the failures.&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;On another note: even if you skip the first and second points above, you may still be able to load the data into the target by making sure the target (Netezza) supports a rich character set, but that does not mean the data was loaded as-is (it may have been translated or truncated instead).&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;While exporting the data, you may use HCatalog or a query so that the SerDe properties are enforced while extracting the data.&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;Hope this helps!!&lt;/P&gt;</description>
    <pubDate>Wed, 11 Apr 2018 07:17:43 GMT</pubDate>
    <dc:creator>bkosaraju</dc:creator>
    <dc:date>2018-04-11T07:17:43Z</dc:date>
  </channel>
</rss>