Created 02-14-2017 07:47 PM
I have a requirement to handle a file that contains special characters (such as trademark symbols and other non-UTF-8 characters).
Created 02-15-2017 08:56 PM
@Reddy, you need to specify the serialization.encoding property along with LazySimpleSerDe when creating the table in order to load data that is not UTF-8 encoded.
Here is one example:
create table table_with_non_utf8_encoding (name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding'='ISO8859_1');
load data local inpath '../encoding-ISO8859_1.txt' overwrite into table table_with_non_utf8_encoding;
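To illustrate why the encoding property matters, here is a minimal Python sketch (not Hive code) of what happens when ISO-8859-1 bytes are read as UTF-8. The sample bytes are a made-up example, not taken from the poster's file.

```python
# Hypothetical sample: the bytes an ISO-8859-1 file would hold for "Café ®".
raw = b"Caf\xe9 \xae"

# Decoding with the file's actual encoding recovers the intended text.
as_latin1 = raw.decode("iso-8859-1")

# Treating the same bytes as UTF-8 is invalid; errors="replace" substitutes
# U+FFFD (the replacement character) for each undecodable byte.
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_latin1)
print(as_utf8)
```

Setting serialization.encoding tells the SerDe which of these interpretations to use when reading the file.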
More details in this jira:
Created on 02-15-2017 11:40 PM - edited 08-19-2019 05:00 AM
Yes, the special characters are displayed in a readable format after adding the serialization.encoding property. However, when I export the data to Teradata with a Sqoop statement that uses a connection manager, I get non-readable characters in Teradata. A screenshot is attached (teradat.png). I suspect Sqoop is not recognizing the special characters correctly, or do I need to use specific Teradata JARs when exporting the data? I have also attached the ingested data (after-ingestion-data-into-hadoop.png) and the data as shown in Hive after adding the encoding property (after-adding-encoding-to-hive-table.png); the same data looks different in Teradata. I would like to see the same characters in Teradata as well. Any help is appreciated.
Created 02-16-2017 03:15 AM
I found a solution: this kind of data can be exported to any RDBMS as UTF8 or any other character set by specifying the character set after the database/host name.
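As a rough sketch of what that can look like for Teradata, the Python snippet below builds a JDBC URL with a character set appended after the host and database, then assembles a Sqoop export command from it. The post does not give the exact syntax, so the CHARSET=UTF8 URL parameter is an assumption based on the Teradata JDBC driver's URL format, and the host, database, table, user, and export directory are all hypothetical placeholders.

```python
import shlex


def teradata_jdbc_url(host, database, charset="UTF8"):
    # Assumption: the Teradata JDBC driver takes comma-separated
    # parameters after the host, including DATABASE and CHARSET.
    return f"jdbc:teradata://{host}/DATABASE={database},CHARSET={charset}"


def sqoop_export_args(url, table, export_dir, username):
    # Standard Sqoop export options; -P prompts for the password.
    return [
        "sqoop", "export",
        "--connect", url,
        "--username", username,
        "-P",
        "--table", table,
        "--export-dir", export_dir,
    ]


# Hypothetical values for illustration only.
url = teradata_jdbc_url("td-host.example.com", "sales_db")
cmd = sqoop_export_args(
    url,
    "CUSTOMER_NAMES",
    "/apps/hive/warehouse/table_with_non_utf8_encoding",
    "etl_user",
)

# Print the command line instead of running it.
print(shlex.join(cmd))
```

Other databases use different URL syntax for the character set, so check the JDBC driver documentation for the target RDBMS.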