Processing Fixed Width Files in Hive Using Native (Non-UTF8) Character Sets
Labels: Apache Hive
Created 07-21-2018 09:09 AM
Hi,
I have a requirement to load a fixed-width file into a Hive table, where the input file is not always UTF-8 encoded.
I found two different classes for this: 'org.apache.hadoop.hive.serde2.RegexSerDe', to read a fixed-width file at defined offsets, and 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe', for non-UTF-8 encodings. But I am unable to use them together when creating an external table.
Can someone please help me with a solution? Thanks in advance!
Created 07-23-2018 01:35 PM
I would just read the table with the LazySimpleSerDe and use the substr() function to extract the columns. I've found that to perform better than the RegexSerDe, and it's clearer to read. You can either run the substring query directly or put it in a view.
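To make that concrete, here is a rough sketch of what this could look like. The table name, view name, path, column names, widths, and the ISO-8859-1 charset are all made up for illustration, and it assumes a Hive version where LazySimpleSerDe honors the 'serialization.encoding' property.

```sql
-- Hypothetical table, path, charset, and column widths; adjust to the real layout.
-- Each line of the file lands in the single STRING column `line`, assuming the
-- default field delimiter (Ctrl-A) never occurs in the data.
CREATE EXTERNAL TABLE raw_fixed_width (
  line STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding' = 'ISO-8859-1')
STORED AS TEXTFILE
LOCATION '/data/raw/fixed_width';

-- substr() is 1-based: substr(str, start, length).
CREATE VIEW customers_v AS
SELECT
  trim(substr(line, 1, 10))  AS customer_id,
  trim(substr(line, 11, 20)) AS customer_name,
  trim(substr(line, 31, 8))  AS signup_date
FROM raw_fixed_width;
```

Note that substr() counts characters of the decoded string, not bytes, so the offsets only line up with a byte-oriented file layout if the source charset is single-byte.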
Created 07-26-2018 07:25 PM
Thank you, Shawn, for your prompt response. I found an alternative: I converted the file to UTF-8 using iconv before reading it through an external table with RegexSerDe. In my case, Hive supports the UTF-8 character set by default.
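For anyone trying the same route, a sketch of the table definition might look like the following. It assumes the files were already converted with something along the lines of `iconv -f ISO-8859-1 -t UTF-8 input.txt > output.txt` (the source encoding and file names are assumptions) and placed under the table location. The table name, columns, widths, and path are hypothetical.

```sql
-- Hypothetical names, path, and widths (10, 20, and 8 characters).
-- Files under LOCATION are assumed to have been converted to UTF-8 already.
CREATE EXTERNAL TABLE customers_regex (
  customer_id   STRING,
  customer_name STRING,
  signup_date   STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.{10})(.{20})(.{8}).*')
STORED AS TEXTFILE
LOCATION '/data/utf8/fixed_width';
```

Each capture group maps to one column in order. The regex has to match the whole line, so the trailing `.*` is there to tolerate extra characters after the last field; lines that do not match come back as NULL columns.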