Support Questions
Find answers, ask questions, and share your expertise

Processing Fixed Width Files in Hive Using Native (Non-UTF8) Character Sets

New Contributor

Hi,

I have a requirement to load a fixed-width file into a Hive table, where the input file is not always UTF-8 encoded.

I found two different classes for this: 'org.apache.hadoop.hive.serde2.RegexSerDe' to read fixed-width fields at defined offsets, and 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' to handle non-UTF-8 encodings. However, I am unable to use them together when creating the external table.
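For reference, the fixed-width read with RegexSerDe alone looks like this (a sketch with a hypothetical two-column layout and path; adjust the widths in the regex to your record format):

```sql
-- Hypothetical layout: columns 1-10 = name, 11-15 = id.
-- Each capture group in input.regex maps to one table column, in order.
CREATE EXTERNAL TABLE fixed_regex (name STRING, id STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.{10})(.{5})')
STORED AS TEXTFILE
LOCATION '/data/fixed_width/';
```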

Could someone please help me with a solution? Thanks in advance!

1 ACCEPTED SOLUTION

Rising Star

I would just read the table with LazySimpleSerDe and use the substr() function to extract the columns. I've found that to be faster than RegexSerDe, and it's clearer to read. You can either run the substring query directly or put it in a view.
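A minimal sketch of that approach, assuming a hypothetical two-field layout (columns 1-10 = name, 11-15 = id), a hypothetical HDFS path, and an ISO-8859-1 source file — the `serialization.encoding` SerDe property is what lets LazySimpleSerDe decode a non-UTF-8 charset:

```sql
-- Read each record as a single raw line, decoding from the source charset.
CREATE EXTERNAL TABLE fixed_raw (line STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding' = 'ISO-8859-1')
STORED AS TEXTFILE
LOCATION '/data/fixed_width/';

-- Slice the fixed offsets out in a view; Hive's substr() is 1-based.
CREATE VIEW fixed_parsed AS
SELECT trim(substr(line, 1, 10)) AS name,
       trim(substr(line, 11, 5)) AS id
FROM fixed_raw;
```

Queries then go against `fixed_parsed`, so downstream users never see the raw fixed-width lines.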


2 REPLIES 2


New Contributor

Thank you Shawn for your prompt response. I found an alternative: I converted the file to UTF-8 using iconv before reading it into the external table with RegexSerDe. In my case, Hive supports UTF-8 character sets by default.
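That pre-conversion step can be sketched like this (file names and the ISO-8859-1 source encoding are assumptions; substitute whatever your source system produces):

```shell
# Hypothetical file names; -f names the source encoding, -t the target.
iconv -f ISO-8859-1 -t UTF-8 input_latin1.txt > output_utf8.txt

# Then stage the converted file where the external table's LOCATION points
# (hypothetical HDFS path):
# hdfs dfs -put output_utf8.txt /data/fixed_width/
```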
