Hive table with UTF-16 data
Labels: Apache Hive
Created 12-04-2015 12:58 PM
One of my clients is trying to create an external Hive table in HDP from CSV files (about 30 files, 2.5 TB in total).
But the files are formatted as "Little-endian, UTF-16 Unicode text, with CRLF, CR line terminators". Here are a couple of issues:
Is there an easy way to convert CSV/TXT files from Unicode (UTF-16 / UCS-2) to ASCII (UTF-8)?
Is there a way for Hive to recognize this format?
He tried to use iconv to convert the UTF-16 files to ASCII, but it fails when the source file is larger than 15 GB:
iconv -c -f utf-16 -t us-ascii
Any suggestions?
Created 12-04-2015 03:40 PM
Here are some solution options I received from Ryan Merriman, Benjamin Leonhardi & Peter Coates.
Option 1
You can use split -l to break the bigger files into smaller pieces, then run iconv on each piece and concatenate the results.
Option 2
If iconv fails, it would be a good idea to write a small conversion program using ICU:
http://userguide.icu-project.org/conversion/converters
Option 3
You can do it in Java. Here's one example:
https://docs.oracle.com/javase/tutorial/i18n/text/stream.html
You can use the File(Input|Output)Stream and String classes, and specify the character encoding when reading (converting byte[] to String):
String s = new String(bytes, charset);
And when writing it back out (String to byte[]):
byte[] out = s.getBytes(charset);
As long as you process the file in chunks rather than reading it all into memory at once, this approach should solve your size limit problem; see the sketch below.
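Here is a minimal, self-contained sketch of that streaming approach, assuming the input really is UTF-16LE as described in the question (the class name and argument handling are just illustrative). Because it converts through a fixed-size buffer, it works the same on a 15 GB file as on a 15 MB one:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf16ToUtf8 {
    public static void main(String[] args) throws IOException {
        // Decode UTF-16LE on the way in, encode UTF-8 on the way out,
        // streaming through a fixed-size buffer so memory use stays
        // constant regardless of file size.
        try (Reader in = new InputStreamReader(
                     new FileInputStream(args[0]), StandardCharsets.UTF_16LE);
             Writer out = new OutputStreamWriter(
                     new FileOutputStream(args[1]), StandardCharsets.UTF_8)) {
            char[] buf = new char[64 * 1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}

Run it as: java Utf16ToUtf8 input.csv output.csv. Note that the CRLF line terminators pass through unchanged, so you may still need to strip the \r characters before Hive reads the data.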
Created 06-29-2016 07:01 PM
I used NiFi's ConvertCharacterSet processor to change from UTF-16LE to UTF-8; it's a great and straightforward option if you're already using NiFi 🙂
Created 07-27-2016 04:11 PM
Hi, where can I find the character set values that are accepted by the ConvertCharacterSet processor?
Also, which components can I use to read a CSV file and write the results into the converted CSV file?
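Since NiFi runs on the JVM, the accepted character set values should be the standard Java charset names (an assumption worth checking against the processor's documentation). A short Java program can list every charset a given JVM supports:

import java.nio.charset.Charset;

public class ListCharsets {
    public static void main(String[] args) {
        // Print every charset name this JVM supports,
        // e.g. UTF-8, UTF-16LE, ISO-8859-1, windows-1252, ...
        Charset.availableCharsets().keySet().forEach(System.out::println);
    }
}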
Created 07-28-2016 07:20 AM
So I found the appropriate components, but they don't convert the file properly. Any ideas? The input file is binary.
