Created 12-04-2015 12:58 PM
One of my clients is trying to create an external Hive table in HDP from CSV files (about 30 files, 2.5 TB in total).
But the files are formatted as: “Little-endian, UTF-16 Unicode text, with CRLF, CR line terminators”. Here are a couple of questions:
Is there an easy way to convert CSV/TXT files from Unicode (UTF-16 / UCS-2) to ASCII (UTF-8)?
Is there is a way for Hive to recognize this format?
He tried to use iconv to convert the UTF-16 files to ASCII, but it fails when the source file is larger than 15 GB:
iconv -c -f utf-16 -t us-ascii input.csv > output.csv
Any suggestions??
Created 12-04-2015 03:40 PM
Here are some solution options I received from Ryan Merriman, Benjamin Leonhardi & Peter Coates:
Option 1
You can use split -l to break the larger files into smaller pieces and run iconv on each piece; see the sketch below.
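A minimal shell sketch of that idea (file names are hypothetical). One caveat: split -l cuts after 0x0A bytes, but a UTF-16LE newline is the two-byte sequence 0x0A 0x00, so a line-count split would misalign the two-byte code units; splitting with split -b on an even byte count keeps them aligned:

# split into 1000 MiB pieces (an even byte count, so UTF-16 code units stay whole)
split -b 1000m input_utf16.csv piece_
# convert each piece; name the encoding utf-16le explicitly, since only the first piece has a BOM
for f in piece_*; do
  iconv -c -f utf-16le -t us-ascii "$f" > "$f.ascii"
done
# stitch the converted pieces back together (glob order preserves the original order)
cat piece_*.ascii > output_ascii.csv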
Option 2
If iconv fails, it would be a good idea to write a little program using ICU:
http://userguide.icu-project.org/conversion/converters
Option 3
You can try to do it in Java. Here’s one example:
https://docs.oracle.com/javase/tutorial/i18n/text/stream.html
You can try using File(Input|Output)Stream and the String class. You can specify the character encoding when reading (converting byte[] to String):
String s = new String(bytes, charset);
And when writing it back out (String to byte[]):
byte[] b = s.getBytes(charset);
This approach should solve your size-limit problem, provided you read and write the file in chunks instead of loading it all into memory at once.
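A minimal sketch along those lines (the class and file names are my own, not from the thread): it wraps the streams in InputStreamReader/OutputStreamWriter and copies a small buffer at a time, so memory use stays constant no matter how large the file is.

import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf16ToUtf8 {
    public static void main(String[] args) throws IOException {
        // Decode UTF-16LE bytes into chars, then re-encode the chars as UTF-8.
        // (StandardCharsets.UTF_16 would auto-detect and consume a BOM instead.)
        try (Reader in = new InputStreamReader(
                new FileInputStream(args[0]), StandardCharsets.UTF_16LE);
             Writer out = new OutputStreamWriter(
                new FileOutputStream(args[1]), StandardCharsets.UTF_8)) {
            char[] buf = new char[64 * 1024];   // 64K chars per read
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}

Usage: java Utf16ToUtf8 input_utf16.csv output_utf8.csv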
Created 06-29-2016 07:01 PM
I used NiFi's ConvertCharacterSet processor to change from UTF-16LE to UTF-8; it's a great and straightforward option if you're already using NiFi 🙂
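For reference, a minimal flow sketch (the processors are standard NiFi components; the directories are hypothetical):

GetFile (Input Directory: /data/in)
  -> ConvertCharacterSet (Input Character Set: UTF-16LE, Output Character Set: UTF-8)
  -> PutFile (Directory: /data/out)

As far as I know, the two character-set properties accept any charset name that Java's java.nio.charset.Charset recognizes.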
Created 07-27-2016 04:11 PM
Hi, where can I find the character set values that are accepted by the ConvertCharacterSet processor?
Also, which components can I use to read a CSV file and to write the results out to the converted CSV file?
Created 07-28-2016 07:20 AM
So I found the appropriate components, but it doesn't convert the file properly, any idea? The input file is binary.