One of my clients is trying to create an external Hive table in HDP from CSV files (about 30 files, 2.5 TB in total).
But the files are formatted as “Little-endian, UTF-16 Unicode text, with CRLF, CR line terminators”. Here are a couple of questions:
Is there an easy way to convert CSV/TXT files from Unicode (UTF-16 / UCS-2) to ASCII (UTF-8)?
Is there a way for Hive to recognize this format?
He tried using iconv to convert the UTF-16 files to ASCII, but it fails when the source file is larger than 15 GB:
iconv -c -f utf-16 -t us-ascii
Here are some solution options I received from Ryan Merriman, Benjamin Leonhardi, and Peter Coates:
You can use split -l to break the bigger file into smaller ones, then run iconv on each piece.
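A sketch of the split-then-convert idea (file names here are placeholders, and the sample file stands in for the real 15 GB one). One caveat worth hedging on: split -l is risky on raw UTF-16, because a line feed is the byte pair 0A 00 and split cuts right after the 0A, so splitting on an even *byte* count is safer. Only the first chunk carries the BOM, so the encoding must be named explicitly:

```shell
# Stand-in for the real multi-GB UTF-16LE file:
printf 'id,name\r\n1,alpha\r\n2,beta\r\n' | iconv -f UTF-8 -t UTF-16LE > big.csv

# Split on an even byte count (use e.g. -b 1G for the real file) so every
# chunk stays aligned on UTF-16 code units; an even split can still land
# inside a surrogate pair, but for mostly-ASCII CSV data that is unlikely.
split -b 16 big.csv chunk_

# Convert each chunk explicitly as UTF-16LE (chunks after the first have
# no BOM) and concatenate the results in order.
for f in chunk_*; do
  iconv -f UTF-16LE -t UTF-8 "$f"
done > big-utf8.csv

cat big-utf8.csv
```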
If iconv fails, it would be a good idea to write a small program using ICU.
You can try to do it in Java. Here’s one example:
You can try using the File(Input|Output)Stream and String classes. You can specify the character encoding when reading (converting bytes to a String):
String s = new String(byte[] bytes, Charset charset)
And when writing it back out (String to bytes):
byte[] b = s.getBytes(charset)
This approach should solve your size limit problem.
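One way to realize that Java suggestion is to stream through character-converting readers and writers instead of loading the whole byte array at once; memory use then stays constant regardless of file size. A minimal sketch (class name and paths are placeholders):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf16ToUtf8 {
    // Stream the input file through a UTF-16LE reader and a UTF-8 writer,
    // one buffer at a time, so a 15 GB file never has to fit in memory.
    public static void convert(String inPath, String outPath) throws IOException {
        try (Reader in = new BufferedReader(new InputStreamReader(
                 new FileInputStream(inPath), StandardCharsets.UTF_16LE));
             Writer out = new BufferedWriter(new OutputStreamWriter(
                 new FileOutputStream(outPath), StandardCharsets.UTF_8))) {
            char[] buf = new char[64 * 1024];   // 64K chars per pass
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        convert(args[0], args[1]);
    }
}
```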
I used NiFi's ConvertCharacterSet processor to change from UTF-16LE to UTF-8; it's a great and straightforward option if you're already using NiFi 🙂
Hi, where can I find the character set values that are accepted by the ConvertCharacterSet processor?
Also, which components can I use to load the CSV file and to dump the results into the converted CSV file?
So I found the appropriate components, but it doesn't convert the file properly. Any idea? The input file is binary.