Created 12-04-2015 12:58 PM
One of my clients is trying to create an external Hive table in HDP from CSV files (about 30 files, 2.5 TB in total).
But the files are formatted as: “Little-endian, UTF-16 Unicode text, with CRLF, CR line terminators”. Here are a couple of questions:
Is there an easy way to convert CSV/TXT files from Unicode (UTF-16 / UCS-2) to ASCII (UTF-8)?
Is there is a way for Hive to recognize this format?
He tried to use iconv to convert the UTF-16 files to ASCII, but it fails when the source file is larger than 15 GB:
iconv -c -f utf-16 -t us-ascii input.csv > output.csv
Any suggestions??
Created 12-04-2015 03:40 PM
Here are some solution options I received from Ryan Merriman, Benjamin Leonhardi & Peter Coates:
Option 1
You can use split -l to break the larger files into smaller pieces and run iconv on each piece; see the sketch below.
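A minimal shell sketch of that idea (file names are hypothetical). One caveat: split -l cuts after 0x0A bytes, but a UTF-16LE newline is the two-byte sequence 0x0A 0x00, so a line-count split would misalign the two-byte code units; splitting with split -b on an even byte count keeps them aligned:

# split into 1000 MiB pieces (an even byte count, so UTF-16 code units stay whole)
split -b 1000m input_utf16.csv piece_
# convert each piece; name the encoding utf-16le explicitly, since only the first piece has a BOM
for f in piece_*; do
  iconv -c -f utf-16le -t us-ascii "$f" > "$f.ascii"
done
# stitch the converted pieces back together (glob order preserves the original order)
cat piece_*.ascii > output_ascii.csv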
Option 2
If iconv fails, it would be a good idea to write a little program using ICU:
http://userguide.icu-project.org/conversion/converters
Option 3
You can try to do it in Java. Here’s one example:
https://docs.oracle.com/javase/tutorial/i18n/text/stream.html
You can try using File(Input|Output)Stream and the String class. You can specify the character encoding when reading (converting byte[] to String):
String s = new String(bytes, charset);
And when writing it back out (String to byte[]):
byte[] b = s.getBytes(charset);
This approach should solve your size-limit problem, provided you read and write the file in chunks instead of loading it all into memory at once.
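A minimal sketch along those lines (the class and file names are my own, not from the thread): it wraps the streams in InputStreamReader/OutputStreamWriter and copies a small buffer at a time, so memory use stays constant no matter how large the file is.

import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf16ToUtf8 {
    public static void main(String[] args) throws IOException {
        // Decode UTF-16LE bytes into chars, then re-encode the chars as UTF-8.
        // (StandardCharsets.UTF_16 would auto-detect and consume a BOM instead.)
        try (Reader in = new InputStreamReader(
                new FileInputStream(args[0]), StandardCharsets.UTF_16LE);
             Writer out = new OutputStreamWriter(
                new FileOutputStream(args[1]), StandardCharsets.UTF_8)) {
            char[] buf = new char[64 * 1024];   // 64K chars per read
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
    }
}

Usage: java Utf16ToUtf8 input_utf16.csv output_utf8.csv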
Created 06-29-2016 07:01 PM
I used NiFi's ConvertCharacterSet processor to change from UTF-16LE to UTF-8; it's a great and straightforward option if you're already using NiFi 🙂
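For reference, a minimal flow sketch (the processors are standard NiFi components; the directories are hypothetical):

GetFile (Input Directory: /data/in)
  -> ConvertCharacterSet (Input Character Set: UTF-16LE, Output Character Set: UTF-8)
  -> PutFile (Directory: /data/out)

As far as I know, the two character-set properties accept any charset name that Java's java.nio.charset.Charset recognizes.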
Created 07-27-2016 04:11 PM
Hi, where can I find the character set values that are accepted by the ConvertCharacterSet processor?
Also, which components can I use to read a CSV file and to write the results out to the converted CSV file?
Created 07-28-2016 07:20 AM
So I found the appropriate components, but it doesn't convert the file properly, any idea? The input file is binary.