<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Hive table with UTF-16 data in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97900#M11411</link>
    <description>&lt;P&gt;So I found the appropriate components, but they don't convert the file properly. Any ideas? The input file is binary.&lt;/P&gt;</description>
    <pubDate>Thu, 28 Jul 2016 14:20:07 GMT</pubDate>
    <dc:creator>lenovomi</dc:creator>
    <dc:date>2016-07-28T14:20:07Z</dc:date>
    <item>
      <title>Hive table with UTF-16 data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97896#M11407</link>
      <description>&lt;P&gt;One of my clients is trying to create an external Hive table in HDP from CSV files (about 30 files, 2.5 TB in total).&lt;/P&gt;&lt;P&gt;But the files are formatted as “Little-endian, UTF-16 Unicode text, with CRLF, CR line terminators”. Here are a couple of questions:&lt;/P&gt;&lt;P&gt;Is there an easy way to convert CSV/TXT files from Unicode (UTF-16 / UCS-2) to UTF-8 or ASCII?&lt;/P&gt;&lt;P&gt;Is there a way for Hive to recognize this format directly?&lt;/P&gt;&lt;P&gt;He tried to use iconv to convert from UTF-16 to ASCII, but it fails when the source file is larger than 15 GB:&lt;/P&gt;&lt;P&gt;iconv -c -f utf-16 -t us-ascii&lt;/P&gt;&lt;P&gt;Any suggestions?&lt;/P&gt;</description>
      <pubDate>Fri, 04 Dec 2015 20:58:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97896#M11407</guid>
      <dc:creator>csankaraiah</dc:creator>
      <dc:date>2015-12-04T20:58:09Z</dc:date>
    </item>
    <item>
      <title>Re: Hive table with UTF-16 data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97897#M11408</link>
      <description>&lt;P&gt;Here are some solution options I received from Ryan Merriman, Benjamin Leonhardi &amp;amp; Peter Coates.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option 1&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;You can use split -l to break the big file into smaller ones, then run iconv on each piece.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option 2&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;If iconv fails, it would be a good idea to write a small program using ICU.&lt;/P&gt;&lt;P&gt;&lt;A href="http://userguide.icu-project.org/conversion/converters"&gt;http://userguide.icu-project.org/conversion/converters&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Option 3&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;You can do it in Java. Here’s one example:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.oracle.com/javase/tutorial/i18n/text/stream.html"&gt;https://docs.oracle.com/javase/tutorial/i18n/text/stream.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can use the File(Input|Output)Stream and String classes. You can specify the character encoding when reading (converting byte[] to String):&lt;/P&gt;&lt;P&gt;String s = new String(byte[] bytes, Charset charset)&lt;/P&gt;&lt;P&gt;And when writing it back out (String to byte[]):&lt;/P&gt;&lt;P&gt;byte[] b = s.getBytes(Charset charset)&lt;/P&gt;&lt;P&gt;Because you can read and write in chunks, this approach avoids the file-size limit.&lt;/P&gt;</description>
      <pubDate>Fri, 04 Dec 2015 23:40:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97897#M11408</guid>
      <dc:creator>csankaraiah</dc:creator>
      <dc:date>2015-12-04T23:40:07Z</dc:date>
    </item>
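Option 3 above, sketched as a minimal self-contained converter (the class name and file names are placeholders, not from the thread): it streams UTF-16LE bytes in through a Reader and writes UTF-8 out through a Writer in fixed-size chunks, so memory use stays constant no matter how large the file is.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Minimal streaming transcoder: UTF-16LE in, UTF-8 out.
// Processes fixed-size chunks, so a 15 GB (or larger) file is no problem.
public class Utf16ToUtf8 {
    public static void convert(InputStream in, OutputStream out) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16LE);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[8192];   // constant memory, independent of file size
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush();
    }

    // Usage: java Utf16ToUtf8 input.csv output.csv
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            convert(in, out);
        }
    }
}
```

Unlike converting the whole file to one String, this never holds more than one buffer in memory at a time.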
    <item>
      <title>Re: Hive table with UTF-16 data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97898#M11409</link>
      <description>&lt;P&gt;I used NiFi's &lt;A href="https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.standard.ConvertCharacterSet/"&gt;ConvertCharacterSet&lt;/A&gt; to change from UTF-16LE to UTF-8; it's a great and straightforward option if you're already using NiFi &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Jun 2016 02:01:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97898#M11409</guid>
      <dc:creator>ahasson</dc:creator>
      <dc:date>2016-06-30T02:01:19Z</dc:date>
    </item>
    <item>
      <title>Re: Hive table with UTF-16 data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97899#M11410</link>
      <description>&lt;P&gt;Hi, where can I find the character set values that are accepted by the ConvertCharacterSet processor?&lt;/P&gt;&lt;P&gt;Also, what component can I use to load the CSV file and to dump the results into the converted CSV file?&lt;/P&gt;</description>
      <pubDate>Wed, 27 Jul 2016 23:11:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97899#M11410</guid>
      <dc:creator>lenovomi</dc:creator>
      <dc:date>2016-07-27T23:11:08Z</dc:date>
    </item>
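On the first question above: ConvertCharacterSet runs on the JVM, so (an assumption on my part; check the NiFi documentation for your version) the accepted values should be the canonical charset names known to java.nio.charset.Charset. You can print the list your own JVM supports:

```java
import java.nio.charset.Charset;

// Print every charset name the local JVM supports, e.g. UTF-8, UTF-16LE, ISO-8859-1.
// Assumption: NiFi's ConvertCharacterSet accepts these java.nio.charset.Charset names.
public class ListCharsets {
    public static void main(String[] args) {
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}
```

Every JVM is required to support UTF-8, UTF-16, UTF-16LE, UTF-16BE, US-ASCII, and ISO-8859-1, so those names are always safe choices.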
    <item>
      <title>Re: Hive table with UTF-16 data</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97900#M11411</link>
      <description>&lt;P&gt;So I found the appropriate components, but they don't convert the file properly. Any ideas? The input file is binary.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jul 2016 14:20:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hive-table-with-UTF-16-data/m-p/97900#M11411</guid>
      <dc:creator>lenovomi</dc:creator>
      <dc:date>2016-07-28T14:20:07Z</dc:date>
    </item>
  </channel>
</rss>

