<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Data Type Validation in Hive in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Data-Type-Validation-in-Hive/m-p/143253#M105842</link>
    <description>&lt;P&gt;My 2 cents to complement a bit Marco Gaido's answer.&lt;/P&gt;&lt;P&gt;Doing data validation entirely in Hive is not something uncommon. Basically, some companies use Hive as an ETL solution the same way they used to do it before with traditional databases:&lt;/P&gt;&lt;P&gt;1) loading the data in a first table ("staging") without doing any validation. The easiest way for that in your case would be to have a (Text file format) table where all the columns are defined as "string", so that you don't get any NULL values when doing a select&lt;/P&gt;&lt;P&gt;2) create a second table (ideally, ORC format) with the types correctly enforced (int, date, string...)&lt;/P&gt;&lt;P&gt;3) have a HQL select to fill the second table from the first table. Some basic transformations might be done to have the format that adapt to the corresponding types. In case that some rows don't pass the validation, that field would be marked as NULL or another "ERROR" token. In that query, you might have to write some complex structures/conditions using regex, UDFs etc.&lt;/P&gt;&lt;P&gt;4) Have another query to just extract all the valid rows from the second table and insert it into a third table (would be the production table). All the lines with NULL or ERROR would go into a fourth table (reporting table).&lt;/P&gt;&lt;P&gt;Note: there might be some ways to merge the steps 3 &amp;amp; 4 in order to reduce the number of queries, doing some "Hive multi inserts".&lt;/P&gt;</description>
    <pubDate>Thu, 12 May 2016 14:09:59 GMT</pubDate>
    <dc:creator>sluangsay</dc:creator>
    <dc:date>2016-05-12T14:09:59Z</dc:date>
  </channel>
</rss>

