Created 05-11-2016 08:35 PM
Hi,
I have data files, which need to be uploaded to Hive. But there is requirement to check the values of columns with non string data type, like int, timestamp etc, have the data according to the type. Like int column has int value, date column should have date etc. Kindly suggest me how can I do this in hive. Or if there is any way to validate the file. File is delimired file.
Regards
Mamta Chawla
Created 05-11-2016 08:49 PM
With Hive what you can do is reading those fields as string and validate them through regexp. Otherwise, if you are sure that you don't have NULL values, you can simply define your schema and if a int field is NULL, this means that it is not properly formatted.
Created 05-11-2016 09:01 PM
You can do two things - create a Hive user defined function that validates each type. or use PIG to do this function prior to loading the data into Hive.
Created 05-11-2016 09:29 PM
Hive does not do data validation based on the fields. Its the users responsibility to check the data and see if it matches with the table being created.
May be this link can help you out with what you are looking for: https://community.hortonworks.com/articles/1283/hive-script-to-validate-tables-compare-one-with-an.h...
Created 05-12-2016 07:09 AM
My 2 cents to complement a bit Marco Gaido's answer.
Doing data validation entirely in Hive is not something uncommon. Basically, some companies use Hive as an ETL solution the same way they used to do it before with traditional databases:
1) loading the data in a first table ("staging") without doing any validation. The easiest way for that in your case would be to have a (Text file format) table where all the columns are defined as "string", so that you don't get any NULL values when doing a select
2) create a second table (ideally, ORC format) with the types correctly enforced (int, date, string...)
3) have a HQL select to fill the second table from the first table. Some basic transformations might be done to have the format that adapt to the corresponding types. In case that some rows don't pass the validation, that field would be marked as NULL or another "ERROR" token. In that query, you might have to write some complex structures/conditions using regex, UDFs etc.
4) Have another query to just extract all the valid rows from the second table and insert it into a third table (would be the production table). All the lines with NULL or ERROR would go into a fourth table (reporting table).
Note: there might be some ways to merge the steps 3 & 4 in order to reduce the number of queries, doing some "Hive multi inserts".