Support Questions
Find answers, ask questions, and share your expertise
Alert: Please see the Cloudera blog for information on the Cloudera Response to CVE-2021-4428

What is the exact process to automate data standardization or custom business rules with hive?

Expert Contributor

Hi Everyone,

Let me give you an overview of my requirements. The source and the target system is an apache hiveOne of our clients has a requirement to build/develop a solution for data standardization/correction rules using hiveQL.
E.g, Replace NULL values with XXX-XXX, Change date format in 
YYYY-MM-DD, Standardize amount column values in US & EU format, etc.


I have already completed the task of writing hive queries for these rules. Now, I am looking for the solution to automate this process(or hive queries) so that, it will act as a generic solution for any tables or columns to be applied. Something like passing the table name, column names, and rules as a parameter to a script.

The tried solution:
1) Pass the table name, column names, and rules as a parameter to a shell script.
2) Iterate on each column and it's respective rules one by one.
3) generate a hive query (as shown below) for each rule just by replacing the "column" word with the actual column name.
4) Once the loop got finished, execute one consolidated CTAS query.
CREATE TABLE output_table AS
rule1 on column1,
rule2 on column1,
rule1 on column2 
FROM source_table.

But there is a  problem with the above approach which is, If we pass two different rules on the same column then it would create two columns in the output table (which is wrong).

I would request all of you either suggest a solution for a tried one or completely different one will also work.

Sample Data correction hive queries:
1) Replace NULL values
   CASE WHEN column IS NULL OR trim(column)="" THEN 'XXX-MISSING-XXX' ELSE column       END AS column
2) Change date format
   CASE WHEN column IS NULL OR trim(column)="" THEN
to_date(from_unixtime(unix_timestamp('31/12/1900', 'dd/MM/yyyy'),'yyyy-MM-dd')) ELSE
to_date(from_unixtime(unix_timestamp(column, 'dd/MM/yyyy'),'yyyy-MM-dd')) END AS column,