Let me give you an overview of my requirements. Both the source and the target system are Apache Hive. One of our clients needs a solution for data standardization/correction rules built with HiveQL, e.g., replace NULL values with XXX-XXX, change the date format to YYYY-MM-DD, standardize amount column values between US and EU formats, etc.
I have already completed the task of writing the Hive queries for these rules. Now I am looking for a way to automate this process (i.e., generating and running these Hive queries) so that it acts as a generic solution for any table or set of columns: something like passing the table name, column names, and rules as parameters to a script.
The approach I tried:

1) Pass the table name, column names, and rules as parameters to a shell script.
2) Iterate over each column and its respective rules one by one.
3) Generate a Hive expression (as shown below) for each rule by replacing the placeholder word "column" with the actual column name.
4) Once the loop finishes, execute one consolidated CTAS query:

CREATE TABLE output_table AS SELECT rule1 on column1, rule2 on column1, rule1 on column2 FROM source_table;
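For reference, a minimal sketch of the generator loop described above. The table name, column names, and rule templates are hypothetical and hard-coded here; the real script takes them as parameters:

```shell
#!/bin/sh
# Sketch of steps 1-4: build one consolidated CTAS statement from
# per-column rule templates that use the placeholder word "column".
# SOURCE_TABLE, OUTPUT_TABLE, and the rule list are hypothetical stand-ins.
SOURCE_TABLE="source_table"
OUTPUT_TABLE="output_table"

SELECT_LIST=""
for PAIR in \
  "column1|CASE WHEN column IS NULL OR trim(column) = '' THEN 'XXX-MISSING-XXX' ELSE column END" \
  "column2|upper(column)"
do
  COL="${PAIR%%|*}"        # column name before the separator
  TEMPLATE="${PAIR#*|}"    # rule template after the separator
  # Step 3: replace the placeholder with the actual column name.
  EXPR=$(printf '%s' "$TEMPLATE" | sed "s/column/$COL/g")
  SELECT_LIST="${SELECT_LIST:+$SELECT_LIST, }$EXPR AS $COL"
done

# Step 4: one consolidated CTAS query (would be submitted via hive/beeline).
QUERY="CREATE TABLE $OUTPUT_TABLE AS SELECT $SELECT_LIST FROM $SOURCE_TABLE"
echo "$QUERY"
```

Note that this sketch reproduces the limitation I describe next: every (column, rule) pair contributes its own SELECT-list entry, so two rules on the same column emit two expressions with the same alias.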
But there is a problem with this approach: if we pass two different rules for the same column, the loop emits two expressions and therefore creates two columns in the output table (e.g., rule1 on column1 and rule2 on column1 above each become a separate output column), which is wrong, since both rules should apply to a single output column.
I would appreciate either a fix for the approach I tried or a completely different solution.
Sample data correction Hive queries:

1) Replace NULL values:

CASE WHEN column IS NULL OR trim(column) = '' THEN 'XXX-MISSING-XXX' ELSE column END AS column

2) Change date format:

CASE WHEN column IS NULL OR trim(column) = '' THEN to_date(from_unixtime(unix_timestamp('31/12/1900', 'dd/MM/yyyy'), 'yyyy-MM-dd')) ELSE to_date(from_unixtime(unix_timestamp(column, 'dd/MM/yyyy'), 'yyyy-MM-dd')) END AS column