final table - Hive managed - ORC storage (16 buckets)
I have a oozie job which is doing all the tasks in the following order :
Sqoop action extracts data from Teradata into a hdfs dir. (staging table is sitting on this dir.)
Hive action calls a Merge statement that Upserts data into final table from staging table.
When the oozie job is successful, I am finding that -
Staging table has correct number of records and data is correct.
Final table has more records than needed.
Extra rows in the final table have many columns as null
Timestamp values from staging table are not loading into final table and I am seeing all nulls for those columns. I am wrapping timestamp columns from staging table into from_unixtime UDF before loading to final table.
These findings remain correct during initial loading or incremental loading of the final table.
Following are two constraints that I am working with -
I have tried using TEXT storage for staging table, in that case it resolves the timestamp issue but other issue remain.
I cannot use PARQUET for staging table because sqoop cannot correctly import timestamp columns due to an open issue.
Is Merge from AVRO to ORC not supported/recommended with Merge? Is Hive Transaction functionality not production ready yet? Please suggest best course of action.