We are serious about offloading ETL to Hadoop. The existing data models in the current integration layer in the database are highly normalized. Below are specific questions regarding this:
1. Dimension modelling
We have tried a star schema on Hive, but the numerous joins have proven to be a problem (queries run long and error out). Are there recommendations on file formats and compression to suit this use case? Are there alternative modelling techniques we could use?
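One common mitigation (a sketch only; all table and column names below are made up) is to pre-join the snowflaked lookups into a flattened dimension and store it in a columnar format such as ORC with Snappy compression, trading storage for fewer joins at query time:

```sql
-- Hypothetical flattened dimension: region/country are pre-joined from
-- their lookup tables so fact queries need only one join.
CREATE TABLE dim_customer_flat (
  customer_sk   BIGINT,
  customer_name STRING,
  city          STRING,
  region        STRING,   -- denormalized from the region lookup
  country       STRING    -- denormalized from the country lookup
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');
```

ORC's built-in indexing and predicate pushdown also tend to help the long-running scans, independent of the modelling choice.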
2. Surrogate key/foreign key - Normalization
We feel surrogate key generation is essential on the data warehouse side. Is there a way to generate keys similar to sequences in a database?
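Hive has no native sequence object, but a common workaround is to offset `ROW_NUMBER()` by the current maximum key. A minimal sketch, assuming hypothetical table names, and with the caveat that unlike a database sequence this is not concurrency-safe across parallel loads:

```sql
-- Hypothetical sketch: emulate a sequence for new staging rows by adding
-- ROW_NUMBER() to the current max surrogate key in the dimension.
INSERT INTO TABLE dim_product
SELECT
  mx.max_sk + ROW_NUMBER() OVER (ORDER BY s.product_code) AS product_sk,
  s.product_code,
  s.product_name
FROM staging_product s
CROSS JOIN (SELECT COALESCE(MAX(product_sk), 0) AS max_sk FROM dim_product) mx;
```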
3. Data refresh for lookups without incremental CDC
What is the recommended way to refresh lookup tables in Hadoop when incremental CDC feeds are not available?
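For small lookups without CDC, the simplest approach we are aware of is a wholesale reload from the latest source extract; a sketch with made-up table names:

```sql
-- Hypothetical full refresh: replace the lookup's contents entirely
-- with the newest staging extract on each load cycle.
INSERT OVERWRITE TABLE lkp_country
SELECT country_code, country_name
FROM staging_country;
```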
4. SCD1 for reference data
What is the best way to implement SCD Type 1? Should we use workarounds such as INSERT OVERWRITE in Hive, or write Java MapReduce/Spark code to implement it?
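To make the INSERT OVERWRITE workaround concrete, here is the pattern as we understand it (a hedged sketch; table and column names are hypothetical): rebuild the dimension so staged rows replace matching existing rows, and untouched rows are carried over.

```sql
-- Hypothetical SCD Type 1 via INSERT OVERWRITE: staging rows win on key
-- collision; dimension rows with no staging match are preserved as-is.
INSERT OVERWRITE TABLE dim_customer
SELECT s.customer_id, s.customer_name, s.city    -- new and updated rows
FROM staging_customer s
UNION ALL
SELECT d.customer_id, d.customer_name, d.city    -- unchanged existing rows
FROM dim_customer d
LEFT JOIN staging_customer s
  ON d.customer_id = s.customer_id
WHERE s.customer_id IS NULL;
```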
5. SCD2 for history
What is the best way to implement SCD Type 2? Should we use workarounds such as INSERT OVERWRITE with UNION ALL and analytic functions in Hive, or write Java MapReduce/Spark code to implement it?
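The UNION ALL workaround we have in mind looks roughly like the following (a sketch only; `eff_date`, `end_date`, and `is_current` are made-up column names, and only `city` is treated as the tracked attribute): close out current rows whose attributes changed, then append a new current version for changed or brand-new keys.

```sql
-- Hypothetical SCD Type 2 via INSERT OVERWRITE + UNION ALL.
INSERT OVERWRITE TABLE dim_customer_hist
-- Branch 1: existing history, closing out current rows that changed.
SELECT d.customer_id, d.city, d.eff_date,
       CASE WHEN s.customer_id IS NOT NULL THEN CURRENT_DATE
            ELSE d.end_date END AS end_date,
       CASE WHEN s.customer_id IS NOT NULL THEN 0
            ELSE d.is_current END AS is_current
FROM dim_customer_hist d
LEFT JOIN staging_customer s
  ON d.customer_id = s.customer_id
 AND d.is_current = 1
 AND d.city <> s.city
UNION ALL
-- Branch 2: new current versions for changed or previously unseen keys.
SELECT s.customer_id, s.city, CURRENT_DATE,
       CAST('9999-12-31' AS DATE) AS end_date, 1 AS is_current
FROM staging_customer s
LEFT JOIN dim_customer_hist d
  ON s.customer_id = d.customer_id AND d.is_current = 1
WHERE d.customer_id IS NULL OR d.city <> s.city;
```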