I am looking for best practices/a suggested workflow for a data science project. Our current workflow is as follows:
- Source data is stored in MS SQL Server (both a DW with fact and dimension tables, and an OLTP system with normalized tables).
- We explore the data by writing SQL queries, and the final data engineering step is a set of complex SQL queries/views that combine the various sources into a single dataset. An example resulting dataset would be customer info + demographics + their transactions, etc.
- We use Python to load that dataset into a dataframe and a machine learning library to build the models.
- In our pilot project (Hadoop environment), an EDL has been created by exporting all the source systems into HDFS, with Hive tables created on top.

Could you suggest best practices we can follow to leverage this the most? I am interested in the following:
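To make the current setup concrete, here is a minimal sketch of the kind of SQL Server view we build today; all table and column names are invented for illustration:

```sql
-- Hypothetical example of the "single dataset" view used as ML input:
-- customer info + demographics + transaction aggregates.
CREATE VIEW dbo.v_customer_dataset AS
SELECT  c.customer_id,
        c.join_date,
        d.age_band,
        d.region,
        SUM(t.amount)  AS total_spend,
        COUNT(*)       AS txn_count
FROM    dbo.dim_customer      AS c
JOIN    dbo.dim_demographic   AS d ON d.customer_id = c.customer_id
JOIN    dbo.fact_transactions AS t ON t.customer_id = c.customer_id
GROUP BY c.customer_id, c.join_date, d.age_band, d.region;
```

Python then reads this view into a dataframe (e.g. via an ODBC connection) for model development.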
1. The SQL tables had indexes/PKs/FKs, which don't exist in Hive. How do we address this? For example, if we need to join 2-3 large tables in Hive, how do we optimize that? Should we use buckets? Or partitioning? Should we create new tables on top of the Sqoop-exported Hive tables?
2. A fact table has many key columns and indexes. How do we choose bucketing or partitioning keys on such tables?
3. In SQL we used to create views/derived tables that served as input to ML development. What is the alternative here? A new Hive table created by joining the other Hive tables? Or loading the different Hive tables in Spark and joining them in a PySpark workflow, letting Spark do the heavy lifting?
4. For the large fact tables loaded into HDFS and Hive, which typically have many key columns (and indexes in SQL), how do we optimize access while joining them? Create multiple versions of the same table with different bucketing/partitioning keys?
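For questions 1 and 2, the direction we are considering is roughly the following Hive DDL: partition the fact table on a coarse, frequently filtered column (so scans can prune partitions) and bucket both sides of the most common join on the same key with the same bucket count, which enables bucket map joins. This is only a sketch with invented names, not a settled design:

```sql
-- Fact table: partitioned by date for pruning, bucketed on the
-- dominant join key (customer_id) for bucketed joins.
CREATE TABLE fact_transactions_opt (
    customer_id BIGINT,
    product_id  BIGINT,
    amount      DECIMAL(18,2)
)
PARTITIONED BY (txn_date STRING)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 64 BUCKETS
STORED AS ORC;

-- Dimension table: bucketed on the same key with the same bucket
-- count, so bucket-wise joins line up.
CREATE TABLE dim_customer_opt (
    customer_id BIGINT,
    segment     STRING
)
CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 64 BUCKETS
STORED AS ORC;

-- Hints for bucketed / sort-merge-bucket joins:
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
```

Since a table can only be bucketed one way, joins on the other key columns would presumably fall back to shuffle joins unless we maintain additional copies with different bucketing keys, which is part of question 4.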