Best practices for migrating a data warehouse to a data lake for Machine Learning

New Contributor

I am looking for best practices/a suggested workflow for a data science project. Our current workflow is as follows:

  1. Source data is stored in MS SQL Server (both a DW with fact & dimension tables and an OLTP database with normalized tables)
  2. We explore the data by writing SQL queries, and the final step of data engineering is to create complex SQL queries/views that combine various data into a single dataset. An example of a resulting dataset would be customer info + demographics + their transactions, etc.
  3. We use Python to load the data into a dataframe and use machine learning libraries to build the models

In our pilot project (Hadoop environment), an EDL has been created by exporting all the source systems into HDFS, with Hive tables created on top of that. Could you suggest the best practices we can follow to get the most out of this? I am interested in the following:

  1. SQL tables had indexes/PKs/FKs, which don't exist in Hive. How do we address this? For example, if we need to join 2-3 large tables in Hive, how do we optimize this? Should we use buckets? Or partitioning? Should we create new tables on top of the Sqoop-exported Hive tables?
  2. A fact table has many key columns and indexes. How do we use buckets or partitioning on those tables?
  3. In SQL we used to create views/derived tables that would be used as input to ML development. What could be the alternative here? A new Hive table created by joining the other Hive tables? Or using Spark to load the different Hive tables and join them in a PySpark workflow to do the heavy lifting?
  4. For the large fact tables loaded into HDFS and Hive, which typically have many key columns (and indexes in SQL), how do we optimize access while joining them? Create multiple versions of the same table with different bucketing/partitioning keys?
2 REPLIES

Super Guru
@Sabyasachi Mukherjee
  1. SQL tables had indexes/PKs/FKs, which don't exist in Hive. How do we address this? For example, if we need to join 2-3 large tables in Hive, how do we optimize this? Should we use buckets? Or partitioning? Should we create new tables on top of the Sqoop-exported Hive tables?

Use both partitioning and bucketing to improve performance.

Partitioning is the technique of distributing data physically across different partitions. Imagine you partition by year/month/day. Now imagine running a query that scans the data with a date in the WHERE clause. Hive in this case will physically scan only the partition that holds that date; you are not doing a full table scan. Partitioning, however, only gets you so far. You cannot have, for example, a million partitions; that would be too much work during query preparation. So you have to find a balance. Some people will tell you that 100K partitions are fine and others might say that you shouldn't go over 50K partitions. I am conservative and suggest no more than 50K partitions per table. Remember that nested partitions like "year/month/day" multiply quickly, so if you have daily data, compare the number of partitions you would get partitioning by day vs. by month and, ideally in my opinion, don't go over 50K partitions. Partitioning is for columns with low cardinality.
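To make this concrete, here is a minimal sketch of a partitioned table and a partition-pruned query, written as PySpark against a Hive metastore (the same DDL could just as well be run in Beeline). The database, table, and column names (dw.transactions_part, txn_year, txn_month, and so on) are illustrative assumptions, not from the original question.

    # Minimal sketch, assuming a SparkSession with Hive support; all names below are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("partitioning-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Partition on low-cardinality columns such as year and month.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS dw.transactions_part (
            txn_id      BIGINT,
            customer_id BIGINT,
            amount      DECIMAL(18,2)
        )
        PARTITIONED BY (txn_year INT, txn_month INT)
        STORED AS ORC
    """)

    # Filtering on the partition columns means only the matching partitions
    # are scanned (partition pruning), not the whole table.
    spark.sql("""
        SELECT customer_id, SUM(amount) AS total_amount
        FROM dw.transactions_part
        WHERE txn_year = 2017 AND txn_month = 6
        GROUP BY customer_id
    """).show()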

Now, you may also have columns like "id" which have high cardinality. Use bucketing for these. Bucketing hashes your ids into a fixed number of buckets. This helps with joins because Hive knows that id x will always hash to the same bucket. It comes in handy when you join two tables on the bucketing column.
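Below is a minimal sketch of the same idea using Spark's DataFrameWriter bucketing (the Hive DDL equivalent is CLUSTERED BY (customer_id) INTO 64 BUCKETS). The table names, the join key customer_id, and the bucket count of 64 are assumptions for illustration.

    # Minimal sketch: bucket both sides of a join on the join key so matching
    # keys land in the same bucket. All table/column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("bucketing-sketch")
             .enableHiveSupport()
             .getOrCreate())

    customers = spark.table("dw.customers_raw")
    orders = spark.table("dw.orders_raw")

    (customers.write
        .bucketBy(64, "customer_id")
        .sortBy("customer_id")
        .saveAsTable("dw.customers_bucketed"))

    (orders.write
        .bucketBy(64, "customer_id")
        .sortBy("customer_id")
        .saveAsTable("dw.orders_bucketed"))

    # The join key is the bucketing column, so the shuffle for the join can
    # be avoided or greatly reduced.
    joined = spark.table("dw.customers_bucketed").join(
        spark.table("dw.orders_bucketed"), "customer_id")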

2. A fact table has many key columns and indexes. How do we use buckets or partitioning on those tables?

Bucket columns with high cardinality and partition on low cardinality columns. Partitioning will reduce the amount of data that will be physically scanned. This is very important for running efficient queries.
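For a fact table, a common pattern is to combine both: partition on a low-cardinality date column and bucket on the key you join on most often. A minimal sketch with Spark's DataFrameWriter follows; the table names, txn_date, customer_id, and the bucket count are assumptions.

    # Minimal sketch: a fact table partitioned by date and bucketed by its
    # most frequently joined key. All names and numbers are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("fact-table-sketch")
             .enableHiveSupport()
             .getOrCreate())

    fact_df = spark.table("dw.fact_sales_raw")   # e.g. a Sqoop-imported staging table

    (fact_df.write
        .partitionBy("txn_date")       # low-cardinality column: prunes scans
        .bucketBy(64, "customer_id")   # high-cardinality key: speeds up joins
        .sortBy("customer_id")
        .format("orc")
        .saveAsTable("dw.fact_sales_opt"))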

3. In SQL we used to create views/derived tables that would be used as input to ML development. What could be the alternative here? A new Hive table created by joining the other Hive tables? Or using Spark to load the different Hive tables and join them in a PySpark workflow to do the heavy lifting?

Views in Hive are logical, not materialized. Someone can correct me if this has changed, but since views are just logical, I am not sure a view alone would help here.

I like the PySpark approach. Create a temp table, run your workflow against it, and drop it at the end.
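A minimal sketch of that workflow, assuming a Hive metastore and hypothetical table/column names (dw.customers, dw.demographics, dw.transactions): join the source tables, register the result as a temporary view for the rest of the job, and hand a sample to Python for model building. The temp view lives only for the Spark session, so there is nothing to drop in Hive afterwards.

    # Minimal sketch of building an ML input dataset with PySpark.
    # All database, table and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("ml-dataset-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # Join the source Hive tables into one wide modelling dataset.
    dataset = spark.sql("""
        SELECT c.customer_id, c.age, d.segment, SUM(t.amount) AS total_spend
        FROM dw.customers c
        JOIN dw.demographics d ON c.customer_id = d.customer_id
        JOIN dw.transactions t ON c.customer_id = t.customer_id
        GROUP BY c.customer_id, c.age, d.segment
    """)

    # Register a temporary view for the rest of the workflow.
    dataset.createOrReplaceTempView("ml_input")

    # Hand a manageable sample to pandas/scikit-learn, or keep it in Spark
    # and use Spark MLlib if the data is too large for a single machine.
    pdf = spark.table("ml_input").sample(fraction=0.1, seed=42).toPandas()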

4. For the large fact tables loaded into HDFS and Hive, which typically have many key columns (and indexes in SQL), how do we optimize access while joining them? Create multiple versions of the same table with different bucketing/partitioning keys?

I think this is the same as questions 1 and 2. Partitioning will be just as good as creating multiple tables unless you have hundreds of thousands of partitions. In that case, maybe divide your table into more tables, but you usually don't really need to do this. Use partitioning with bucketing and you should be able to get good performance.

Super Guru

Hive is a good place for the data. HiveMall can run machine learning in SQL.

Spark SQL + Spark MLlib works great on Hive data.
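As a minimal sketch of that combination (the Hive table dw.ml_customer_features and its columns are assumptions): read the Hive data with Spark SQL and fit a Spark MLlib pipeline on it directly.

    # Minimal sketch: Spark SQL to read Hive data, Spark MLlib to train on it.
    # The table and column names are hypothetical; churned is assumed to be 0/1.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = (SparkSession.builder
             .appName("hive-mllib-sketch")
             .enableHiveSupport()
             .getOrCreate())

    df = spark.sql("SELECT age, total_spend, churned FROM dw.ml_customer_features")

    assembler = VectorAssembler(inputCols=["age", "total_spend"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="churned")

    model = Pipeline(stages=[assembler, lr]).fit(df)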

I have used Apache NiFi to load SQL Server tables and land them into ORC files, with Hive tables on top. From there you can easily run your ML in Python as the data comes in.

I even run TensorFlow on data while it is in motion.

Denormalizing into one wide table often works, because you can never have too many rows in Hive.

https://community.hortonworks.com/articles/58265/analyzing-images-in-hdf-20-using-tensorflow.html
