I am looking for best practices/a suggested workflow for a data science project. Our current workflow is as follows:
In our pilot project (a Hadoop environment), an EDL has been created by exporting all the source systems into HDFS, with Hive tables created on top of that. Could you suggest best practices we can follow to get the most out of it? I am interested in the following:
Use both partitioning and bucketing to improve performance.
Partitioning is the technique of physically distributing data across different partitions. Imagine you partition by year/month/day and then run a query whose where clause filters on a date. Hive will physically scan only the partition that holds that date; you are not doing a full table scan. Partitioning only goes so far, though. You cannot have, for example, a million partitions; that would be too much work during query preparation. So you have to find a balance. Some people will tell you that 100K partitions are fine and others might say you shouldn't go over 50K. I am conservative and suggest no more than 50K partitions per table. Remember that sub-partitions like "year/month/day" multiply quickly, so if you have daily data, compare the number of partitions you get by partitioning by day versus by month, and ideally, in my opinion, don't go over 50K partitions. Partitioning is for columns with low cardinality.
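To see how quickly sub-partitioning multiplies, here is a back-of-the-envelope calculation (the figure of 500 source systems is just a made-up illustration):

```python
# Rough partition counts for 10 years of data under different schemes.
years = 10
by_month = years * 12    # partitioned by year/month
by_day = years * 365     # partitioned by year/month/day (ignoring leap days)
print(by_month, by_day)  # 120 vs 3650 -- both comfortably under 50K

# Add one more partition key, say 500 source systems, and it multiplies:
by_day_and_source = by_day * 500
print(by_day_and_source)  # 1825000 -- far past the ~50K comfort zone
```

This is why a seemingly harmless extra partition column can push a table from a few thousand partitions into metastore-crushing territory.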
Now, you may also have columns like "id" with high cardinality. Use bucketing for these. Bucketing hashes your ids into a fixed number of buckets. This certainly helps with joins, because Hive knows that a given id will always hash to the same bucket. It comes in handy when you join two tables on the bucketing column.
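A toy illustration of the idea (Hive uses its own hash function, but the property that matters is the same: equal keys always land in the same bucket, so two tables bucketed the same way can be joined bucket by bucket):

```python
# Bucketing: hash the bucket column, mod by a fixed bucket count.
NUM_BUCKETS = 32

def bucket_for(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Stand-in for Hive's hash function; the determinism is what matters.
    return hash(key) % num_buckets

# The same id always hashes to the same bucket, in every table that uses
# the same bucket count -- which is what makes bucketed joins cheap.
assert bucket_for(12345) == bucket_for(12345)
```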
2. A fact table has many key columns and indexes. How do we use buckets or partitioning on those tables?
Bucket columns with high cardinality and partition on low cardinality columns. Partitioning will reduce the amount of data that will be physically scanned. This is very important for running efficient queries.
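A toy model of why partition pruning cuts the scan down (table and column names are made up): partitions are just directories, and a predicate on the partition column lets the engine skip whole directories without opening any data files.

```python
# Simulate partition pruning over a table partitioned by year/month.
partitions = [f"sales/year=2023/month={m:02d}" for m in range(1, 13)]

def prune(parts, month):
    # Keep only the partitions the predicate can match; everything else
    # is skipped entirely -- no data files in those directories are read.
    return [p for p in parts if p.endswith(f"month={month:02d}")]

print(prune(partitions, 7))  # only one of twelve partitions gets scanned
```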
3. In SQL - we used to create views/derived tables that would be used as input to ML development. What could be the alternative here? A new hive table by joining the other hive tables? Or use spark to load the different hive tables and join using a pyspark workflow to do the heavy lifting?
Views in Hive are logical and not materialized (someone can correct me if this has changed), so I am not sure a view would help here.
I like the PySpark approach: create a temp table, run your workflow, and drop it at the end.
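A minimal PySpark sketch of that pattern; it assumes a Spark session with Hive support, and all table and column names here are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes Spark is configured with Hive support so metastore tables are
# visible. Table/column names below are made up for illustration.
spark = (SparkSession.builder
         .appName("ml-input")
         .enableHiveSupport()
         .getOrCreate())

fact = spark.table("edl.fact_orders")
dim = spark.table("edl.dim_customer")

# The heavy lifting (joins, filters, aggregations) happens in Spark.
features = (fact.join(dim, on="customer_id")
                .groupBy("customer_id")
                .agg({"order_amount": "sum"}))

# Register as a temporary view for the rest of the workflow; it goes away
# with the session, which matches the "drop it at the end" idea...
features.createOrReplaceTempView("ml_input")

# ...or persist it as a Hive table if other jobs need the same input:
# features.write.mode("overwrite").saveAsTable("analytics.ml_input")
```

The temp-view route avoids cluttering the metastore with one-off derived tables; only persist when several consumers need the same ML input.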
4. For the large fact tables loaded in hdfs and hive which typically has many key columns (and indexes on SQL) how do we optimize the access while joining them? Create multiple version of the same table with different bucketing/partitioning keys?
I think this is the same as questions 1 and 2. Partitioning will be just as good as creating multiple tables unless you have hundreds of thousands of partitions; in that case, maybe split the table into several, but you usually don't need to. Use partitioning together with bucketing and you should get good performance.
Hive is a good place for the data. HiveMall can run machine learning in SQL.
SparkSQL + SparkMLLib works great on Hive data
I have used Apache NiFi to load SQL Server tables, landing them as ORC files with Hive tables on top. From there you can easily run your ML in Python as the data comes in.
I even run TensorFlow on data as it is in motion
Denormalizing into one wide table often works well, since Hive handles very large row counts comfortably.