- SQL tables had index/PK/FK which doesn't exist on Hive. How do we address this? For example if we need to join 2-3 large tables on hive - how do we optimize this? Shall we use buckets? Or partitioning? Shall we use new tables on top of the SQOOP exported hive tables?
Use both partitioning and bucketing to improve performance.
Partitioning is the technique of physically distributing data across separate partitions. Imagine you partition by year/month/day. Now run a query whose WHERE clause filters on a date: Hive will physically scan only the partition that holds that date, not the whole table. Partitioning only goes so far, though. You cannot have, say, a million partitions; query planning would take too long. You have to find a balance. Some people will tell you that 100K partitions are fine and others will say you shouldn't go over 50K. I am conservative and suggest no more than 50K partitions per table. Remember, nested partitions like "year/month/day" multiply quickly. So if you have daily data, compare the number of partitions you would get by partitioning by day versus by month and, in my opinion, stay under 50K. Partition on columns with low cardinality.
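As a minimal sketch of the idea above (table and column names are made up for illustration), a date-partitioned Hive table might look like this:

```sql
-- Hypothetical sales table partitioned by date components.
-- Each (yr, mo, dy) combination becomes its own directory in HDFS.
CREATE TABLE sales (
  order_id BIGINT,
  amount   DECIMAL(10,2)
)
PARTITIONED BY (yr INT, mo INT, dy INT)
STORED AS ORC;

-- Because the WHERE clause filters on the partition columns,
-- Hive prunes to a single partition instead of scanning the table.
SELECT SUM(amount)
FROM sales
WHERE yr = 2020 AND mo = 3 AND dy = 15;
```

With daily data this layout produces roughly 365 partitions per year, so it stays far below the 50K guideline for many years of history.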
Now, you may also have columns like "id" with high cardinality. Use bucketing for those. Bucketing hashes your ids into a fixed number of buckets. This helps with joins because Hive knows that a given id always hashes to the same bucket, which comes in handy when you join two tables on the bucketing column.
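A hedged sketch of what that looks like in DDL (names and bucket counts are illustrative, not a recommendation): bucket both sides of the join on the join key with the same bucket count, and Hive can match buckets pairwise instead of shuffling everything.

```sql
-- Two hypothetical tables bucketed on the join key.
-- Using the same bucket count on both sides lets Hive
-- join bucket i of one table against bucket i of the other.
CREATE TABLE customers (
  id   BIGINT,
  name STRING
)
CLUSTERED BY (id) INTO 64 BUCKETS
STORED AS ORC;

CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
CLUSTERED BY (customer_id) INTO 64 BUCKETS
STORED AS ORC;

-- Ask Hive to exploit the bucketing at join time.
SET hive.optimize.bucketmapjoin = true;

SELECT c.name, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```

If the bucketed data is also sorted on the join key, Hive can additionally use a sort-merge bucket join, but that requires extra settings and sorted inserts.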
2. A fact table has many key columns and indexes. How do we use buckets or partitioning on those tables?
Bucket columns with high cardinality and partition on low cardinality columns. Partitioning will reduce the amount of data that will be physically scanned. This is very important for running efficient queries.
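Putting both techniques together on one fact table might look like the following sketch (all names are hypothetical): partition on the low-cardinality date column, bucket on a high-cardinality key you join on.

```sql
-- Hypothetical fact table: partitioned on a low-cardinality column
-- (order_date) so queries prune partitions, and bucketed on a
-- high-cardinality join key (customer_id) to speed up joins.
CREATE TABLE fact_orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (order_date STRING)
CLUSTERED BY (customer_id) INTO 128 BUCKETS
STORED AS ORC;
```

You can only bucket on one set of columns per table, so pick the key used in your most expensive joins.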
3. In SQL - we used to create views/derived tables that would be used as input to ML development. What could be the alternative here? A new hive table by joining the other hive tables? Or use spark to load the different hive tables and join using a pyspark workflow to do the heavy lifting?
Views in Hive are logical, not materialized - someone can correct me if this has changed, but my understanding is that a view is just a stored query, so I am not sure a view alone would help here.
I like the pyspark approach: create a temp table, run your workflow against it, and drop it at the end.
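The same temp-table pattern can also be done in plain HiveQL (table and column names here are made up); a sketch:

```sql
-- Materialize the joined input once as a session-scoped temporary table.
CREATE TEMPORARY TABLE ml_input AS
SELECT f.customer_id,
       f.amount,
       d.segment
FROM fact_orders f
JOIN dim_customer d
  ON f.customer_id = d.customer_id;

-- ... run the feature-extraction / ML queries against ml_input ...

-- Clean up explicitly; temporary tables are also dropped
-- automatically when the session ends.
DROP TABLE ml_input;
```

This avoids re-running the join for every downstream query while leaving nothing permanent behind in the metastore.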
4. For the large fact tables loaded in hdfs and hive, which typically have many key columns (and indexes on SQL), how do we optimize access while joining them? Create multiple versions of the same table with different bucketing/partitioning keys?
I think this is the same as questions 1 and 2. Partitioning will be just as good as creating multiple tables unless you have hundreds of thousands of partitions. In that case, maybe split your table into more tables, but you usually don't need to. Use partitioning together with bucketing and you should get good performance.