Cloudera Impala may have somewhat different characteristics (it works best when you join one big fact table with small dimension tables that fit into main memory).
For Hive in general, however, I can give some tips. Historically you were supposed to keep your schema simple and flat because Hive didn't provide the features expected of a warehouse.
Hive has advanced a lot since then. It now has a good cost-based optimizer (a prerequisite for complex queries), table and column statistics, and other performance features such as ORC files with columnar storage and predicate pushdown.
So, to keep it short: you can design a star schema much as you would in any other data warehouse.
- I would still not go overboard with deeply nested dimension hierarchies, though. Keep it simple. Storage is cheap, so you don't have to normalize everything if denormalization gives better performance.
- Hive ACID is still very new, so ideally design your dimensions so that they can be reloaded completely, or recreated using CTAS (CREATE TABLE AS SELECT). Recreating datasets is also cheap.
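A minimal sketch of that rebuild-instead-of-update approach in HiveQL; the table and column names are made up for illustration:

```sql
-- Rebuild the dimension from source data instead of updating it in place
CREATE TABLE dim_customer_new
STORED AS ORC
AS
SELECT customer_id, name, segment
FROM staging_customers;

-- Swap the new version in via renames, then drop the old copy
ALTER TABLE dim_customer RENAME TO dim_customer_old;
ALTER TABLE dim_customer_new RENAME TO dim_customer;
DROP TABLE dim_customer_old;
```

This avoids relying on ACID UPDATE/DELETE support for slowly changing dimensions.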
But apart from that the rules are not too different from other databases.
- Fact table entries should not change once created (apart, perhaps, from adding columns).
- Partition the fact table by date (and make sure partition pruning is actually used in queries).
- Sort data during inserts on other heavily filtered columns so predicate pushdown can skip data.
- As said, space is cheap, so denormalize if needed.
- Use ANALYZE TABLE commands to generate statistics for the optimizer.
- Check the query plan to verify that the correct join type is used (e.g. a map-side join when joining a small table with a big one).
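The points above could look roughly like the following in HiveQL; all table and column names are hypothetical:

```sql
-- Date-partitioned, ORC-backed fact table
CREATE TABLE fact_sales (
  customer_id BIGINT,
  product_id  BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

-- Sort on a heavily filtered column during insert so ORC
-- predicate pushdown can skip stripes at read time
INSERT OVERWRITE TABLE fact_sales PARTITION (sale_date='2016-01-01')
SELECT customer_id, product_id, amount
FROM staging_sales
WHERE sale_date = '2016-01-01'
SORT BY customer_id;

-- Generate statistics for the cost-based optimizer
ANALYZE TABLE fact_sales PARTITION (sale_date='2016-01-01')
  COMPUTE STATISTICS FOR COLUMNS;

-- Filtering on the partition column enables partition pruning;
-- EXPLAIN shows whether pruning and map-side joins are applied
EXPLAIN
SELECT SUM(amount)
FROM fact_sales
WHERE sale_date = '2016-01-01';
```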
Hi @sukumar balla,
In addition to @Benjamin Leonhardi's comment: if you work with a star schema in Hive/Impala, take a step back and reflect on whether you actually need it and whether the overall data flow in your infrastructure is correct. It may be perfectly reasonable, but it can also be a sign of a bigger issue.
Avoid copying and pasting your existing data storage and processing strategies from RDBMS to Hive and Impala. Bear with my answer; it may seem a little tangential to your question, but I think it is essential to understand the strategic context within which one works in order to make reasonable tactical, day-to-day decisions.
Companies often suffer from fragmented storage systems (silos) and incoherent, inconsistent data that has to be joined manually, with traditional systems struggling with (by their design) expensive table scans. As a quick fix, organisations often copy data, e.g. via Sqoop, into their Hadoop cluster to do some analysis or ETL, and then export it back into their RDBMS or access it in the cluster via ODBC. And usually they are dissatisfied with the outcome. This is not an issue with the tooling, though, but with how it is being used.
I have seen a lot of people approach Hive and Impala in the same fashion they are used to from RDBMSs. Naturally, there are a few differences: an RDBMS can be optimised for small data access and joining via indices, while the big data approach is optimised for large-scale data processing. While both can be bent to serve either purpose (see Hive's newer ACID support and columnar storage via ORC, for example), planning your infrastructure correctly can remove a lot of pain.
Think about what the 'raw' immutable events are that you want to store and how you want to represent them based on your needs, e.g. access patterns and analytics. Then consider whether you can reorganise the flow of data in your organisation (step by step) to cut down on the existing systems. If you don't need near real-time processing you can, for example, copy data from source systems (logs, CSV, JSON messages, etc.) regularly into HDFS and store it as immutable facts. Try to modify it as little as possible: only conform it (e.g. date normalisation) and optimise it (compress and partition the storage locations). This data will be the foundation of your data lake.

On top of this dataset you can run regular processes to build derived, optimised datasets (think of them as something akin to materialised views), which may join various core datasets and even denormalise or join in facts as needed. This data may also be exported into data marts for optimised access patterns, or into third-party systems.
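As a rough sketch, assuming raw events have been landed in HDFS and using hypothetical table names and paths, such a derived dataset could be built along these lines:

```sql
-- Raw, immutable events as landed from source systems,
-- partitioned by ingestion date and left mostly untouched
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  user_id    BIGINT,
  payload    STRING
)
PARTITIONED BY (ingest_date STRING)
LOCATION '/data/raw/events';

-- Derived, optimised dataset built on top of the raw facts,
-- akin to a materialised view: conformed, aggregated, columnar
CREATE TABLE events_daily
STORED AS ORC
AS
SELECT to_date(event_time) AS event_day,
       user_id,
       COUNT(*)            AS event_count
FROM raw_events
GROUP BY to_date(event_time), user_id;
```

Such derived tables can be rebuilt regularly from the raw facts, so they stay disposable while the raw data remains the system of record.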
Note that if you use your data lake as a multi-tenant environment, ensure you have data governance and appropriate access limitations in place, to prevent people from circumventing your overall data flow pattern (ideally uni-directional) or, worse, building business-critical processes on top of unsupported datasets.
Hi @sukumar balla,
If the goal of the star schema is to provide analytics to end users, you could also consider advanced BI tools like Platfora for data discovery, analytics, and reporting. It integrates nicely with data stored in Hadoop and leverages its computational abilities, i.e. it does a lot of the processing in YARN. Platfora provides an interest-driven workflow where the raw data of interest is pulled into a 'lens', cutting down the time it takes to ETL/ELT using traditional Pig/Hive scripts. This is done via a web interface, making analytics accessible to a large group of users in the organisation. Additionally, the lens-building process is agile because Platfora fires Spark jobs in the background on the Hadoop cluster to build the lens. End users can then draw pixel-perfect reports/charts in little time using drop zones on the Vizboards.