Druid is an OLAP solution for streaming event data as well as for long-term storage. All Druid data requires a timestamp, and Druid’s storage architecture is keyed on that timestamp, much as HBase stores data by key (a conceptual sketch of this idea follows the list below). Key benefits of Druid include:

  • Real-time EDW on event data (time-series)
  • Long-term storage leveraging HDFS
  • High availability
  • Extremely performant querying over large data sets
  • Aggregation and indexing
  • High level of data compression
  • Hive integration
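
To make the timestamp requirement concrete, here is a minimal sketch in plain Python of the rollup-by-time idea: incoming events are truncated to a time granularity and aggregated as they arrive. The event records and field names are hypothetical, and this illustrates the concept only, not Druid’s actual ingestion API.

```python
# A minimal sketch (plain Python, hypothetical data) of the idea behind
# Druid's timestamp-keyed storage and ingestion-time rollup: raw events
# are truncated to a time granularity and aggregated as they arrive.
from collections import defaultdict
from datetime import datetime

# Hypothetical raw event stream: every record carries a timestamp.
events = [
    {"ts": "2022-09-16T08:01:12", "product": "widget", "qty": 2},
    {"ts": "2022-09-16T08:17:45", "product": "widget", "qty": 1},
    {"ts": "2022-09-16T09:03:09", "product": "gadget", "qty": 5},
]

def truncate_to_hour(ts: str) -> str:
    """Truncate an ISO timestamp to hourly granularity."""
    return datetime.fromisoformat(ts).strftime("%Y-%m-%dT%H:00:00")

# Rollup: one aggregated row per (hour, product) instead of one row per event.
rollup = defaultdict(int)
for e in events:
    rollup[(truncate_to_hour(e["ts"]), e["product"])] += e["qty"]

for (hour, product), total_qty in sorted(rollup.items()):
    print(hour, product, total_qty)
```

Because every row is keyed by time, queries over a time range only have to touch the segments for that range, which is a large part of why Druid queries over big data sets stay fast.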

Druid solves a specific set of problems that are difficult to address with other technologies. That said, there are cases where Druid is not a good fit:

  • Data without a timestamp
  • No need for real-time streaming
  • Normalized (transactional) data (no joins in Druid)
  • Small data sets
  • No need for aggregating measures
  • Non-BI workloads, such as Spark jobs or streaming lookups

Why Druid?

BD (Before Druid)

In traditional EDWs, data is broken into dimension tables and fact tables. Dimensions describe an object. For example, a product dimension will have colors, sizes, names, and other descriptors of the product. Dimensions are always descriptors of something, whether it is a product, a store, or the one dimension that is part of every EDW: date. In addition to dimensions, EDWs have facts, or measures. Measures are always numbers that can be added. For example, the number 10 can be a measure, but an average cannot: you can add 10 to another number, but adding two averages does not make numerical sense.
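
As a small illustration of additivity (using made-up numbers), summing a quantity across groups is valid, but averaging two group averages is not; an average has to be derived from additive components such as sum and count:

```python
# A small arithmetic sketch of why measures must be additive: quantities
# can be summed across groups, but averaging two averages gives the wrong
# answer unless the counts are carried along as well.
store_a = [10, 20, 30]        # three sales
store_b = [40]                # one sale

# Additive measure: totals can simply be added.
total = sum(store_a) + sum(store_b)            # 100 -- correct

# Non-additive: averaging the two store averages is wrong.
avg_of_avgs = (sum(store_a) / len(store_a) + sum(store_b) / len(store_b)) / 2
# (20 + 40) / 2 = 30, but the true average is 100 / 4 = 25.

# Correct approach: store additive components (sum and count), derive the average.
true_avg = total / (len(store_a) + len(store_b))   # 25.0
print(avg_of_avgs, true_avg)
```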

The reason for dimensions and facts is two-fold. First, they are a means to denormalize the data and reduce joins; most EDWs are architected so that you need no more than two joins to answer any question. Second, dimensions and facts map easily to business questions (see Agile Data Warehouse Design in the reference section). For example, take the following question:

“How many of product x were purchased last month in store y?”

We can dissect this sentence as follows: product, month, and store are all dimensions, while the question “how many” is the fact, or measure. From that single question you can begin building your star schema:

Figure 1: Star Schema
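
Below is a hypothetical, minimal sketch in plain Python of what such a star schema might look like: three small dimension tables keyed by surrogate keys, a fact table that references them, and the business question answered by filtering and summing the additive measure. All table and column names here are invented for illustration.

```python
# A hypothetical, minimal star schema in plain Python: three dimension
# tables and a fact table whose rows reference them by surrogate key.
dim_product = {1: {"name": "product x", "color": "red", "size": "M"}}
dim_store   = {1: {"name": "store y", "city": "Austin"}}
dim_date    = {20220801: {"year": 2022, "month": 8, "day": 1}}

# Fact table: one row per product sold in a store for a time period,
# holding surrogate keys plus the additive measure (quantity sold).
fact_sales = [
    {"product_key": 1, "store_key": 1, "date_key": 20220801, "qty": 3},
    {"product_key": 1, "store_key": 1, "date_key": 20220801, "qty": 2},
]

# "How many of product x were purchased last month in store y?"
answer = sum(
    row["qty"]
    for row in fact_sales
    if dim_product[row["product_key"]]["name"] == "product x"
    and dim_store[row["store_key"]]["name"] == "store y"
    and dim_date[row["date_key"]]["month"] == 8
)
print(answer)  # 5
```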

The fact table will have a single row for each unique product sold in a particular store during a particular time frame. The difference between an EDW and an OLAP system is that the OLAP system pre-aggregates this answer: prior to the query, a process runs that anticipates the question and adds up the sales totals for all products, stores, and time ranges.
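
The following sketch (again with hypothetical data and names) shows that pre-aggregation idea: the fact rows are grouped ahead of time by product, store, and month, so that at query time the answer is a simple lookup rather than a scan.

```python
# A sketch of OLAP-style pre-aggregation: the expected question is answered
# ahead of time by grouping the fact rows and storing the totals, so the
# query becomes a lookup. Data and keys are hypothetical.
from collections import defaultdict

fact_sales = [
    {"product": "product x", "store": "store y", "month": "2022-08", "qty": 3},
    {"product": "product x", "store": "store y", "month": "2022-08", "qty": 2},
    {"product": "product z", "store": "store y", "month": "2022-08", "qty": 7},
]

# Pre-aggregation step, run before any query arrives.
cube = defaultdict(int)
for row in fact_sales:
    cube[(row["product"], row["store"], row["month"])] += row["qty"]

# At query time the pre-computed answer is just looked up.
print(cube[("product x", "store y", "2022-08")])  # 5
```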

This is fundamentally why, in traditional EDW development, all possible questions needed to be fleshed out before the schemas were built. The questions being asked define how the model is designed, which makes traditional EDW development extremely difficult, error-prone, and expensive.

Interviewing lines of business (LOBs) to find out what questions they might ask the system or, more likely, looking at existing reports and trying to reproduce that data in an EDW design was only the first step. Once the EDW was built, you still had to work on what is called the “semantic layer”. This is the point where you instruct the OLAP tool how to aggregate the data. Tools like SQL Server Analysis Services (SSAS) are complicated and require a deep understanding of OLAP concepts. They are based on the Kimball methodology and therefore, to some extent, require the schema to look as much like a star schema as possible.

Figure 2: SSAS

In these tools, the first thing you need to do is define hierarchies. The easiest hierarchy to define is date, which always follows the pattern year, month, day, hour, minute, second. Other hierarchies include geography: country, state, county, city, zip code. Hierarchies are important in OLAP because they describe how the user will drill through the data and how the data will be aggregated at each level of the hierarchy. The semantic layer is also where you define what the analyst will actually see in their visualization tools; for example, exposing an EDW surrogate key would only confuse the analyst. In the Hadoop space, the semantic layer is handled by vendors and software such as Jethrodata, AtScale, Kyvos, and Kylin (open source).
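
As a rough illustration of hierarchies and level-by-level roll-up (hypothetical records, plain Python rather than any particular OLAP tool), the same measure can be aggregated at the year, month, and day levels of a date hierarchy simply by truncating the date key to coarser granularities:

```python
# A minimal sketch of an OLAP hierarchy and level-by-level roll-up:
# the same measure aggregated at the year, month, and day levels of a
# date hierarchy. The records are hypothetical.
from collections import defaultdict

sales = [
    {"date": "2022-08-01", "amount": 100},
    {"date": "2022-08-15", "amount": 250},
    {"date": "2022-09-16", "amount": 75},
]

# Each hierarchy level is just a coarser truncation of the date key.
levels = {
    "year":  lambda d: d[:4],    # 2022
    "month": lambda d: d[:7],    # 2022-08
    "day":   lambda d: d[:10],   # 2022-08-01
}

for level, key_fn in levels.items():
    totals = defaultdict(int)
    for row in sales:
        totals[key_fn(row["date"])] += row["amount"]
    print(level, dict(totals))
```

Drilling down from year to month to day is then just a matter of switching which level’s totals are shown.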
