Created on 03-27-201803:56 PM - edited 09-16-202201:42 AM
AD (After Druid)
In my opinion, Druid creates a new analytic service that I term the real-time EDW. In traditional EDWs the process of getting from the EDW to an optimized OLAP could take hours depending on the size of the EDW. Having a cube take six or more hours was not unusual for a lot of companies. The process also tended to be brittle and needed to be closely monitored.
In addition to real-time, Druid also facilitates long-term analytics which effectively provides a Lambda architecture for your data warehouse. Data streams into Druid and is held in-memory for a configurable amount of time. While in-memory the data can be queried and visualized. After a period of time the data is then passed to long-term (historical) storage as segments on HDFS. These segments can also be part of the same visualization as the real-time data.
As mentioned previously, all data in Druid contains a timestamp. The other data elements consist of the same properties as traditional EDWs: dimensions and measures. The timestamp simplifies the aggregation and Druid is completely denormalized into a single table. Remember dimensions are descriptions or attributes and measures are always additive numbers. Since this is always true, it is easy for Druid to infer in the data what are dimensional attributes and what are measures. For each timestamp duration Druid can, in real-time, aggregate facts along all dimensional attributes. This makes Druid ideal for topN, timeseries, and group-by with group-by being the least performant.
The challenges around Druid and other No-SQL type technologies like MongoDB is in the visualization layer as well as the architectural and storage complexities. Durid stores json data and the json data can be difficult to manage and visualize in your standard tools such as Tableau or PowerBI. This is where the integration between Druid and Hive becomes most useful. There is a three-part series describing the integration:
The integration provides a single pane of glass against real-time pre-aggregated cubes, standard Hive tables, and historical OLAP data.
More importantly, the data can be accessed through standard ODBC and JDBC visualization tools as well as managed and secure3d through Ambari, Ranger, and Atlas. Druid provided out-of-the-box lambda architecture for time-series data and, coupled with Hive, we now provide for the flexibility and ease-of-access associated with standard RDBMS’s.