Kylin Pronounced “KEY LIN” / “CHI LIN” - This project brings OLAP (Online Analytical Processing) to Big Data. It is a top-level project in Apache. Through it’s UI, you can create a logical model (dimensions/measures) from a star schema in Hive. Kylin will then create cube aggregates using MR and put the aggregates and cube metadata into HBase. Users can then query the cube data through the Kylin UI or a BI tool that uses the Kylin odbc driver.
Cube - A data structure containing dimensions and measures for quickly accessing aggregated information (measures) across many axis's (dimensions)
Cuboid - A "slice" or subset of a cube
Dimensions - Think of these as alphanumeric columns that sit in a group by clause of SQL. i.e. Location, Department, Time, etc.
Measure - Think of these as metric/numerical values that sit in a select clause of SQL. i.e. Sum(value), Max(bonus), Min(effort)
Kylin needs HBase, Hive and HDFS (Nice!) Regarding HDFS, it does alot of processing in MR by creating aggregate data for each N-Cuboid of a cube. These jobs output HFiles for HBase. In turn, HBase stores cube metadata and cube aggregates in HBase. This makes sense for quick fetching of aggregate data. For cube aggregate levels in HBase, dimensions are row keys in HBase, columns are the measure values. Hive is used for the data modeling. Data needs to be in star schema like format in Hive. Also, base level data resides in Hive and not the cube. The cube contains only aggregate data.
- Use Kylin if you have alot of interactive querying on a smaller number of dimensions, your measures/metrics are simple aggregates and the data doesn't need to be viewed in real-time.
- Sql ansi compliant
- Connectivity to BI tools
- Can use hierarchies
- Needs HDFS, HBase & Hive
- Has a UI
- Does incremental cube updates
- Uses Calcite for Query optimizer
- MR overhead with building cubes (“query yesterdays data”). Lots of shuffling. Does aggregations on the reduce side
- No cell level security. Security at a cube and project level.
- Simple measures only (counts, max, min and sum). No custom calcs, ratios, etc.
- 20 dimensions seem like a practical upper limit
- For larger cubes, it does pre-aggregation and then aggregation at runtime (may result in query latencies)
- No Ambari view
There is security on projects and cubes, no cell level security. One idea around security is to create smaller cubes (i.e. segments) to create security for users / groups. LDAP is also an option.
What's in HBASE?
Metadata and cube data. If you list the tables in HBase, you’ll see this:
KYLIN_XXXXXXXXXXX (This is the Cube)
Kylin has its own ODBC driver and can be used with Tableau / Excel. With Tableau, make sure you connect with Live data as opposed to import.
Kylin only puts aggregates in Hbase, base level data is still in Hive. (I.e. Kylin doesn’t do table scans)
eBay (26TB / 16B rows) -> 90% of queries with <5sec latency
MDX adoption is very low, therefore its not currently supported
You can build up a cube of cubes (daily -> weekly —>monthly, etc). These are called segments. The more segments the slower performance can get (more scans)
1) Thinking about using Spark to speed up cubing MR jobs