Kylin (pronounced “KEY LIN” / “CHI LIN”) brings OLAP (Online Analytical Processing) to Big Data. It is a top-level Apache project. Through its UI, you can create a logical model (dimensions and measures) from a star schema in Hive. Kylin then builds cube aggregates using MapReduce (MR) and stores the aggregates and cube metadata in HBase. Users can then query the cube data through the Kylin UI or a BI tool that uses the Kylin ODBC driver.

A good overview video of the project from the committers: https://www.youtube.com/watch?v=7iDcF7pNhV4

Definitions

Cube - A data structure containing dimensions and measures for quickly accessing aggregated information (measures) across many axes (dimensions).

Cuboid - A "slice" or subset of a cube

Dimensions - Think of these as alphanumeric columns that sit in the GROUP BY clause of a SQL query, e.g. Location, Department, Time.

Measure - Think of these as metric/numerical values that sit in the SELECT clause of a SQL query, e.g. Sum(value), Max(bonus), Min(effort).
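
To make the two definitions concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than Kylin itself. The table name `sales`, its columns and the sample rows are made up for illustration; the point is only where dimensions and measures sit in a query.

```python
import sqlite3

# Hypothetical fact table; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (location TEXT, department TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("NYC", "Toys", 10.0),
        ("NYC", "Toys", 5.0),
        ("NYC", "Books", 7.0),
        ("SF", "Toys", 3.0),
    ],
)

# Dimensions appear in GROUP BY; measures are the aggregates in SELECT.
rows = conn.execute(
    """
    SELECT location, department,      -- dimensions
           SUM(value) AS total,       -- measure
           MAX(value) AS largest      -- measure
    FROM sales
    GROUP BY location, department
    ORDER BY location, department
    """
).fetchall()

for row in rows:
    print(row)
```

Each output row is one combination of dimension values together with its aggregated measures, which is the kind of result a cube is meant to serve quickly.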

Technical Overview

Kylin needs HBase, Hive and HDFS (nice!). On top of HDFS, it does a lot of processing in MR, creating aggregate data for each N-cuboid of a cube. These jobs output HFiles for HBase, which stores both the cube metadata and the cube aggregates. This makes sense for quick fetching of aggregate data. At each cube aggregate level in HBase, the dimensions form the row keys and the columns hold the measure values. Hive is used for the data modeling, and the data needs to be in a star-schema-like format in Hive. Base-level data resides in Hive, not in the cube; the cube contains only aggregate data.
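
The idea of precomputing an aggregate for every cuboid, keyed by dimension values, can be sketched in plain Python. This is a simplification under stated assumptions, not Kylin's implementation: the rows, the `build_cube` helper and the dimension names are hypothetical, a dictionary stands in for HBase, and Kylin's actual row key encoding is not shown.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical base-level rows, as they might sit in a Hive fact table.
DIMENSIONS = ("location", "department")
rows = [
    {"location": "NYC", "department": "Toys", "value": 10},
    {"location": "NYC", "department": "Books", "value": 7},
    {"location": "SF", "department": "Toys", "value": 3},
]

def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every cuboid (every subset of dimensions)."""
    cube = {}
    for n in range(len(dimensions), -1, -1):
        for cuboid in combinations(dimensions, n):
            sums = defaultdict(int)
            for row in rows:
                # The tuple of dimension values plays the role of a row key.
                key = tuple(row[d] for d in cuboid)
                sums[key] += row[measure]
            cube[cuboid] = dict(sums)
    return cube

cube = build_cube(rows, DIMENSIONS, "value")

# A query grouped by location is now a lookup, not a scan of the base rows.
print(cube[("location",)])
```

With two dimensions the sketch produces four cuboids: (location, department), (location), (department) and the grand total.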

The Good

- Use Kylin if you have a lot of interactive querying on a smaller number of dimensions, your measures/metrics are simple aggregates, and the data doesn't need to be viewed in real time.

- ANSI SQL compliant

- Connectivity to BI tools

- Can use hierarchies

- Needs HDFS, HBase & Hive

- Has a UI

- Does incremental cube updates

- Uses Calcite as the query optimizer

Cautions

- MR overhead when building cubes (“query yesterday's data”). Lots of shuffling. Aggregations are done on the reduce side.

- No cell level security. Security at a cube and project level.

- Simple measures only (counts, max, min and sum). No custom calcs, ratios, etc.

- 20 dimensions seem like a practical upper limit

- For larger cubes, it does pre-aggregation and then aggregation at runtime (may result in query latencies)

- No Ambari view

Security

Security is applied at the project and cube level; there is no cell-level security. One idea is to create smaller cubes (i.e. segments) for specific users / groups. LDAP is also an option.

What's in HBase?

Metadata and cube data. If you list the tables in HBase, you’ll see this:

KYLIN_XXXXXXXXXXX (This is the Cube)

kylin_metadata

kylin_metadata_acl

kylin_metadata_user

Other Thoughts...

  • Kylin has its own ODBC driver and can be used with Tableau / Excel. With Tableau, make sure you connect with Live data as opposed to import.
  • Kylin only puts aggregates in HBase; base-level data is still in Hive (i.e. Kylin doesn’t do table scans).
  • eBay (26TB / 16B rows) -> 90% of queries with <5sec latency
  • MDX adoption is very low, so it's not currently supported
  • You can build up a cube of cubes (daily -> weekly -> monthly, etc.). These are called segments. The more segments, the slower performance can get (more scans).

Roadmap

Streaming Cubes

Spark

1) Thinking about using Spark to speed up cubing MR jobs

2) Source from SparkSQL instead of Hive

3) Route queries to SparkSQL

Comments

Hello ccasano,

This is Shaofeng from the Apache Kylin community. Thanks for trying Kylin and composing such a good summary; we think it is very helpful for end users.

Here I want to add some comments regarding some of your questions:

1. "base level data resides in Hive and not the cube. The cube contains only aggregate data."

comment: You're correct; so far Kylin doesn't save the raw data, but this is under development. In a future release, users will be able to query both aggregated data and raw data from Kylin: https://issues.apache.org/jira/browse/KYLIN-1122

2. "MR overhead with building cubes (“query yesterday's data”). Lots of shuffling. Does aggregations on the reduce side"

comment: You're right, the current cube algorithm may cause a lot of shuffling among the Hadoop nodes. We have realized this and introduced a new algorithm called "fast-cubing", which does mapper-side aggregation, aiming to reduce the network IO and the MR time; it will be released in Kylin v2.0.
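
As a rough illustration of why mapper-side aggregation reduces shuffling, the sketch below compares how many records cross the shuffle in each approach. It shows only the general pre-aggregation idea, not the fast-cubing algorithm itself; the `splits` data and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input splits: each inner list is the data seen by one mapper,
# as (dimension_value, measure) pairs.
splits = [
    [("NYC", 10), ("NYC", 5), ("SF", 3)],
    [("NYC", 7), ("SF", 1), ("SF", 2)],
]

def reduce_side(splits):
    """Mappers emit every record; all aggregation happens after the shuffle."""
    shuffled = [record for split in splits for record in split]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def map_side(splits):
    """Each mapper pre-aggregates its split, so fewer records are shuffled."""
    shuffled = []
    for split in splits:
        partial = defaultdict(int)
        for key, value in split:
            partial[key] += value
        shuffled.extend(partial.items())
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

print(reduce_side(splits))
print(map_side(splits))
```

Both functions return the same totals, but the mapper-side version sends four records through the shuffle instead of six. The saving grows with the number of repeated keys per split.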

3. "Simple measures only (counts, max, min and sum). No custom calcs, ratios, etc."

comment: the custom measure support is on the way, and will be released soon: https://issues.apache.org/jira/browse/KYLIN-976

4. "16 dimensions seem like an upper limit, but that's not confirmed."

comment: The real upper limit is 64, but we usually suggest that users control the cube expansion rate by picking no more than 20 dimensions.
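
The expansion concern can be illustrated with simple arithmetic: a full cube over N dimensions has one cuboid per subset of dimensions, i.e. 2^N cuboids. The helper below is a hypothetical sketch that assumes no pruning at all; any reduction Kylin applies (for example through hierarchies) is not modeled.

```python
def cuboid_count(num_dimensions: int) -> int:
    """A full cube over N dimensions has one cuboid per subset of dimensions."""
    return 2 ** num_dimensions

for n in (10, 16, 20, 64):
    print(f"{n} dimensions -> {cuboid_count(n):,} cuboids")
```

At 20 dimensions the unpruned count is already over a million cuboids, which is why the dimension count is worth keeping small.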

5. “MDX adoption is very low, therefore it's not currently supported”

comment: If anyone wants to do MDX using Kylin, we suggest trying Mondrian; some users have been able to run that successfully. Here is a document: https://www.inovex.de/fileadmin/files/Vortraege/2015/big-data-mdx-with-mondrian-and-apache-kylin-seb...

6. "The more segments the slower performance can get (more scans)"

comment: Yes, we usually suggest that users limit the number of segments, e.g. to no more than 10, for better query performance. But if the partition date column appears in the filtering condition, Kylin can skip the unrelated segments; in this case, the number of segments doesn't matter.
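
The segment-skipping behaviour described here can be sketched as a date-range overlap check. This is an illustrative model only, not Kylin code: the `Segment` class, the `segments_to_scan` helper and the monthly segments are hypothetical, and ranges are treated as half-open [start, end).

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for a cube segment covering [start, end)."""
    name: str
    start: date
    end: date

def segments_to_scan(segments, filter_start, filter_end):
    """Keep only segments whose range overlaps [filter_start, filter_end)."""
    return [s for s in segments if s.start < filter_end and filter_start < s.end]

segments = [
    Segment("2016-01", date(2016, 1, 1), date(2016, 2, 1)),
    Segment("2016-02", date(2016, 2, 1), date(2016, 3, 1)),
    Segment("2016-03", date(2016, 3, 1), date(2016, 4, 1)),
]

# A filter on the partition date column restricts the scan to one segment.
hit = segments_to_scan(segments, date(2016, 2, 10), date(2016, 2, 20))
print([s.name for s in hit])
```

Without a filter on the partition date column, every segment would have to be scanned, which is where the cost of a large segment count comes from.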

(Sorry for splitting my comments into several posts; this forum doesn't allow comments of more than 600 characters.)

You are welcome to join our dev mailing list, dev@kylin.apache.org (send an email to dev-subscribe@kylin.apache.org to subscribe); there are a lot of discussions there.

Regards,

shaofengshi@apache.org

@Shaofeng Shi Thanks for sharing all the comments. I wonder if it's possible to post them as an article... Please.