Unifying Analytics in Cloudera Data Warehouse

Cloudera Employee

Data warehouse users require constantly improving performance, regardless of data volumes or number of end users, with tools that are ever easier to use. Cloudera just released a capability that answers each of those calls, providing significant performance improvements that apply regardless of scale and that are almost trivial to set up. We are pleased to announce the general availability of Unified Analytics within Cloudera Data Warehouse (CDW) - our cloud native data warehouse service available in Cloudera Data Platform (CDP).

 

Unified Analytics (UA) boosts performance by orders of magnitude for BI users via intelligent materializations that are applied transparently. These include materialized views, results caching, and approximate algorithms. It further increases performance by enhancing our query optimizer, which especially benefits the complex analytic queries that are increasingly common in lakehouse-style architectures. Lastly, it simplifies development for the data practitioner by exposing a single front end for both BI and ETL style queries.

 

Cloudera has long had the most scalable, performant, and cost effective data warehouse. With UA we have raised the bar even higher. In this blog we will walk you through the new and exciting UA feature set. We will explain why this capability is important, and dive deeper into the key features that it offers. UA is available now in the public cloud version of CDW, and will be available in the private cloud version around mid-year.

 

Cloud-native data warehouses, like CDW, are fast becoming the de facto industry standard. They dramatically shorten time to value, lower administrative burdens, and promise continuous agility in response to changing business demands. This allows them to appeal not just to backend IT experts, but also to front-end users in the lines of business (LOBs).

 

CDW was designed to take cloud-native data warehousing to the next level in terms of performance and ease of use. It offers two highly scalable and performant Apache open-source SQL query engines, Impala and Hive LLAP, both proven in over a thousand enterprises for business-critical data warehousing. However, having two SQL engines in a single data warehousing service, while offering versatility, also causes adoption friction. A key focus for CDW is therefore to simplify the service further, offering a better user experience while still supporting the variety of BI use cases that the two engines serve. The UA capabilities being introduced in CDW are a major milestone in this journey.

 

Why introduce Unified Analytics?

 

In addition to the general goals of significantly improving performance for BI workloads, we also set out to solve the following user experience and adoption challenges with the UA project:

 

  • Unify Hive LLAP and Impala such that they are as transparent to the end users as possible. UA enables a common SQL syntax and semantics across the two engines such that existing applications can be migrated to CDW with minimal rewrites. UA also allows application end users to focus on business logic instead of worrying about specifics of the two query engines.
  • Align and package features inside these two powerful SQL engines such that they can be prescribed and tailored to serve specific data warehouse workloads extremely efficiently. We prescribe Hive LLAP for ETL and Impala for BI workloads.
  • Introduce innovations/features more rapidly under the new UA framework to enable new use cases that push the boundaries of CDW’s value proposition.

 

What are the key features in Unified Analytics?

 

In simple terms - UA extends the capabilities of the two existing CDW SQL query engines, Hive and Impala, to support a broader set of EDW use cases. It not only offers a common front end for the engines, but also provides interoperability, understandability, and a simplified user experience. Figure 1 gives a quick overview of UA in CDW.

 

Figure 1: Unified Analytics in CDW

 

The key features in UA are described below:

 

  • Common SQL front end with pluggable query optimization

One of the primary capabilities offered by UA is the support for common SQL syntax and semantics across both Hive LLAP and Impala. This includes common DDL, DML, and other auxiliary statements. It also guarantees backwards compatibility for functions, data types, and configuration properties. Finally, the SQL front end is based on the ANSI SQL 2016 standard, meaning that it supports most mandatory and many optional features from the standard covering set operators, grouping sets, and complex subqueries.

 

Query optimization is now multi-phase: the first phase (logical plan creation) is common to both engines, and the second phase (physical plan creation) is specific to each engine's operators. In a follow-up blog we will give deeper insight into the optimization architecture. The first (common) phase is based on the Apache Calcite framework, which enables pluggable rules and cost-based optimization techniques that can be selectively injected depending on the engine in use. These optimizations cover a wide variety of areas such as join reordering, materialized view rewrites, rewrites based on integrity constraints, join aggregate pushdowns, constant folding, predicate simplification, column pruning, and many more. For the Impala engine specifically, many of these optimization techniques are net new additions, making it an even better choice for complex and large scale BI implementations on CDW.
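
To make the idea of pluggable rewrite rules more concrete, here is a minimal Python sketch of two of the rule types named above, constant folding and predicate simplification, applied to a toy expression tree. It is not CDW or Apache Calcite code: the `Col`, `Lit`, and `BinOp` classes and the `optimize` function are hypothetical names invented for illustration, and a real optimizer works on full relational plans rather than single expressions.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "+", "*", "=", "AND" in this toy model
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, BinOp]


def is_lit(e, value):
    """True only for a literal holding exactly this boolean object."""
    return isinstance(e, Lit) and e.value is value


def fold_constants(e):
    """Rule: evaluate operators whose inputs are both literals."""
    if isinstance(e, BinOp):
        left, right = fold_constants(e.left), fold_constants(e.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            if e.op == "+":
                return Lit(left.value + right.value)
            if e.op == "*":
                return Lit(left.value * right.value)
            if e.op == "=":
                return Lit(left.value == right.value)
            if e.op == "AND":
                return Lit(bool(left.value) and bool(right.value))
        return BinOp(e.op, left, right)
    return e


def simplify_predicates(e):
    """Rule: 'p AND TRUE' becomes 'p'; 'p AND FALSE' becomes FALSE."""
    if isinstance(e, BinOp):
        left, right = simplify_predicates(e.left), simplify_predicates(e.right)
        if e.op == "AND":
            if is_lit(left, True):
                return right
            if is_lit(right, True):
                return left
            if is_lit(left, False) or is_lit(right, False):
                return Lit(False)
        return BinOp(e.op, left, right)
    return e


def optimize(expr, rules):
    """Apply a caller-supplied (pluggable) list of rewrite rules in order."""
    for rule in rules:
        expr = rule(expr)
    return expr


# WHERE region = 'EU' AND 1 + 1 = 2
predicate = BinOp(
    "AND",
    BinOp("=", Col("region"), Lit("EU")),
    BinOp("=", BinOp("+", Lit(1), Lit(1)), Lit(2)),
)
print(optimize(predicate, [fold_constants, simplify_predicates]))
```

Passing the rule list as an argument is one simple way to model rules that are "selectively injected depending on the engine in use": each engine could supply its own list without changing the rules themselves.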

 

  • Consistency (ACID) guarantees

CDW already offers consistency guarantees at various levels through its ACID features. The UA project retains these capabilities and builds on them to provide ACID guarantees for row-level transactions (insert, update, delete, merge). These guarantees are provided as snapshot isolation on the Managed Table format. UA also continues to honor queries directed at the non-transactional External Table format. The functional support levels for each table format and engine in UA are summarized below:

Operation | File format | Impala        | Hive LLAP
READ      | ORC         | FULL ACID     | FULL ACID
READ      | Parquet     | INSERT ONLY   | INSERT ONLY
WRITE     | ORC         | NOT SUPPORTED | FULL ACID
WRITE     | Parquet     | INSERT ONLY   | INSERT ONLY

 

  • Intelligent data materializations - Materialized Views, Result Cache, and Data Sketches

As outlined earlier, one of the mandates for UA is to bolster the two engines towards their prescribed use cases: Impala for BI and Hive LLAP for ETL. On that front, many new features were added to Impala to strengthen its position as an industry leader for large scale BI use cases. Some of these features were described earlier as a collection of optimization techniques that help BI queries. In addition, three key features were added to help the BI cause. 

 

First amongst them is Materialized Views, a feature that helps with OLAP style queries that slice and dice grouped data sets. With UA, CDW offers Materialized Views in the Impala query engine (Hive already supports this feature). Materialized Views can accelerate query execution by orders of magnitude. They provide a framework for creating views of commonly occurring query patterns, which can be built on arbitrarily complex SQL constructs (joins, aggregations, subqueries). Once views are created, the CDW query optimizer looks for incoming queries that fully or partially match them and automatically rewrites those queries to use the matching materialized views, thereby short-circuiting query execution and returning results much faster from the materialized result set. Materialized views can be kept up to date as the underlying data changes by refreshing them at scheduled intervals.
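
The following Python sketch illustrates the principle behind a partial-match rewrite, under simplifying assumptions. The `SALES` rows, the `MaterializedView` class, and `run_query` are all hypothetical and exist only for this example; the sketch handles a single SUM measure, which can be safely re-aggregated from a finer-grained view, and says nothing about how the CDW optimizer actually matches or costs views.

```python
from collections import defaultdict

# Hypothetical base table rows.
SALES = [
    {"region": "EU", "month": "2022-01", "product": "a", "amount": 100},
    {"region": "EU", "month": "2022-02", "product": "b", "amount": 150},
    {"region": "US", "month": "2022-01", "product": "a", "amount": 200},
    {"region": "US", "month": "2022-01", "product": "b", "amount": 50},
]


def aggregate(rows, group_by, measure="amount"):
    """SELECT <group_by>, SUM(measure) ... GROUP BY <group_by> over raw rows."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[c] for c in group_by)] += row[measure]
    return dict(totals)


class MaterializedView:
    """Precomputed SUM(amount) grouped by a fixed set of columns."""

    def __init__(self, rows, group_by):
        self.group_by = tuple(group_by)
        self.data = aggregate(rows, self.group_by)

    def can_answer(self, query_group_by):
        # A coarser grouping can be derived from a finer one.
        return set(query_group_by) <= set(self.group_by)

    def answer(self, query_group_by):
        # Roll the stored totals up to the requested grouping.
        positions = [self.group_by.index(c) for c in query_group_by]
        totals = defaultdict(int)
        for key, total in self.data.items():
            totals[tuple(key[i] for i in positions)] += total
        return dict(totals)


def run_query(rows, group_by, views):
    """Use a matching view if one exists, else scan the base rows."""
    for view in views:
        if view.can_answer(group_by):
            return view.answer(group_by), "view"
    return aggregate(rows, group_by), "base"


view = MaterializedView(SALES, ["region", "month"])
result, source = run_query(SALES, ["region"], [view])
print(source, result)
```

The query groups by `region` only, while the view is grouped by `region` and `month`; because the query's grouping columns are a subset of the view's, the answer can be rolled up from the much smaller precomputed result instead of the base data.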

 

Second, building on the theme of returning quick results for repetitive queries originating from BI clients, UA will also support (in an upcoming release) a feature called Result Cache. Unlike materialized views, this feature does not require any construct to be created. Instead, it relies on exact-match queries to fetch results from a persistent cache, thus avoiding expensive query execution. The trade-off is that, again unlike materialized views, Result Cache is not used for partially matching queries.
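
As a rough illustration of exact-match caching, the Python sketch below keys cached results on the literal query text. Everything here is an assumption made for the example: the `ResultCache` class is hypothetical, an in-memory dictionary stands in for a persistent cache, and invalidation when the underlying data changes is deliberately left out.

```python
class ResultCache:
    """Toy exact-match result cache keyed on the query string."""

    def __init__(self, execute):
        self._execute = execute  # callable that actually runs a query
        self._store = {}         # stand-in for a persistent cache
        self.hits = 0
        self.misses = 0

    def run(self, sql):
        if sql in self._store:
            self.hits += 1
            return self._store[sql]
        self.misses += 1
        result = self._execute(sql)
        self._store[sql] = result
        return result


def slow_engine(sql):
    # Placeholder for an expensive query execution.
    return [("rows for", sql)]


cache = ResultCache(slow_engine)
cache.run("SELECT COUNT(*) FROM sales")
cache.run("SELECT COUNT(*) FROM sales")                     # served from cache
cache.run("SELECT COUNT(*) FROM sales WHERE region = 'EU'")  # different text: miss
print(cache.hits, cache.misses)
```

Because the key is the exact query text, the third query misses even though it overlaps with the first. That is the partial-match case that materialized views are designed to cover.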

 

Finally, UA now includes another feature built around materialized results: Data Sketches. As data volumes grow, it becomes increasingly difficult to return fast query results despite all the available advanced query optimization techniques. In CDW, a new feature called “BI Mode” offers relief on this front and is useful for queries that do not require 100% accurate answers, i.e. where approximations are acceptable. This feature, implemented using the Apache DataSketches framework, allows certain queries to obtain fast answers using materialized approximations, typically for aggregations such as COUNT DISTINCT, NTILE, RANK, and the like. The Data Sketches created with UA are compatible between the CDW SQL engines.
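
To give a feel for how a sketch can approximate COUNT DISTINCT, here is a small, self-contained Python example of a K-Minimum-Values (KMV) sketch. This is a textbook technique chosen for brevity; it does not use the Apache DataSketches library or reproduce its algorithms, and the `KMVSketch` class and its methods are hypothetical names. The estimator relies on the fact that if hashed values are spread uniformly over [0, 1), the k-th smallest hash of n distinct values lies near k / n.

```python
import hashlib
import heapq


class KMVSketch:
    """Keeps the k smallest hash values seen; memory is bounded by k."""

    def __init__(self, k=512):
        self.k = k
        self._mins = set()

    @staticmethod
    def _hash(value):
        # Map any value to a float in [0, 1) using a stable hash.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, value):
        h = self._hash(value)
        if h in self._mins:
            return  # duplicates do not change the sketch
        if len(self._mins) < self.k:
            self._mins.add(h)
            return
        largest = max(self._mins)
        if h < largest:
            self._mins.remove(largest)
            self._mins.add(h)

    def merge(self, other):
        """Combine two sketches without revisiting the raw data."""
        merged = KMVSketch(self.k)
        merged._mins = set(heapq.nsmallest(self.k, self._mins | other._mins))
        return merged

    def estimate(self):
        if len(self._mins) < self.k:
            return float(len(self._mins))  # fewer than k distinct values seen
        return (self.k - 1) / max(self._mins)


sketch = KMVSketch()
for i in range(5000):
    sketch.add(f"user-{i}")
    sketch.add(f"user-{i}")  # repeated values are ignored
print(round(sketch.estimate()))
```

The `merge` method shows why sketches suit precomputation: per-partition sketches can be stored ahead of time and combined at query time, so an approximate distinct count over many partitions never has to rescan the underlying rows.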

 

  • Security and Governance Integration

One of the key strengths of CDW, and of CDP more broadly, is the promise of a secure and governed service through its Shared Data Experience (SDX) feature set, which is production ready for demanding enterprise grade deployments. In UA, these characteristics are not just honored, but also extended further. 

The Apache Ranger security framework has been enhanced to provide parity between the two CDW SQL engines in key areas such as access control, row-level filtering, and column masking. These capabilities are critical in the BI/DW context as they are widely used in the industry. The governance framework using Apache Atlas is already available to CDW for full data lineage traceability. With UA, the lineage traceability has been extended to also include Materialized Views, so BI users can see the full lineage of a materialized view’s origins.

 

  • Utilities, tools, and platforms for enterprise grade BI

The UA project introduces and/or integrates with a slew of value added tools that significantly improve user experience and appeal. The key tools made available in the UA project are:

 

  • Recommender tool for Materialized Views

As explained, Materialized Views are extremely powerful, but it is not always easy for end users to model them. To make modeling simple, UA also introduces a utility called the Materialized View Recommender, which analyzes query patterns in a workload and recommends the exact materialized view to create for the recurring query patterns. Initially, the recommender will be CLI-based, but we will soon be adding a GUI through CDW’s existing tools.
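
One simple way to picture "analyzing query patterns in a workload" is to strip literals from each query and count how often the resulting shape recurs. The Python sketch below does only that; it is not the Materialized View Recommender and does not reflect its interface or logic. The `fingerprint` and `recommend` functions and the sample workload are hypothetical, and a real tool would also need to reason about joins, aggregates, and cost before proposing a view.

```python
import re
from collections import Counter


def fingerprint(sql):
    """Reduce a query to its shape by replacing literals with '?'."""
    s = sql.strip().lower()
    s = re.sub(r"'[^']*'", "?", s)           # string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)   # numeric literals
    s = re.sub(r"\s+", " ", s)               # collapse whitespace
    return s


def recommend(workload, min_count=2):
    """Return recurring query shapes, most frequent first."""
    counts = Counter(fingerprint(q) for q in workload)
    return [(shape, n) for shape, n in counts.most_common() if n >= min_count]


workload = [
    "SELECT region, SUM(amount) FROM sales WHERE year = 2021 GROUP BY region",
    "select region, sum(amount) from sales where year = 2022 group by region",
    "SELECT * FROM customers WHERE id = 7",
]
for shape, count in recommend(workload):
    print(count, shape)
```

In this sample workload, the two aggregate queries differ only in the year literal, so they collapse to one recurring shape and become a natural candidate for a materialized view, while the one-off lookup is ignored.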

 

  • Integration with CDW UI tools

The UA project is fully integrated with the CDW console (see Figure 2), which allows users to easily turn UA capabilities on or off. It is also supported through HUE, our SQL query editor. In the future, UA will work with Cloudera Data Visualization and the Cloudera Workload Manager tool to enable higher productivity and faster time to value.

Figure 2: UA integrated into CDW console

 

  • Impact on Performance

As you can see, the UA project packs a punch with its rich set of query optimization techniques. To analyze the performance impact of these optimizations, we ran the standardized TPC-DS benchmark in our labs. The early results are quite promising, especially for complex queries; for simpler queries, we currently see marginal improvements. With further tuning and optimization we expect more improvements across the board. When ready, we plan to publish the results as we have done here before.

 

UA is generally available in the public cloud version of CDW. It will be available by mid-year in the private cloud version of CDW. Please give it a try and let us know how it goes.


In conclusion, we consider the UA project, as outlined in this blog, to be a watershed in CDW’s journey towards market leadership. It starts out with the goal of simplifying the CDW service, making it easy for our end users to run their BI workloads on CDW. But the benefits go far beyond that. UA enables many new optimization techniques that will drive large scale, production-grade BI applications. We are beginning to publish blueprints for these applications through our CDP Patterns initiative, and we expect the UA capabilities to be a driving force behind adoption of CDP Patterns. Finally, UA improves the price/performance characteristics of CDW even further. In future blogs, we will outline how CDW end users can quickly deploy and use the UA features so that they can experience UA up close.