Community Articles

Find and share helpful community-sourced technical articles.
Celebrating as our community reaches 100,000 members! Thank you!
Labels (1)
Cloudera Employee

With Stochastic Streaming Algorithms it's possible to analyze highly efficient and performant massive amounts of data. The main characteristics of sketches are very small size (sub-lineare space consumption), single-pass, mergeable and deliver approximate results in proven error bounds. If an approximate result is acceptable, then it is possible that data sketches are orders-of-magnitude faster compared to classic indexing or aggregation techniques. (I don't go into the mathematical discussion, more details here DataSketches)

The following example shows how to use Data Sketches within Hive (also possible with Druid):

  • Download and install Hortonworks Sandbox (it comes with demo database: foodmart)
  • Download Java artifact sketches-hive-0.11.0-with-shaded-core.jar (provided by Yahoo) here jar repo and copy the jar file into HDFS /apps/hive/warehouse/lib directory
#Login as hive user hive@sandbox-hdp

wget -O sketches-hive-0.11.0-with-shaded-core.jar

hdfs dfs -mkdir /apps/hive/warehouse/lib 
hdfs dfs -put sketches-hive-0.11.0-with-shaded-core.jar /apps/hive/warehouse/lib/
  • Create functions (here permanently) to create and manage the sketches with CLI beeline or Data Analytics Studio (DAS)
beeline -u "jdbc:hive2://;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2" -n hive --force=true
set hivevar:hivesketch_jar=hdfs:///apps/hive/warehouse/lib/sketches-hive-0.11.0-with-shaded-core.jar;
create database sketches;
use sketches; 
create function qtl_data2sketch as '' USING JAR '${hivesketch_jar}' ;
create function qtl_getQuantile as '' USING JAR '${hivesketch_jar}';

  • Create a data sketch for product and price with the function qtl_data2sketch. The table product_sketch has two columns product_id that contains the products sold and a secondary binary field for the data sketch itself.
use foodmart;
create table product_sketch (product_id double, qtl_sketch binary);
insert into product_sketch select cast(product_id as double), sketches.qtl_data2sketch(cast(store_sales as double)) from sales_fact_1997 group by product_id;

The table product_sketch stores for every product the price sketch of the year (1997). The data sketch of the product can now retrieved i.e. to get a quantile of the data below 25%.

select product_id, sketches.qtl_getQuantile(qtl_sketch, 0.25)  from product_sketch limit 3;
Product25% Quantile

Use in a more advanced example i.e. to find sales where product revenue was below the 25% quantile.

select sales.product_id, sum(store_sales * unit_sales ) sum_sales, sum( unit_sales) units, store_sales, sketches.qtl_getQuantile(qtl_sketch, 0.25) 25pct
from sales_fact_1997 sales
join product_sketch
on (product_sketch.product_id = sales.product_id
and store_sales < sketches.qtl_getQuantile(qtl_sketch, 0.25))
group by sales.product_id , store_sales, sketches.qtl_getQuantile(qtl_sketch, 0.25)
order by sum_sales desc
limit 10;

Table shows the top 3 products and revenue where the price is lower 25 % quantile of the price span (per product).