Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

Combinational agg over multi-dimensional table

Solved Go to solution
Highlighted

Combinational agg over multi-dimensional table

Explorer

Hello.

With this table,

CountryGenderEventUser
USFA#1
USFB#1
CAFA#2
INMA#3

It is simple to get 3-dimension olap cube data by

select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event

https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html

But it's result is not what I expected.

What I've already done by Hadoop-M/R Jobs is like below.

- every combination of dimensions should be calculated.

- each condition have two metric :

- count(*) AND count(distinct userId)

CountryGenderEventEvent CountUser Count
US--21
USF- 22
USFA11
USFB11
F32
FA11
FB11
A33
B11
CA11
CAF11
CAFA11
IN11
INM11
INMA11
M11
MA11
INA11

I have successfully did it with hadoop mapreduce.

But. As raw data size getting large MR job takes too long time.!!!!

( 300 millions logs takes 6 hours )

Of course it can be generated by multiple spark sql queries

- but we have 5~10 dimensions in real world.

- and It's cost seems not cheep two.

# 1 dimensions
select country, count(*), count(distinct userId) from TABLE group by country
select gender, count(*), count(distinct userId) from TABLE group by gender
select event, count(*), count(distinct userId) from TABLE group by event

# 2 dimensions
select country, event, count(*), count(distinct userId) from TABLE group by country, event
select country, gender, count(*), count(distinct userId) from TABLE group by country, gender
select gender, event, count(*), count(distinct userId) from TABLE group by gender, event

# 3 dimensions
select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event

So I consider to convert this MR job to Spark's something.

I have never found some examples fit to solve this problem.

Is there any good reference about this kind works? Any Idea?

( I know Druid provide similar function to this but is is not an option for us for some reasons.)

Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted

Re: Combinational agg over multi-dimensional table

Expert Contributor

Add "WITH ROLLUP" or "WITH CUBE" to the end of the query, like:

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH ROLLUP

or

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH CUBE

View solution in original post

4 REPLIES 4
Highlighted

Re: Combinational agg over multi-dimensional table

Expert Contributor

Add "WITH ROLLUP" or "WITH CUBE" to the end of the query, like:

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH ROLLUP

or

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH CUBE

View solution in original post

Highlighted

Re: Combinational agg over multi-dimensional table

Explorer

OMG.

"With Rollup" is definitely what I need.

I have never seen this query option.

Looks almost sql engines supports that.

Great thanks.

Highlighted

Re: Combinational agg over multi-dimensional table

Explorer

By any chance,

Does OLAP functions(cube,rollup,grouping sets) supports only HiveSqlContext in SparkSQL?

scala> sql=new org.apache.spark.sql.SQLContext(sc)
scala> var json_df = sql.read.json("/data/jpl/band/2016/10/06/raw_json_tiny/*/*")
scala > json_df.registerTempTable("json_tb")
scala> var result = sql.sql("select service_id, product, os_name, action_id, classifier, language, country, app_ver, count(*), count(distinct device_id), count(distinct user_key) from json_tb group by service_id, product, os_name, action_id, classifier, language, country, app_ver with rollup")

java.lang.RuntimeException: [1.253] failure: ``union'' expected but `with' found
Highlighted

Re: Combinational agg over multi-dimensional table

Explorer

With spark 1.6

rollup/cube/groupingsets are not possible by query syntax but possible dataframe api.

this works.

var agg_result = json_df.select("scene_id","action_id","classifier","country","os_name","app_ver","user_key","device_id").cube("scene_id","action_id","classifier","country","os_name","app_ver").agg(count("user_key"), countDistinct("user_key"), countDistinct("device_id"))
Don't have an account?
Coming from Hortonworks? Activate your account here