Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant
Announcements
This board is archived and read-only for historical reference. To ask a new question, please post a new topic on the appropriate active board.

Combinational agg over multi-dimensional table

avatar
New Member

Hello.

With this table,

CountryGenderEventUser
USFA#1
USFB#1
CAFA#2
INMA#3

It is simple to get 3-dimension olap cube data by

select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event

https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html

But it's result is not what I expected.

What I've already done by Hadoop-M/R Jobs is like below.

- every combination of dimensions should be calculated.

- each condition have two metric :

- count(*) AND count(distinct userId)

CountryGenderEventEvent CountUser Count
US--21
USF- 22
USFA11
USFB11
F32
FA11
FB11
A33
B11
CA11
CAF11
CAFA11
IN11
INM11
INMA11
M11
MA11
INA11

I have successfully did it with hadoop mapreduce.

But. As raw data size getting large MR job takes too long time.!!!!

( 300 millions logs takes 6 hours )

Of course it can be generated by multiple spark sql queries

- but we have 5~10 dimensions in real world.

- and It's cost seems not cheep two.

# 1 dimensions
select country, count(*), count(distinct userId) from TABLE group by country
select gender, count(*), count(distinct userId) from TABLE group by gender
select event, count(*), count(distinct userId) from TABLE group by event

# 2 dimensions
select country, event, count(*), count(distinct userId) from TABLE group by country, event
select country, gender, count(*), count(distinct userId) from TABLE group by country, gender
select gender, event, count(*), count(distinct userId) from TABLE group by gender, event

# 3 dimensions
select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event

So I consider to convert this MR job to Spark's something.

I have never found some examples fit to solve this problem.

Is there any good reference about this kind works? Any Idea?

( I know Druid provide similar function to this but is is not an option for us for some reasons.)

Thanks!

1 ACCEPTED SOLUTION

avatar
Super Collaborator

Add "WITH ROLLUP" or "WITH CUBE" to the end of the query, like:

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH ROLLUP

or

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH CUBE

View solution in original post

4 REPLIES 4

avatar
Super Collaborator

Add "WITH ROLLUP" or "WITH CUBE" to the end of the query, like:

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH ROLLUP

or

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH CUBE

avatar
New Member

OMG.

"With Rollup" is definitely what I need.

I have never seen this query option.

Looks almost sql engines supports that.

Great thanks.

avatar
New Member

By any chance,

Does OLAP functions(cube,rollup,grouping sets) supports only HiveSqlContext in SparkSQL?

scala> sql=new org.apache.spark.sql.SQLContext(sc)
scala> var json_df = sql.read.json("/data/jpl/band/2016/10/06/raw_json_tiny/*/*")
scala > json_df.registerTempTable("json_tb")
scala> var result = sql.sql("select service_id, product, os_name, action_id, classifier, language, country, app_ver, count(*), count(distinct device_id), count(distinct user_key) from json_tb group by service_id, product, os_name, action_id, classifier, language, country, app_ver with rollup")

java.lang.RuntimeException: [1.253] failure: ``union'' expected but `with' found

avatar
New Member

With spark 1.6

rollup/cube/groupingsets are not possible by query syntax but possible dataframe api.

this works.

var agg_result = json_df.select("scene_id","action_id","classifier","country","os_name","app_ver","user_key","device_id").cube("scene_id","action_id","classifier","country","os_name","app_ver").agg(count("user_key"), countDistinct("user_key"), countDistinct("device_id"))