Support Questions

Combinational agg over multi-dimensional table

Contributor

Hello.

Given this table:

Country  Gender  Event  User
US       F       A      #1
US       F       B      #1
CA       F       A      #2
IN       M       A      #3

It is simple to get a 3-dimension aggregation with:

select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event

https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html

But its result is not what I expected.
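A plain GROUP BY returns only the fully specified combinations and never the subtotal rows. A minimal Python sketch (plain tuples standing in for the table, for illustration only) shows what the query above yields on the sample data:

```python
from collections import defaultdict

# Sample table from the question
rows = [
    ("US", "F", "A", "#1"),
    ("US", "F", "B", "#1"),
    ("CA", "F", "A", "#2"),
    ("IN", "M", "A", "#3"),
]

# Equivalent of: GROUP BY country, gender, event
groups = defaultdict(list)
for country, gender, event, user in rows:
    groups[(country, gender, event)].append(user)

# (count(*), count(distinct userId)) per group
result = {k: (len(v), len(set(v))) for k, v in groups.items()}
```

Only the four fully specified groups come back; there are no subtotal rows such as (US, any gender, any event).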

What I've already done with Hadoop M/R jobs is as below:

- every combination of dimensions should be calculated;

- each combination has two metrics: count(*) AND count(distinct userId).

Country  Gender  Event  Event Count  User Count
US       -       -      2            1
US       F       -      2            1
US       F       A      1            1
US       F       B      1            1
-        F       -      3            2
-        F       A      2            2
-        F       B      1            1
-        -       A      3            3
-        -       B      1            1
CA       -       -      1            1
CA       F       -      1            1
CA       F       A      1            1
IN       -       -      1            1
IN       M       -      1            1
IN       M       A      1            1
-        M       -      1            1
-        M       A      1            1
IN       -       A      1            1

I have successfully done this with Hadoop MapReduce.

But as the raw data grows, the MR job takes too long (300 million logs take about 6 hours).

Of course, it can be generated by multiple Spark SQL queries:

- but we have 5-10 dimensions in the real world;

- and the cost doesn't seem cheap either.

# 1 dimensions
select country, count(*), count(distinct userId) from TABLE group by country
select gender, count(*), count(distinct userId) from TABLE group by gender
select event, count(*), count(distinct userId) from TABLE group by event

# 2 dimensions
select country, event, count(*), count(distinct userId) from TABLE group by country, event
select country, gender, count(*), count(distinct userId) from TABLE group by country, gender
select gender, event, count(*), count(distinct userId) from TABLE group by gender, event

# 3 dimensions
select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event
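The full set of combinations the MR job produces can be sketched in plain Python by running one group-by per subset of the dimensions; that is exactly what a CUBE aggregation computes. The sketch below (an illustration only, using the sample table from the question) rolls absent dimensions up as None:

```python
from collections import defaultdict
from itertools import combinations

rows = [
    ("US", "F", "A", "#1"),
    ("US", "F", "B", "#1"),
    ("CA", "F", "A", "#2"),
    ("IN", "M", "A", "#3"),
]
dims = ("country", "gender", "event")

# CUBE semantics: one GROUP BY per subset of the dimensions (2**3 = 8 sets)
cube = {}
for r in range(len(dims) + 1):
    for keep in combinations(range(len(dims)), r):
        groups = defaultdict(list)
        for row in rows:
            # Dimensions not in `keep` are rolled up (shown as None)
            key = tuple(row[i] if i in keep else None for i in range(len(dims)))
            groups[key].append(row[3])  # userId
        for key, users in groups.items():
            # (count(*), count(distinct userId)) per combination
            cube[key] = (len(users), len(set(users)))
```

On this sample the cube contains 22 rows, including subtotals such as (US, None, None) -> (2, 1) and the grand total (None, None, None) -> (4, 3).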

So I'm considering converting this MR job to something in Spark.

I haven't found any examples that fit this problem.

Is there any good reference for this kind of work? Any ideas?

(I know Druid provides similar functionality, but it is not an option for us for some reasons.)

Thanks!

1 ACCEPTED SOLUTION

Super Collaborator

Add "WITH ROLLUP" or "WITH CUBE" to the end of the query, like:

select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event WITH ROLLUP

or

select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event WITH CUBE
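Note the difference between the two: ROLLUP generates only the hierarchical prefixes of the listed dimensions (n + 1 grouping sets), while CUBE generates every subset (2**n sets). A small Python sketch of the grouping sets each one produces:

```python
from itertools import combinations

dims = ("country", "gender", "event")

# ROLLUP: hierarchical prefixes -> n + 1 grouping sets
rollup_sets = [dims[:i] for i in range(len(dims), -1, -1)]
# e.g. ('country','gender','event'), ('country','gender'), ('country',), ()

# CUBE: every subset of the dimensions -> 2**n grouping sets
cube_sets = [subset
             for r in range(len(dims), -1, -1)
             for subset in combinations(dims, r)]
```

Since the expected table in the question includes rows like IN / - / A (country and event without gender), CUBE matches that requirement; ROLLUP alone would skip such combinations.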


4 REPLIES 4


Contributor

OMG.

"WITH ROLLUP" is definitely what I need.

I had never seen this query option.

It looks like almost all SQL engines support it.

Thanks a lot!

Contributor

By any chance, are the OLAP functions (cube, rollup, grouping sets) supported only by HiveContext in Spark SQL?

scala> val sql = new org.apache.spark.sql.SQLContext(sc)
scala> var json_df = sql.read.json("/data/jpl/band/2016/10/06/raw_json_tiny/*/*")
scala> json_df.registerTempTable("json_tb")
scala> var result = sql.sql("select service_id, product, os_name, action_id, classifier, language, country, app_ver, count(*), count(distinct device_id), count(distinct user_key) from json_tb group by service_id, product, os_name, action_id, classifier, language, country, app_ver with rollup")

java.lang.RuntimeException: [1.253] failure: ``union'' expected but `with' found

Contributor

With Spark 1.6,

rollup/cube/grouping sets are not possible via the SQL query syntax, but they are possible via the DataFrame API.

This works:

var agg_result = json_df
  .select("scene_id", "action_id", "classifier", "country", "os_name", "app_ver", "user_key", "device_id")
  .cube("scene_id", "action_id", "classifier", "country", "os_name", "app_ver")
  .agg(count("user_key"), countDistinct("user_key"), countDistinct("device_id"))