Created 10-06-2016 08:47 AM
Hello.
With this table,
Country | Gender | Event | User
US      | F      | A     | #1
US      | F      | B     | #1
CA      | F      | A     | #2
IN      | M      | A     | #3
it is simple to get 3-dimensional OLAP cube data with:

select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event

(see https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html)
But its result is not what I expected.
What I've already done with Hadoop M/R jobs is as below:
- every combination of dimensions should be calculated;
- each combination has two metrics: count(*) and count(distinct userId).
Country | Gender | Event | Event Count | User Count
US      | -      | -     | 2           | 1
US      | F      | -     | 2           | 2
US      | F      | A     | 1           | 1
US      | F      | B     | 1           | 1
-       | F      | -     | 3           | 2
-       | F      | A     | 2           | 2
-       | F      | B     | 1           | 1
-       | -      | A     | 3           | 3
-       | -      | B     | 1           | 1
CA      | -      | -     | 1           | 1
CA      | F      | -     | 1           | 1
CA      | F      | A     | 1           | 1
IN      | -      | -     | 1           | 1
IN      | M      | -     | 1           | 1
IN      | M      | A     | 1           | 1
-       | M      | -     | 1           | 1
-       | M      | A     | 1           | 1
IN      | -      | A     | 1           | 1
I have done this successfully with Hadoop MapReduce.
But as the raw data size gets large, the MR job takes too long
(300 million logs take about 6 hours).
Of course it can be generated by multiple Spark SQL queries,
- but we have 5~10 dimensions in the real world,
- and the cost doesn't seem cheap either: for n dimensions that is 2^n - 1 separate group-by queries, each scanning the full table.
# 1 dimension
select country, count(*), count(distinct userId) from TABLE group by country
select gender, count(*), count(distinct userId) from TABLE group by gender
select event, count(*), count(distinct userId) from TABLE group by event

# 2 dimensions
select country, event, count(*), count(distinct userId) from TABLE group by country, event
select country, gender, count(*), count(distinct userId) from TABLE group by country, gender
select gender, event, count(*), count(distinct userId) from TABLE group by gender, event

# 3 dimensions
select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event
So I'm considering converting this MR job to something in Spark.
I haven't found any examples that fit this problem.
Is there any good reference for this kind of work? Any ideas?
(I know Druid provides similar functionality, but it is not an option for us for several reasons.)
Thanks!
Created 10-06-2016 01:29 PM
Add "WITH ROLLUP" or "WITH CUBE" to the end of the query, like:
select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event WITH ROLLUP
or
select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event WITH CUBE
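The two differ in which subtotals they emit: ROLLUP produces hierarchical subtotals ((country, gender, event), (country, gender), (country), plus the grand total), while CUBE produces subtotals for every subset of the grouped columns, which is what the "every combination of dimensions" table in the question asks for. A minimal sketch of the same two aggregations through the DataFrame API, where df is a hypothetical DataFrame holding the question's toy table:

import org.apache.spark.sql.functions.{count, countDistinct}

// Hierarchical subtotals: (country, gender, event), (country, gender),
// (country), and the grand total.
val rolled = df.rollup("country", "gender", "event")
  .agg(count("userId"), countDistinct("userId"))  // userId has no nulls here, so count(userId) == count(*)

// Subtotals for all 2^3 = 8 subsets of the grouped columns.
val cubed = df.cube("country", "gender", "event")
  .agg(count("userId"), countDistinct("userId"))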
Created 10-06-2016 01:55 PM
OMG.
"With Rollup" is definitely what I need.
I have never seen this query option.
Looks almost sql engines supports that.
Great thanks.
Created 10-07-2016 01:49 AM
By any chance,
are OLAP functions (cube, rollup, grouping sets) supported only by HiveContext in Spark SQL?

scala> val sql = new org.apache.spark.sql.SQLContext(sc)
scala> var json_df = sql.read.json("/data/jpl/band/2016/10/06/raw_json_tiny/*/*")
scala> json_df.registerTempTable("json_tb")
scala> var result = sql.sql("select service_id, product, os_name, action_id, classifier, language, country, app_ver, count(*), count(distinct device_id), count(distinct user_key) from json_tb group by service_id, product, os_name, action_id, classifier, language, country, app_ver with rollup")
java.lang.RuntimeException: [1.253] failure: ``union'' expected but `with' found
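For reference, a minimal sketch of the Hive-flavoured route the question asks about, on the (unverified here) assumption that the HiveQL parser accepts WITH ROLLUP where the plain SQLContext parser does not:

scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hc.read.json("/data/jpl/band/2016/10/06/raw_json_tiny/*/*").registerTempTable("json_tb")
scala> hc.sql("select country, count(*), count(distinct user_key) from json_tb group by country with rollup")  // parsed by HiveQL, not the SQLContext parser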
Created 10-07-2016 02:42 AM
With Spark 1.6,
rollup/cube/grouping sets are not available through the SQL query syntax, but they are available through the DataFrame API.
This works:

import org.apache.spark.sql.functions.{count, countDistinct}

var agg_result = json_df
  .select("scene_id", "action_id", "classifier", "country", "os_name", "app_ver", "user_key", "device_id")
  .cube("scene_id", "action_id", "classifier", "country", "os_name", "app_ver")
  .agg(count("user_key"), countDistinct("user_key"), countDistinct("device_id"))
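One follow-up on reading the result (assuming the grouped columns are strings): in the cube output, rolled-up dimensions come back as null, so filling them with a placeholder reproduces the "-" cells from the table in the question:

// Rolled-up dimensions are null in the cube output; replace them with "-"
// so the result reads like the expected table above.
val labeled = agg_result.na.fill("-",
  Seq("scene_id", "action_id", "classifier", "country", "os_name", "app_ver"))
labeled.show()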