Archives of Support Questions (Read Only)

jeesim2 · ‎10-06-2016

Hello.

With this table,

Country	Gender	Event	User
US	F	A	#1
US	F	B	#1
CA	F	A	#2
IN	M	A	#3

It is simple to get 3-dimension olap cube data by

select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event

https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html

But it's result is not what I expected.

What I've already done by Hadoop-M/R Jobs is like below.

- every combination of dimensions should be calculated.

- each condition have two metric :

- count(*) AND count(distinct userId)

Country	Gender	Event	Event Count	User Count
US	-	-	2	1
US	F	-	2	2
US	F	A	1	1
US	F	B	1	1
	F		3	2
	F	A	1	1
	F	B	1	1
		A	3	3
		B	1	1
CA			1	1
CA	F		1	1
CA	F	A	1	1
IN			1	1
IN	M		1	1
IN	M	A	1	1
	M		1	1
	M	A	1	1
IN		A	1	1

I have successfully did it with hadoop mapreduce.

But. As raw data size getting large MR job takes too long time.!!!!

( 300 millions logs takes 6 hours )

Of course it can be generated by multiple spark sql queries

- but we have 5~10 dimensions in real world.

- and It's cost seems not cheep two.

# 1 dimensions
select country, count(*), count(distinct userId) from TABLE group by country
select gender, count(*), count(distinct userId) from TABLE group by gender
select event, count(*), count(distinct userId) from TABLE group by event

# 2 dimensions
select country, event, count(*), count(distinct userId) from TABLE group by country, event
select country, gender, count(*), count(distinct userId) from TABLE group by country, gender
select gender, event, count(*), count(distinct userId) from TABLE group by gender, event

# 3 dimensions
select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event

So I consider to convert this MR job to Spark's something.

I have never found some examples fit to solve this problem.

Is there any good reference about this kind works? Any Idea?

( I know Druid provide similar function to this but is is not an option for us for some reasons.)

Thanks!

clukasik · ‎10-06-2016

Add "WITH ROLLUP" or "WITH CUBE" to the end of the query, like:

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH ROLLUP

or

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH CUBE

View solution in original post

clukasik · ‎10-06-2016

Add "WITH ROLLUP" or "WITH CUBE" to the end of the query, like:

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH ROLLUP

or

select country, gender,event, count(*), count(distinct userId)from TABLE groupby country, gender,event WITH CUBE

jeesim2 · ‎10-06-2016

OMG.

"With Rollup" is definitely what I need.

I have never seen this query option.

Looks almost sql engines supports that.

Great thanks.

jeesim2 · ‎10-07-2016

By any chance,

Does OLAP functions(cube,rollup,grouping sets) supports only HiveSqlContext in SparkSQL?

scala> sql=new org.apache.spark.sql.SQLContext(sc)
scala> var json_df = sql.read.json("/data/jpl/band/2016/10/06/raw_json_tiny/*/*")
scala > json_df.registerTempTable("json_tb")
scala> var result = sql.sql("select service_id, product, os_name, action_id, classifier, language, country, app_ver, count(*), count(distinct device_id), count(distinct user_key) from json_tb group by service_id, product, os_name, action_id, classifier, language, country, app_ver with rollup")

java.lang.RuntimeException: [1.253] failure: ``union'' expected but `with' found

jeesim2 · ‎10-07-2016

With spark 1.6

rollup/cube/groupingsets are not possible by query syntax but possible dataframe api.

this works.

var agg_result = json_df.select("scene_id","action_id","classifier","country","os_name","app_ver","user_key","device_id").cube("scene_id","action_id","classifier","country","os_name","app_ver").agg(count("user_key"), countDistinct("user_key"), countDistinct("device_id"))

Cloudera Community

Archives of Support Questions (Read Only)

Combinational agg over multi-dimensional table