Member since: 04-25-2016
Posts: 19
Kudos Received: 4
Solutions: 0
10-07-2016
02:42 AM
With Spark 1.6, rollup/cube/grouping sets are not possible through the SQL query syntax, but they are possible through the DataFrame API. This works (count and countDistinct come from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.{count, countDistinct}

var agg_result = json_df
  .select("scene_id", "action_id", "classifier", "country", "os_name", "app_ver", "user_key", "device_id")
  .cube("scene_id", "action_id", "classifier", "country", "os_name", "app_ver")
  .agg(count("user_key"), countDistinct("user_key"), countDistinct("device_id"))
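One note on reading the output: in the DataFrame cube result, rolled-up dimensions come back as null, so the grand-total row is the one where every grouping column is null. A small sketch of pulling it out, assuming the agg_result DataFrame above (the grandTotal name is just illustrative):

import org.apache.spark.sql.functions.col

// Rolled-up dimensions appear as null in the cube output, so the row where every
// grouping column is null is the grand total over all combinations.
// Caveat: in Spark 1.6 the DataFrame API has no grouping_id, so genuine nulls in the
// source data are indistinguishable from subtotal nulls here.
val grandTotal = agg_result.filter(
  col("scene_id").isNull && col("action_id").isNull && col("classifier").isNull &&
  col("country").isNull && col("os_name").isNull && col("app_ver").isNull)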
10-07-2016
01:49 AM
By any chance, are the OLAP functions (cube, rollup, grouping sets) supported only by the HiveContext in Spark SQL?
scala> val sql = new org.apache.spark.sql.SQLContext(sc)
scala> var json_df = sql.read.json("/data/jpl/band/2016/10/06/raw_json_tiny/*/*")
scala> json_df.registerTempTable("json_tb")
scala> var result = sql.sql("select service_id, product, os_name, action_id, classifier, language, country, app_ver, count(*), count(distinct device_id), count(distinct user_key) from json_tb group by service_id, product, os_name, action_id, classifier, language, country, app_ver with rollup")
java.lang.RuntimeException: [1.253] failure: ``union'' expected but `with' found
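For reference: in Spark 1.6 the ROLLUP/CUBE/GROUPING SETS syntax is understood by the HiveQL parser (HiveContext) but not by the plain SQLContext parser, which matches the error above. A rough sketch of the HiveContext variant, reusing the path and query from this post (the hiveCtx name is illustrative, it assumes a Spark build with Hive support, and whether DISTINCT aggregates combine with ROLLUP in this version is not verified here):

import org.apache.spark.sql.hive.HiveContext

// HiveContext uses the HiveQL parser, which accepts GROUP BY ... WITH ROLLUP.
val hiveCtx = new HiveContext(sc)  // sc is the SparkContext provided by spark-shell
val json_df = hiveCtx.read.json("/data/jpl/band/2016/10/06/raw_json_tiny/*/*")
json_df.registerTempTable("json_tb")

val result = hiveCtx.sql(
  """select service_id, product, os_name, action_id, classifier, language, country, app_ver,
    |       count(*), count(distinct device_id), count(distinct user_key)
    |from json_tb
    |group by service_id, product, os_name, action_id, classifier, language, country, app_ver
    |with rollup""".stripMargin)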
10-06-2016
01:55 PM
OMG. "WITH ROLLUP" is definitely what I need. I had never seen this query option. It looks like almost all SQL engines support it. Great, thanks.
10-06-2016
08:47 AM
Hello. With this table:

Country | Gender | Event | User
US      | F      | A     | #1
US      | F      | B     | #1
CA      | F      | A     | #2
IN      | M      | A     | #3

it is simple to get 3-dimension OLAP cube data with:

select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event
https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html
But its result is not what I expected. What I have already done with Hadoop M/R jobs is like below:
- every combination of dimensions should be calculated
- each combination has two metrics: count(*) and count(distinct userId)

Country | Gender | Event | Event Count | User Count
US      | -      | -     | 2           | 1
US      | F      | -     | 2           | 2
US      | F      | A     | 1           | 1
US      | F      | B     | 1           | 1
-       | F      | -     | 3           | 2
-       | F      | A     | 1           | 1
-       | F      | B     | 1           | 1
-       | -      | A     | 3           | 3
-       | -      | B     | 1           | 1
CA      | -      | -     | 1           | 1
CA      | F      | -     | 1           | 1
CA      | F      | A     | 1           | 1
IN      | -      | -     | 1           | 1
IN      | M      | -     | 1           | 1
IN      | M      | A     | 1           | 1
-       | M      | -     | 1           | 1
-       | M      | A     | 1           | 1
IN      | -      | A     | 1           | 1

I have successfully done this with Hadoop MapReduce, but as the raw data grows the MR job takes too long (300 million logs take about 6 hours). Of course the same result can be generated by multiple Spark SQL queries, but we have 5~10 dimensions in the real world, and that cost does not seem cheap either.

# 1 dimension
select country, count(*), count(distinct userId) from TABLE group by country
select gender, count(*), count(distinct userId) from TABLE group by gender
select event, count(*), count(distinct userId) from TABLE group by event
# 2 dimensions
select country, event, count(*), count(distinct userId) from TABLE group by country, event
select country, gender, count(*), count(distinct userId) from TABLE group by country, gender
select gender, event, count(*), count(distinct userId) from TABLE group by gender, event
# 3 dimensions
select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event

So I am considering converting this MR job to something in Spark, but I have not found any example that fits this problem. Is there a good reference for this kind of work? Any ideas? (I know Druid provides a similar function, but it is not an option for us for several reasons.) Thanks!
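Following up on the CUBE/ROLLUP direction from the later replies above: the DataFrame cube API covers every dimension combination in one pass. A minimal sketch, assuming a Spark 1.6 DataFrame named logs with the columns from the example table (logs and the column names are illustrative):

import org.apache.spark.sql.functions.{count, countDistinct, lit}

// cube() emits one output row per combination of the grouping columns,
// including all subtotal rows; rolled-up dimensions come back as null.
val cubeResult = logs
  .cube("country", "gender", "event")
  .agg(count(lit(1)).as("event_count"),
       countDistinct("userId").as("user_count"))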
Labels:
- Apache Spark
06-14-2016
08:55 AM
It turned out to be a log level problem (it had been set to DEBUG). I will also try to increase the mapper count (each container's heap size is set high enough). Thanks @Benjamin Leonhardi. By the way, is there any document that explains each counter name and value in detail?
06-14-2016
05:22 AM
Hello. I have a query like:
SELECT
a
, b
, c
, d
, count(1) e
FROM
XXX
WHERE
created_date = '20160613'
AND a is not null
GROUP BY
a
, b
, c
, d

It took 17 minutes in total, which is several times longer than I expected. This is the result of the Tez DAG: most of the time was spent in the Reduce stage (15 minutes). And this is the task counter of one of the reducer tasks. With this data, could you please point out the time-consuming factor of the reduce task? Will simply giving it more reducers reduce the total DAG time? I am using Hive 1.2, Hadoop 2, and Tez 0.6, and the Hive table format is AVRO.
Labels:
- Apache Hive
05-10-2016
01:20 AM
Using Apache Ambari version 2.2.1.0!
05-10-2016
01:18 AM
It works perfectly!
Thanks a lot.
04-25-2016
12:35 AM
Hi. I've found that with Ambari the hive.aux.jars.path configuration behaves very strangely. There are two problems.

1. Custom hive-site configuration does not work: even though I add "hive.aux.jars.path" to the hive-site configuration, it is ignored.

2. hive-env.sh template editing: if I add this line to the hive-env.sh template

export HIVE_AUX_JARS_PATH="$HIVE_AUX_JARS_PATH,hdfs:///user/elasticsearch/elasticsearch-hadoop-2.3.0.jar"

the Hive server starts with a wrong hive.aux.jars.path:

$ ps -ef | grep java | grep erver2
hive 25116 1 2 21:38 ? 00:00:26 /usr/jdk64/jdk1.8.0_60/bin/java -Xmx1024m -Dhdp ........ org.apache.hive.service.server.HiveServer2 --hiveconf hive.aux.jars.path=file:///usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar ,file:// hdfs,file://///user/band_dev/elasticsearch/elasticsearch-hadoop-2.3.0.jar

I have tried many configurations but no luck. Why does Ambari turn hdfs:///user/elasticsearch/elasticsearch-hadoop-2.3.0.jar into file:///usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar ,file:// hdfs,file://///user/band_dev/elasticsearch/elasticsearch-hadoop-2.3.0.jar ? Looking closely, the part ",file:// hdfs,file://///user/b..." should probably be ",hdfs:///user/b...". If I cannot set hive.aux.jars.path in the hive-site configuration, I at least want hive-env.sh to work with an HDFS location; then I would not have to place the jar on every Hive server (with HA configured). Any idea?
Labels:
- Apache Ambari
- Apache Hive