Member since: 04-25-2016
Posts: 19
Kudos Received: 4
Solutions: 0
10-07-2016
02:42 AM
With Spark 1.6, rollup/cube/grouping sets are not possible through the SQL query syntax, but they are possible through the DataFrame API. This works (count and countDistinct come from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.{count, countDistinct}

var agg_result = json_df
  .select("scene_id", "action_id", "classifier", "country", "os_name", "app_ver", "user_key", "device_id")
  .cube("scene_id", "action_id", "classifier", "country", "os_name", "app_ver")
  .agg(count("user_key"), countDistinct("user_key"), countDistinct("device_id"))
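One note on reading the output: in the DataFrame cube result, rolled-up dimensions come back as null, so the grand-total row is the one where every grouping column is null. A small sketch of pulling it out, assuming the agg_result DataFrame above (the grandTotal name is just illustrative):

import org.apache.spark.sql.functions.col

// Rolled-up dimensions appear as null in the cube output, so the row where every
// grouping column is null is the grand total over all combinations.
// Caveat: in Spark 1.6 the DataFrame API has no grouping_id, so genuine nulls in the
// source data are indistinguishable from subtotal nulls here.
val grandTotal = agg_result.filter(
  col("scene_id").isNull && col("action_id").isNull && col("classifier").isNull &&
  col("country").isNull && col("os_name").isNull && col("app_ver").isNull)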
10-07-2016
01:49 AM
By any chance, are the OLAP functions (cube, rollup, grouping sets) supported only by the HiveContext in Spark SQL?
scala> val sql = new org.apache.spark.sql.SQLContext(sc)
scala> var json_df = sql.read.json("/data/jpl/band/2016/10/06/raw_json_tiny/*/*")
scala> json_df.registerTempTable("json_tb")
scala> var result = sql.sql("select service_id, product, os_name, action_id, classifier, language, country, app_ver, count(*), count(distinct device_id), count(distinct user_key) from json_tb group by service_id, product, os_name, action_id, classifier, language, country, app_ver with rollup")
java.lang.RuntimeException: [1.253] failure: ``union'' expected but `with' found
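For reference: in Spark 1.6 the ROLLUP/CUBE/GROUPING SETS syntax is understood by the HiveQL parser (HiveContext) but not by the plain SQLContext parser, which matches the error above. A rough sketch of the HiveContext variant, reusing the path and query from this post (the hiveCtx name is illustrative, it assumes a Spark build with Hive support, and whether DISTINCT aggregates combine with ROLLUP in this version is not verified here):

import org.apache.spark.sql.hive.HiveContext

// HiveContext uses the HiveQL parser, which accepts GROUP BY ... WITH ROLLUP.
val hiveCtx = new HiveContext(sc)  // sc is the SparkContext provided by spark-shell
val json_df = hiveCtx.read.json("/data/jpl/band/2016/10/06/raw_json_tiny/*/*")
json_df.registerTempTable("json_tb")

val result = hiveCtx.sql(
  """select service_id, product, os_name, action_id, classifier, language, country, app_ver,
    |       count(*), count(distinct device_id), count(distinct user_key)
    |from json_tb
    |group by service_id, product, os_name, action_id, classifier, language, country, app_ver
    |with rollup""".stripMargin)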
10-06-2016
01:55 PM
OMG. "WITH ROLLUP" is definitely what I need. I had never seen this query option. It looks like almost all SQL engines support it. Great, thanks.
10-06-2016
08:47 AM
Hello. With this table:

Country | Gender | Event | User
US      | F      | A     | #1
US      | F      | B     | #1
CA      | F      | A     | #2
IN      | M      | A     | #3

it is simple to get 3-dimension OLAP cube data with:

select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event
https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html
But its result is not what I expected. What I have already done with Hadoop M/R jobs is like below:
- every combination of dimensions should be calculated
- each combination has two metrics: count(*) and count(distinct userId)

Country | Gender | Event | Event Count | User Count
US      | -      | -     | 2           | 1
US      | F      | -     | 2           | 2
US      | F      | A     | 1           | 1
US      | F      | B     | 1           | 1
-       | F      | -     | 3           | 2
-       | F      | A     | 1           | 1
-       | F      | B     | 1           | 1
-       | -      | A     | 3           | 3
-       | -      | B     | 1           | 1
CA      | -      | -     | 1           | 1
CA      | F      | -     | 1           | 1
CA      | F      | A     | 1           | 1
IN      | -      | -     | 1           | 1
IN      | M      | -     | 1           | 1
IN      | M      | A     | 1           | 1
-       | M      | -     | 1           | 1
-       | M      | A     | 1           | 1
IN      | -      | A     | 1           | 1

I have successfully done this with Hadoop MapReduce, but as the raw data grows the MR job takes too long (300 million logs take about 6 hours). Of course the same result can be generated by multiple Spark SQL queries, but we have 5~10 dimensions in the real world, and that cost does not seem cheap either.

# 1 dimension
select country, count(*), count(distinct userId) from TABLE group by country
select gender, count(*), count(distinct userId) from TABLE group by gender
select event, count(*), count(distinct userId) from TABLE group by event
# 2 dimensions
select country, event, count(*), count(distinct userId) from TABLE group by country, event
select country, gender, count(*), count(distinct userId) from TABLE group by country, gender
select gender, event, count(*), count(distinct userId) from TABLE group by gender, event
# 3 dimensions
select country, gender, event, count(*), count(distinct userId) from TABLE group by country, gender, event

So I am considering converting this MR job to something in Spark, but I have not found any example that fits this problem. Is there a good reference for this kind of work? Any ideas? (I know Druid provides a similar function, but it is not an option for us for several reasons.) Thanks!
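Following up on the CUBE/ROLLUP direction from the later replies above: the DataFrame cube API covers every dimension combination in one pass. A minimal sketch, assuming a Spark 1.6 DataFrame named logs with the columns from the example table (logs and the column names are illustrative):

import org.apache.spark.sql.functions.{count, countDistinct, lit}

// cube() emits one output row per combination of the grouping columns,
// including all subtotal rows; rolled-up dimensions come back as null.
val cubeResult = logs
  .cube("country", "gender", "event")
  .agg(count(lit(1)).as("event_count"),
       countDistinct("userId").as("user_count"))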
Labels:
- Apache Spark
06-14-2016
08:55 AM
It turned out to be a log level problem (it had been set to DEBUG). I will also try to increase the mapper count (each container's heap size is set high enough). Thanks @Benjamin Leonhardi. By the way, is there any document that explains each counter name and value in detail?
06-14-2016
05:22 AM
Hello. I have a query like:
SELECT
a
, b
, c
, d
, count(1) e
FROM
XXX
WHERE
created_date = '20160613'
AND a is not null
GROUP BY
a
, b
, c
, d

It took 17 minutes in total, which is several times longer than I expected. This is the result of the Tez DAG: most of the time was spent in the Reduce stage (15 minutes). And this is the task counter of one of the reducer tasks. With this data, could you please point out the time-consuming factor of the reduce task? Will simply giving it more reducers reduce the total DAG time? I am using Hive 1.2, Hadoop 2, and Tez 0.6, and the Hive table format is AVRO.
Labels:
- Apache Hive
05-10-2016
01:20 AM
Using Apache Ambari version 2.2.1.0!
05-10-2016
01:18 AM
It works perfectly!
Thanks a lot.
04-25-2016
12:35 AM
Hi. I've found that with Ambari the hive.aux.jars.path configuration behaves very strangely. There are two problems.

1. Custom hive-site configuration does not work: even though I add "hive.aux.jars.path" to the hive-site configuration, it is ignored.

2. hive-env.sh template editing: if I add this line to the hive-env.sh template

export HIVE_AUX_JARS_PATH="$HIVE_AUX_JARS_PATH,hdfs:///user/elasticsearch/elasticsearch-hadoop-2.3.0.jar"

the Hive server starts with a wrong hive.aux.jars.path:

$ ps -ef | grep java | grep erver2
hive 25116 1 2 21:38 ? 00:00:26 /usr/jdk64/jdk1.8.0_60/bin/java -Xmx1024m -Dhdp ........ org.apache.hive.service.server.HiveServer2 --hiveconf hive.aux.jars.path=file:///usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar ,file:// hdfs,file://///user/band_dev/elasticsearch/elasticsearch-hadoop-2.3.0.jar

I have tried many configurations but no luck. Why does Ambari turn hdfs:///user/elasticsearch/elasticsearch-hadoop-2.3.0.jar into file:///usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar ,file:// hdfs,file://///user/band_dev/elasticsearch/elasticsearch-hadoop-2.3.0.jar ? Looking closely, the part ",file:// hdfs,file://///user/b..." should probably be ",hdfs:///user/b...". If I cannot set hive.aux.jars.path in the hive-site configuration, I at least want hive-env.sh to work with an HDFS location; then I would not have to place the jar on every Hive server (with HA configured). Any idea?
Labels:
- Apache Ambari
- Apache Hive