- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Converting Lab3 PIG script to HIVE
- Labels:
-
Apache Hive
-
Apache Pig
Created ‎05-11-2016 04:57 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In Lab3 - using PIG to calculate risk factor, the PIG script mentioned is:
a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader(); b = filter a by event != 'normal'; c = foreach b generate driverid, event, (int) '1' as occurance; d = group c by driverid; e = foreach d generate group as driverid, SUM(c.occurance) as t_occ; g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader(); h = join e by driverid, g by driverid; final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor; store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();
I tried doing the same using HIVE. Below is HIVE script:
CREATE TABLE riskfactor_hive STORED AS ORC AS select t2.driverid as driverid, t2.events as events, dm.totmiles as totmiles, dm.totmiles/t2.events as riskfactor from ( select t1.driverid, count(t1.occurance) as events from ( select driverid, event, 1 as occurance from geolocation where event != 'normal' ) t1 group by t1.driverid ) t2 join drivermileage dm on t2.driverid = dm.driverid ;
Can we further optimise the HIVE script to make it run even faster?
Much Thanks!
Created ‎05-11-2016 02:16 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Vaibhav YadavI think you can try collapsing the inner query as follows:
select driverid, count(event) from geolocation where event!='normal' group by driverid;
Created ‎05-11-2016 11:47 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Vaibhav Yadav, I won't comment on the SQL optimisation (perhaps someone else can), but you can do a lot of tuning to improve Hive performance and there are a lot of articles in the community to advise you. First of all, you should be using Tez as the hive execution engine instead of MapReduce and then you can do a lot of further tweaks: see for example this article which provides some advice on tuning Tez.
I'll leave the thread open for others to comment further, but this will definitely get you started!
Created ‎05-12-2016 04:28 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks @Ana Gillan.. I'll try that..
Created ‎05-11-2016 02:16 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Vaibhav YadavI think you can try collapsing the inner query as follows:
select driverid, count(event) from geolocation where event!='normal' group by driverid;
Created ‎05-12-2016 04:29 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks @grajagopal. That just worked
