Created 05-11-2016 04:57 AM
In Lab3 - using PIG to calculate risk factor, the PIG script mentioned is:
a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader(); b = filter a by event != 'normal'; c = foreach b generate driverid, event, (int) '1' as occurance; d = group c by driverid; e = foreach d generate group as driverid, SUM(c.occurance) as t_occ; g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader(); h = join e by driverid, g by driverid; final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor; store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();
I tried doing the same using HIVE. Below is HIVE script:
CREATE TABLE riskfactor_hive STORED AS ORC AS select t2.driverid as driverid, t2.events as events, dm.totmiles as totmiles, dm.totmiles/t2.events as riskfactor from ( select t1.driverid, count(t1.occurance) as events from ( select driverid, event, 1 as occurance from geolocation where event != 'normal' ) t1 group by t1.driverid ) t2 join drivermileage dm on t2.driverid = dm.driverid ;
Can we further optimise the HIVE script to make it run even faster?
Much Thanks!
Created 05-11-2016 02:16 PM
@Vaibhav YadavI think you can try collapsing the inner query as follows:
select driverid, count(event) from geolocation where event!='normal' group by driverid;
Created 05-11-2016 11:47 AM
Hi @Vaibhav Yadav, I won't comment on the SQL optimisation (perhaps someone else can), but you can do a lot of tuning to improve Hive performance and there are a lot of articles in the community to advise you. First of all, you should be using Tez as the hive execution engine instead of MapReduce and then you can do a lot of further tweaks: see for example this article which provides some advice on tuning Tez.
I'll leave the thread open for others to comment further, but this will definitely get you started!
Created 05-12-2016 04:28 AM
Thanks @Ana Gillan.. I'll try that..
Created 05-11-2016 02:16 PM
@Vaibhav YadavI think you can try collapsing the inner query as follows:
select driverid, count(event) from geolocation where event!='normal' group by driverid;
Created 05-12-2016 04:29 AM
Thanks @grajagopal. That just worked