In Lab3 - using PIG to calculate risk factor, the PIG script mentioned is:
a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader(); b = filter a by event != 'normal'; c = foreach b generate driverid, event, (int) '1' as occurance; d = group c by driverid; e = foreach d generate group as driverid, SUM(c.occurance) as t_occ; g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader(); h = join e by driverid, g by driverid; final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor; store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();
I tried doing the same using HIVE. Below is HIVE script:
CREATE TABLE riskfactor_hive STORED AS ORC AS select t2.driverid as driverid, t2.events as events, dm.totmiles as totmiles, dm.totmiles/t2.events as riskfactor from ( select t1.driverid, count(t1.occurance) as events from ( select driverid, event, 1 as occurance from geolocation where event != 'normal' ) t1 group by t1.driverid ) t2 join drivermileage dm on t2.driverid = dm.driverid ;
Can we further optimise the HIVE script to make it run even faster?
Hi @Vaibhav Yadav, I won't comment on the SQL optimisation (perhaps someone else can), but you can do a lot of tuning to improve Hive performance and there are a lot of articles in the community to advise you. First of all, you should be using Tez as the hive execution engine instead of MapReduce and then you can do a lot of further tweaks: see for example this article which provides some advice on tuning Tez.
I'll leave the thread open for others to comment further, but this will definitely get you started!