- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Converting Lab3 PIG script to HIVE
- Labels:
-
Apache Hive
-
Apache Pig
Created 05-11-2016 04:57 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
In Lab3 - using PIG to calculate risk factor, the PIG script mentioned is:
a = LOAD 'geolocation' using org.apache.hive.hcatalog.pig.HCatLoader(); b = filter a by event != 'normal'; c = foreach b generate driverid, event, (int) '1' as occurance; d = group c by driverid; e = foreach d generate group as driverid, SUM(c.occurance) as t_occ; g = LOAD 'drivermileage' using org.apache.hive.hcatalog.pig.HCatLoader(); h = join e by driverid, g by driverid; final_data = foreach h generate $0 as driverid, $1 as events, $3 as totmiles, (float) $3/$1 as riskfactor; store final_data into 'riskfactor' using org.apache.hive.hcatalog.pig.HCatStorer();
I tried doing the same using HIVE. Below is HIVE script:
CREATE TABLE riskfactor_hive STORED AS ORC AS select t2.driverid as driverid, t2.events as events, dm.totmiles as totmiles, dm.totmiles/t2.events as riskfactor from ( select t1.driverid, count(t1.occurance) as events from ( select driverid, event, 1 as occurance from geolocation where event != 'normal' ) t1 group by t1.driverid ) t2 join drivermileage dm on t2.driverid = dm.driverid ;
Can we further optimise the HIVE script to make it run even faster?
Much Thanks!
Created 05-11-2016 02:16 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Vaibhav YadavI think you can try collapsing the inner query as follows:
select driverid, count(event) from geolocation where event!='normal' group by driverid;
Created 05-11-2016 11:47 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Vaibhav Yadav, I won't comment on the SQL optimisation (perhaps someone else can), but you can do a lot of tuning to improve Hive performance and there are a lot of articles in the community to advise you. First of all, you should be using Tez as the hive execution engine instead of MapReduce and then you can do a lot of further tweaks: see for example this article which provides some advice on tuning Tez.
I'll leave the thread open for others to comment further, but this will definitely get you started!
Created 05-12-2016 04:28 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks @Ana Gillan.. I'll try that..
Created 05-11-2016 02:16 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
@Vaibhav YadavI think you can try collapsing the inner query as follows:
select driverid, count(event) from geolocation where event!='normal' group by driverid;
Created 05-12-2016 04:29 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks @grajagopal. That just worked
