- Subscribe to RSS Feed
- Mark Question as New
- Mark Question as Read
- Float this Question for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
Skipping rows in hive while querying
- Labels:
-
Apache Hive
Created ‎07-11-2018 10:36 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi All,
Is there any way we can skip to the next row based on some filter conditions in hive ?
Ex: I have a table landing
hive> desc landing; OK id string iud string time timestamp
content
1 I 2018-07-11 07:10:17 1 D 2018-07-11 07:11:16 2 I 2018-07-11 07:29:17 2 D 2018-07-11 07:31:21 3 I 2018-07-11 08:22:22 3 U 2018-07-11 09:12:22 5 I 2018-07-11 09:22:22
I want to insert into table B with following conditions --
When the
1) id has D flag then it should not be inserted (Skipped) ,
2) when it has only U or I flag then insert ,
3) when id has I,U followed by D flag then it should not insert (Skip)
Output
3 U 2018-07-11 09:12:22 5 I 2018-07-11 09:22:22
I have written a code which is using row_number to eliminate duplicate records , but not satisfying the above conditions.
select * from (select *,row_number() over(partition by id order by time desc) AS rowid FROM landing where iud <> 'D') r where r.rowid=1;
1 I 2018-07-11 07:10:17 1 2 I 2018-07-11 07:29:17 1 3 U 2018-07-11 09:12:22 1 5 I 2018-07-11 09:22:22 1
Help is appreciated .
Created ‎07-11-2018 02:08 PM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I think you would probably want to take a look at the "MERGE" statement;
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Merge
but afaik it would need to only keep the last operation on the row to make it work; which is what you are also suggesting (I,U,D = D)...
-- merge is only supported on transactional tables set hive.support.concurrency=true; set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; drop table if exists target; drop table if exists upd; create table target (id string, t timestamp) stored as orc tblproperties('transactional'='true'); create table upd (id string, iud string, t timestamp); insert into target values ('1','2001-11-11'), ('10','2001-11-11'); insert into upd values ('1','D','2001-11-11'), ('2','U','2001-10-11'), ('2','D','2001-11-11'), ('2','I','2001-12-11'), ('3','U','2001-10-11'), ('3','D','2001-11-11'), ('4','I','2001-11-11'), ('10','D','2001-11-11'); -- the following will filter the "final" updates select * from ( select *,LEAD(id,1) over (partition by id order by t) as lead from upd order by t ) tt where lead is null; -- this merge statement will bake the updates into the target table merge into target as ta using ( select * from ( select *,LEAD(id,1) over (partition by id order by t) as lead from upd order by t ) tt where lead is null ) as up ON ta.id=up.id when matched and up.iud='D' then DELETE when matched and up.iud='U' then update set t=up.t when not matched and up.iud='I' then insert values (up.id,up.t);
for more info about using LEAD in this query: https://sqlsunday.com/2016/04/11/last-row-per-group/
