Support Questions

Find answers, ask questions, and share your expertise

Skipping rows in hive while querying

avatar
Contributor

Hi All,

Is there any way we can skip to the next row based on some filter conditions in hive ?

Ex: I have a table landing

hive> desc landing;
OK
id                      string
iud                     string
time                    timestamp


content

1       I       2018-07-11 07:10:17
1       D       2018-07-11 07:11:16
2       I       2018-07-11 07:29:17
2       D       2018-07-11 07:31:21
3	I	2018-07-11 08:22:22
3	U	2018-07-11 09:12:22
5	I	2018-07-11 09:22:22

I want to insert into table B with following conditions --

When the

1) id has D flag then it should not be inserted (Skipped) ,

2) when it has only U or I flag then insert ,

3) when id has I,U followed by D flag then it should not insert (Skip)

Output

3	U	2018-07-11 09:12:22
5	I	2018-07-11 09:22:22

I have written a code which is using row_number to eliminate duplicate records , but not satisfying the above conditions.

select * from (select *,row_number() over(partition by id order by time desc) AS rowid FROM landing where iud <> 'D') r where r.rowid=1;
1       I       2018-07-11 07:10:17     1
2       I       2018-07-11 07:29:17     1
3	U	2018-07-11 09:12:22     1
5	I	2018-07-11 09:22:22     1

Help is appreciated .

1 REPLY 1

avatar
Contributor

I think you would probably want to take a look at the "MERGE" statement;

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Merge

but afaik it would need to only keep the last operation on the row to make it work; which is what you are also suggesting (I,U,D = D)...

-- merge is only supported on transactional tables
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

drop table if exists target;
drop table if exists upd;

create table target (id string, t timestamp)
        stored as orc 
        tblproperties('transactional'='true');
create table upd (id string, iud string, t timestamp);


insert into target values ('1','2001-11-11'),
                        ('10','2001-11-11');
insert into upd values  ('1','D','2001-11-11'),
                        ('2','U','2001-10-11'),
                        ('2','D','2001-11-11'),
                        ('2','I','2001-12-11'),
                        ('3','U','2001-10-11'),
                        ('3','D','2001-11-11'),
                        ('4','I','2001-11-11'),
                        ('10','D','2001-11-11');

-- the following will filter the "final" updates
select * from (
        select *,LEAD(id,1) over (partition by id order by t) as lead
        from upd
        order by t
        ) tt where lead is null;



-- this merge statement will bake the updates into the target table
merge into target as ta using
        (
                select * from (
                        select *,LEAD(id,1) over (partition by id order by t) as lead
                        from upd
                        order by t
                ) tt where lead is null
        ) as up
ON ta.id=up.id
when matched and up.iud='D' then DELETE
when matched and up.iud='U' then update set t=up.t
when not matched and up.iud='I' then insert values (up.id,up.t);

for more info about using LEAD in this query: https://sqlsunday.com/2016/04/11/last-row-per-group/