Support Questions
Find answers, ask questions, and share your expertise
Announcements
Alert: Please see the Cloudera blog for information on the Cloudera Response to CVE-2021-4428

Skipping rows in hive while querying

Explorer

Hi All,

Is there any way we can skip to the next row based on some filter conditions in hive ?

Ex: I have a table landing

hive> desc landing;
OK
id                      string
iud                     string
time                    timestamp


content

1       I       2018-07-11 07:10:17
1       D       2018-07-11 07:11:16
2       I       2018-07-11 07:29:17
2       D       2018-07-11 07:31:21
3	I	2018-07-11 08:22:22
3	U	2018-07-11 09:12:22
5	I	2018-07-11 09:22:22

I want to insert into table B with following conditions --

When the

1) id has D flag then it should not be inserted (Skipped) ,

2) when it has only U or I flag then insert ,

3) when id has I,U followed by D flag then it should not insert (Skip)

Output

3	U	2018-07-11 09:12:22
5	I	2018-07-11 09:22:22

I have written a code which is using row_number to eliminate duplicate records , but not satisfying the above conditions.

select * from (select *,row_number() over(partition by id order by time desc) AS rowid FROM landing where iud <> 'D') r where r.rowid=1;
1       I       2018-07-11 07:10:17     1
2       I       2018-07-11 07:29:17     1
3	U	2018-07-11 09:12:22     1
5	I	2018-07-11 09:22:22     1

Help is appreciated .

1 REPLY 1

Cloudera Employee

I think you would probably want to take a look at the "MERGE" statement;

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Merge

but afaik it would need to only keep the last operation on the row to make it work; which is what you are also suggesting (I,U,D = D)...

-- merge is only supported on transactional tables
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

drop table if exists target;
drop table if exists upd;

create table target (id string, t timestamp)
        stored as orc 
        tblproperties('transactional'='true');
create table upd (id string, iud string, t timestamp);


insert into target values ('1','2001-11-11'),
                        ('10','2001-11-11');
insert into upd values  ('1','D','2001-11-11'),
                        ('2','U','2001-10-11'),
                        ('2','D','2001-11-11'),
                        ('2','I','2001-12-11'),
                        ('3','U','2001-10-11'),
                        ('3','D','2001-11-11'),
                        ('4','I','2001-11-11'),
                        ('10','D','2001-11-11');

-- the following will filter the "final" updates
select * from (
        select *,LEAD(id,1) over (partition by id order by t) as lead
        from upd
        order by t
        ) tt where lead is null;



-- this merge statement will bake the updates into the target table
merge into target as ta using
        (
                select * from (
                        select *,LEAD(id,1) over (partition by id order by t) as lead
                        from upd
                        order by t
                ) tt where lead is null
        ) as up
ON ta.id=up.id
when matched and up.iud='D' then DELETE
when matched and up.iud='U' then update set t=up.t
when not matched and up.iud='I' then insert values (up.id,up.t);

for more info about using LEAD in this query: https://sqlsunday.com/2016/04/11/last-row-per-group/