Support Questions

Bangalore · ‎07-11-2018

Hi All,

Is there any way we can skip to the next row based on some filter conditions in hive ?

Ex: I have a table landing

hive> desc landing;
OK
id                      string
iud                     string
time                    timestamp

content

1       I       2018-07-11 07:10:17
1       D       2018-07-11 07:11:16
2       I       2018-07-11 07:29:17
2       D       2018-07-11 07:31:21
3	I	2018-07-11 08:22:22
3	U	2018-07-11 09:12:22
5	I	2018-07-11 09:22:22

I want to insert into table B with following conditions --

When the

1) id has D flag then it should not be inserted (Skipped) ,

2) when it has only U or I flag then insert ,

3) when id has I,U followed by D flag then it should not insert (Skip)

Output

3	U	2018-07-11 09:12:22
5	I	2018-07-11 09:22:22

I have written a code which is using row_number to eliminate duplicate records , but not satisfying the above conditions.

select * from (select *,row_number() over(partition by id order by time desc) AS rowid FROM landing where iud <> 'D') r where r.rowid=1;

1       I       2018-07-11 07:10:17     1
2       I       2018-07-11 07:29:17     1
3	U	2018-07-11 09:12:22     1
5	I	2018-07-11 09:22:22     1

Help is appreciated .

zhaindrich · ‎07-11-2018

I think you would probably want to take a look at the "MERGE" statement;

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Merge

but afaik it would need to only keep the last operation on the row to make it work; which is what you are also suggesting (I,U,D = D)...

-- merge is only supported on transactional tables
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

drop table if exists target;
drop table if exists upd;

create table target (id string, t timestamp)
        stored as orc 
        tblproperties('transactional'='true');
create table upd (id string, iud string, t timestamp);


insert into target values ('1','2001-11-11'),
                        ('10','2001-11-11');
insert into upd values  ('1','D','2001-11-11'),
                        ('2','U','2001-10-11'),
                        ('2','D','2001-11-11'),
                        ('2','I','2001-12-11'),
                        ('3','U','2001-10-11'),
                        ('3','D','2001-11-11'),
                        ('4','I','2001-11-11'),
                        ('10','D','2001-11-11');

-- the following will filter the "final" updates
select * from (
        select *,LEAD(id,1) over (partition by id order by t) as lead
        from upd
        order by t
        ) tt where lead is null;



-- this merge statement will bake the updates into the target table
merge into target as ta using
        (
                select * from (
                        select *,LEAD(id,1) over (partition by id order by t) as lead
                        from upd
                        order by t
                ) tt where lead is null
        ) as up
ON ta.id=up.id
when matched and up.iud='D' then DELETE
when matched and up.iud='U' then update set t=up.t
when not matched and up.iud='I' then insert values (up.id,up.t);

for more info about using LEAD in this query: https://sqlsunday.com/2016/04/11/last-row-per-group/

Cloudera Community

Support Questions

Skipping rows in hive while querying

Row vs Columnar Storage For Hive

Ambari Hive View output skips every 101st row

How to Skip header row using Pig

Counting rows in multiple partitions in Hive query

Hive insert query optimization

Loading bucketed table ignores or skips rows

Impala Query on Hive transactional table returns 0...

Hive Query Recovery Tactics: Handling Failures thr...

Optimizing Hive queries for ORC formatted tables

Hive Row Level Access Restriction Using Ranger