Reply
Explorer
Posts: 116
Registered: ‎09-14-2016

impala query issue

Hi,

 

I know this is a generic question, but hope if you can give some suggestion.

 

We have two tables that capture the change history, in one case we need the latest, so we create two views to get the max modified date, and join these two views which is slow.

 

One approach maybe to create/copy new tables that have the latest, but need scheduling to copy the newer data which introduce new complexitity.

 

Thanks

Shannon 

Highlighted
Posts: 376
Topics: 11
Kudos: 58
Solutions: 32
Registered: ‎09-02-2016

Re: impala query issue

@ponypony

 

I hope you have both history and latest records on the same table and need to query the latest records for your business needs, and the latest record will be identified by the max modified date.

 

This scenario is called "Change data capture (CDC)" and this can be done in may different ways "using separate job" and query as follows

Ex:

1. Set up an indicator (yes/no) for the latest record  and query all the record with indicator=yes to get the latest record instead of max(modified_date). This will be faster.

2. Increase the version number and get the max version no, instead of max modified_date

3. setup the start and end date for each version history and latest record will NOT have end date, so you can filter with end_date is NULL to get the latest record. 

4. You can refer few more methods from this link

5. Finally, you can also choose the option that you have mentioned (create a new table with latest record). if you have enough space to maintain duplicate records

 

https://en.wikipedia.org/wiki/Change_data_capture

 

NOTE: Pls understand the difference before proceed

a. Applying max(modified_date) is easy to write query but execution takes more time

b. To apply CDC, you need to think proper logic and make sure logic will not break in differenent scenarios and run a separate job to refresh the logic. But exectuion will be faster.

Explorer
Posts: 116
Registered: ‎09-14-2016

Re: impala query issue

Thanks, the issue is the source data does not capture history, we are using kafka to get the updates by modified date, so it is hard to put version/etc.

 

Shannon

Explorer
Posts: 116
Registered: ‎09-14-2016

Re: impala query issue

Also our modified date is saved as bigint.

Announcements