Bi-temporal data retention in Kudu

Hello,

I would like to store data sets with both a business validity and a transaction validity.

Does it make sense to use Kudu for bi-temporal data retention?

Best regards

Stefan

1 ACCEPTED SOLUTION

Kudu is an MVCC-based data store. As a result, when an update or delete takes place, a new version of the row is written and marked as the currently valid one. Subsequent queries return only the current version and ignore out-of-date versions. Periodically, Kudu runs a background maintenance process that removes old versions of the row to reclaim space. This process is called "compaction".

Currently, Kudu does not provide guarantees at the table level on how long old versions of a row will remain on the system before compaction. Kudu does allow the user to specify a system-wide "ancient history mark" that defines how long previous row versions must be kept before they are considered eligible for removal, but for "temporal table support" I think a more granular configuration is required. By default, the ancient history mark is also set to a low value (15 minutes) in order to aggressively reclaim space.

 

In the direct Kudu API, you can specify a timestamp when reading data (a scan in snapshot read mode), and if it is set to a time in the past you will get the row as it existed at that timestamp, provided that version has not already been removed by compaction. This functionality is not currently accessible through the supported SQL options (Impala, Spark SQL).
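To make the interaction between timestamped reads and the ancient history mark concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Kudu client API and not Kudu's storage format: the `VersionedStore` class, its methods, and the `history_max_age` parameter are all hypothetical names invented for illustration, and the rule that reads older than the mark are rejected is an assumption of the model.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Toy model of MVCC row history (hypothetical; not Kudu's implementation)."""

    history_max_age: int  # stand-in for the ancient history mark, in seconds
    _versions: dict = field(default_factory=dict)  # key -> [(timestamp, value), ...]

    def write(self, key, value, ts):
        # Every update appends a new timestamped version.
        # The model assumes timestamps increase for a given key.
        self._versions.setdefault(key, []).append((ts, value))

    def read(self, key, now, as_of=None):
        # A plain read sees the current version; as_of asks for a past snapshot.
        as_of = now if as_of is None else as_of
        if as_of < now - self.history_max_age:
            raise ValueError("snapshot is older than the ancient history mark")
        versions = self._versions.get(key, [])
        timestamps = [ts for ts, _ in versions]
        idx = bisect_right(timestamps, as_of)  # count of versions written at or before as_of
        return versions[idx - 1][1] if idx else None

    def compact(self, now):
        # Drop versions that no snapshot at or after the mark can still see.
        mark = now - self.history_max_age
        for key, versions in self._versions.items():
            timestamps = [ts for ts, _ in versions]
            # Keep the newest version at or before the mark, plus everything later.
            keep_from = max(bisect_right(timestamps, mark) - 1, 0)
            self._versions[key] = versions[keep_from:]


store = VersionedStore(history_max_age=900)  # 15 minutes, as in the default above
store.write("row1", "v1", ts=0)
store.write("row1", "v2", ts=600)
print(store.read("row1", now=1000))             # v2 (current version)
print(store.read("row1", now=1000, as_of=300))  # v1 (row as of t=300)
store.compact(now=2000)                         # mark is now t=1100, so v1 is dropped
```

After the `compact` call, asking for the row as of t=300 raises an error in this model, because that snapshot lies before the mark. That is the practical consequence of a short retention window for a transaction-time history.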

 

So, it's possible to do what you're asking, with the following limitations:

1) You have to use the Kudu API.

2) You have to be willing to use the same ancient history mark for the entire system.

3) You need to set the ancient history mark far enough in the past to be useful for your use case, balanced against the extra space required to keep old row versions around.
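For the trade-off in point 3, a rough back-of-envelope estimate can help. The sketch below assumes a steady update rate and a fixed average size per retained row version; the function name and all input numbers are hypothetical. Kudu's real on-disk footprint depends on encoding and compression, so treat the result as a crude upper bound rather than a prediction.

```python
def history_overhead_bytes(updates_per_sec, bytes_per_version, retention_sec):
    """Crude upper bound on space held by old row versions inside the retention window."""
    return updates_per_sec * bytes_per_version * retention_sec


# Hypothetical workload: 100 updates per second, about 200 bytes per old version.
fifteen_minutes = 15 * 60
thirty_days = 30 * 24 * 60 * 60

print(history_overhead_bytes(100, 200, fifteen_minutes))  # 18000000 bytes, about 18 MB
print(history_overhead_bytes(100, 200, thirty_days))      # 51840000000 bytes, about 52 GB
```

Since the mark is system-wide, every table pays this cost, not only the ones that need history.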

 

It's possible that these caveats could be removed with additional work in Kudu and tools it integrates with, but this is how things work currently.
