Created 11-27-2024 03:45 AM
Hi at all,
I got a timeseries table with existing range partitions based on timestamp field, that covers years from 2022 to 2024 month by month. The problem is that last partition has an open bound (PARTITION 1717200000000 <= VALUES) and contains data from "1717200000000" (Saturday, June 1, 2024 0:00:00) to now, and is still storing live data with obviously performance loss during query.
From the docs I haven't found a command that lets me split last partition in other partitions in order to update them, but the only way seems to drop last partition and continue adding since this time, and the problem is that by dropping last partition I will lose three and more months of data, that is obviously not possible.
Is there an alternative to prevent data loss and adding range partitions to improve performances?
Thanks in advance
Created 11-27-2024 06:24 AM
@RicSeas Welcome to the Cloudera Community!
To help you get the best possible solution, I have tagged our Kudu experts @ChrisGe @rblough who may be able to assist you further.
Please keep us updated on your post, and we hope you find a satisfactory solution to your query.
Regards,
Diana Torres,Created 11-27-2024 12:50 PM
@RicSeas this is Ryan, one of the Kudu SMEs. You are correct, there is no method for splitting an existing range partition with an open bound. That being said, with more work we can avert the data loss and prevent any such issue arising in the future. To do this we will need a new table.
1. First, create a new table. We want to use the same schema, but we will make changes to the partition schema.
2. Partition schema change one, don't include the unbounded range partition, only include bounded range partitions. You can continue to add these via the kudu table add_range_partition command in the future; you can also pre-create these ranges some distance into the future.
3. Partition schema change two, add a hash partition, which means the table will be using Multilevel Partitioning. Hash partitioning happens by column, so this means that writes can be divided up by hash, even in the same timestamp. This will split the load for the active range among the hash partitions.
4. Once you have created the new table with a multilevel partition schema, use the kudu table copy command to write the data from the current table to the new table. This will automatically sort the data into the new partition schema.
5. When the table data has been copied successfully, delete the current table, and then rename the new table to match the current table.
Once you have completed this and resume the workload, none of the jobs or clients will notice anything different except the improved performance.