Created on 03-28-2017 02:30 AM - edited 09-16-2022 04:21 AM
I need to add hourly partition in kudu table , for eg a similar create table in impala for parquet table would be :
CREATE TABLE tbl_impala_parquet( col1 int, col2 String, col3 String) PARTITIONED BY ( hour TINYINT, day TINYINT, month TINYINT, year INT ) STORED AS PARQUET;
I could have used range partition on hour with splits from 0-23; but will need year, month, day and then hour as partition. Any suggestion will be greatly appreciated.
Regards,
Kundan
Created 03-28-2017 10:54 AM
Kudu tables can be range partitioned on multiple columns, though the syntax is slightly different than the Parquet syntax. Take a look at https://kudu.apache.org/docs/kudu_impala_integration.html#_specifying_tablet_partitioning to see what I mean.
https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_tables also has much more information.
Created 04-14-2017 08:43 AM
Here is a simple exmaple of month and year wise partitioned table in kudu, thoush partition is not dynamic and will have to add more partition after the existing partitions exhausts:
CREATE TABLE sample_table (year INT,month INT,day INT,some_id INT,some_value INT, PRIMARY KEY(year,month,day,some_id) ) PARTITION BY HASH (month) PARTITIONS 12, RANGE(year) ( PARTITION VALUE = (2017), PARTITION VALUE = (2018), PARTITION VALUE = (2019), PARTITION VALUE = (2020) ) STORED AS KUDU;
Note: The partiton by column has to be declared in the first columns while createing tables, i didnt get the reason where as in parquet it had to be declared in the last.
This will create 48 partitions for replication of 3: (12*3)+(4*3)
After the year 2020 will have to add new partitons :
ALTER TABLE sales_by_year ADD RANGE PARTITION VALUE = 2021;
Created 04-14-2017 10:35 AM
It's also possible to range partition on multiple columns, so you could potentially skip the hash partitions if you wanted (although it requires specifying many more explicit range partitions):
CREATE TABLE sample_table (year INT,month INT,day INT,some_id INT,some_value INT, PRIMARY KEY(year,month,day,some_id))
RANGE(year, month) (
PARTITION VALUE = (2017, 01),
PARTITION VALUE = (2017, 02),
PARTITION VALUE = (2017, 03),
PARTITION VALUE = (2017, 04),
PARTITION VALUE = (2017, 05),
PARTITION VALUE = (2017, 06),
PARTITION VALUE = (2017, 07),
PARTITION VALUE = (2017, 08),
PARTITION VALUE = (2017, 09),
PARTITION VALUE = (2017, 10),
PARTITION VALUE = (2017, 11),
PARTITION VALUE = (2017, 12),
PARTITION VALUE = (2018, 01),
PARTITION VALUE = (2018, 02),
) STORED AS KUDU;
I think the reason you had to specify the year and month up front in the column list is that currently Kudu tables have a restriction that the primary key columns must come first in the column list.
Created 01-31-2024 10:43 PM
Hi @Dan Burkert , i have tried exactly the same DDL and it give the error. I used hue/impala to create the table. Can you help?
Created 05-24-2022 06:07 AM
Is there any way in Kudu, dynamic partitions can be created which can automatically add new partition when the new date(month) is available.
e.g. in the above post
RANGE(year) ( PARTITION VALUE = (2017), PARTITION VALUE = (2018), PARTITION VALUE = (2019), PARTITION VALUE = (2020) )
Values after 2020 will get rejected, any workaround so create partitions based on interval
PARTITION BY RANGE (MY_DATE)
INTERVAL( NUMTODSINTERVAL(1,'YEAR')) -- Syntax from a famous relational DB