Support Questions

Find answers, ask questions, and share your expertise
Announcements
Celebrating as our community reaches 100,000 members! Thank you!

Creating dynamic range partition in kudu

avatar
Contributor

I need to add hourly partition in kudu table , for eg a similar create table in impala for parquet table would be :

CREATE TABLE  tbl_impala_parquet( col1 int, col2 String, col3 String)  PARTITIONED BY (   hour TINYINT,    day TINYINT,    month TINYINT,    year INT ) STORED AS PARQUET;

I could have used range partition on hour with splits from 0-23; but will need year, month, day and then hour as partition. Any suggestion will be greatly appreciated.

 

Regards,

Kundan

5 REPLIES 5

avatar
Expert Contributor

Kudu tables can be range partitioned on multiple columns, though the syntax is slightly different than the Parquet syntax. Take a look at https://kudu.apache.org/docs/kudu_impala_integration.html#_specifying_tablet_partitioning to see what I mean.

 

https://kudu.apache.org/docs/kudu_impala_integration.html#partitioning_tables also has much more information.

avatar
Contributor

Here is a simple exmaple of month and year wise partitioned table in kudu, thoush partition is not dynamic and will have to add more partition after the existing partitions exhausts:

CREATE TABLE sample_table (year INT,month INT,day INT,some_id INT,some_value INT, PRIMARY KEY(year,month,day,some_id) )
 PARTITION BY HASH (month) PARTITIONS 12,
 RANGE(year)
(
  PARTITION VALUE = (2017),
  PARTITION VALUE = (2018),
   PARTITION VALUE = (2019),
   PARTITION VALUE = (2020)
  
) 
 STORED AS KUDU;

Note: The partiton by column has to be declared in the first columns while createing tables, i didnt get the reason where as in parquet it had to be declared in the last.
This will create 48 partitions for replication of 3: (12*3)+(4*3)
After the year 2020 will have to add new partitons :

ALTER TABLE sales_by_year ADD RANGE PARTITION VALUE = 2021;
 

avatar
Contributor

It's also possible to range partition on multiple columns, so you could potentially skip the hash partitions if you wanted (although it requires specifying many more explicit range partitions):

 

CREATE TABLE sample_table (year INT,month INT,day INT,some_id INT,some_value INT, PRIMARY KEY(year,month,day,some_id))
RANGE(year, month) (
PARTITION VALUE = (2017, 01),
PARTITION VALUE = (2017, 02),
PARTITION VALUE = (2017, 03),
PARTITION VALUE = (2017, 04),
PARTITION VALUE = (2017, 05),
PARTITION VALUE = (2017, 06),
PARTITION VALUE = (2017, 07),
PARTITION VALUE = (2017, 08),
PARTITION VALUE = (2017, 09),
PARTITION VALUE = (2017, 10),
PARTITION VALUE = (2017, 11),
PARTITION VALUE = (2017, 12),
PARTITION VALUE = (2018, 01),
PARTITION VALUE = (2018, 02),
) STORED AS KUDU;

 

I think the reason you had to specify the year and month up front in the column list is that currently Kudu tables have a restriction that the primary key columns must come first in the column list.

avatar
Rising Star

Hi @Dan Burkert , i have tried exactly the same DDL and it give the error. I used hue/impala to create the table. Can you help?

  • ParseException: Syntax error in line 4:undefined: RANGE(year, month) ( ^ Encountered: RANGE Expected: CACHED, COMMENT, LOCATION, PARTITION, PARTITIONED, PRODUCED, ROW, SORT, STORED, TBLPROPERTIES, UNCACHED, WITH CAUSED BY: Exception: Syntax error
 

avatar
New Contributor

Is there any way in Kudu, dynamic partitions can be created which can automatically add new partition when the new date(month) is available.

 

e.g. in the above post 

RANGE(year)
(
  PARTITION VALUE = (2017),
  PARTITION VALUE = (2018),
   PARTITION VALUE = (2019),
   PARTITION VALUE = (2020)
  
) 

Values after 2020 will get rejected, any workaround so create partitions based on interval

PARTITION BY RANGE (MY_DATE)
INTERVAL( NUMTODSINTERVAL(1,'YEAR'))  -- Syntax from a famous relational DB