Community Articles

Find and share helpful community-sourced technical articles.
avatar
Cloudera Employee

Kudu: Kudu is an open-source distributed storage engine designed for big data [TB and PB]. Kudu is designed for random access and analytical queries on structured data.  It supports scans, random access, and lookups.

Although it supports full table scans as an OLAP database and returns results super fast, you should always try to do partition pruning in queries in case you don’t need the entire data set. This will result in fewer scans to the backend server, which in turn will increase the capacity of our Kudu database to handle more queries.

Terminology:

  • Kudu backend servers, where data resides on disks, are called Tablet Servers.
  • Kudu tables are partitioned, and each partition is stored in a tablet within the tablet server.

Types of Partition in Kudu:

  • Hash
  • Generally a numeric or alphanumeric value
  • Range
  • Multilevel(Hash + Range)

To check table partitioning:

  1. Open up our SQL Editor and execute.
  2. >> Show create table database.tablename.ShantanuGope_0-1698690539703.png
  3. This command will give you table definition. Last few lines like below will show you the partition. In the following example there is a range partition by column year:ShantanuGope_1-1698690539712.png

 

To check values of the range partition column: 

*****as in this example table partition is range.****

  1. Open up our SQL Editor and execute.
  2. >> Show range partition database.tablename.
  3. Output will be the partition value of _dp_ingesttimestamp column. Below is example of monthly partition.ShantanuGope_2-1698690539732.png

How will the use of the Partition column in query help ?

In the following example query Eleven,  though the row produced is 6, this will still do full table scan and filter out rows from the Kudu side.

select * from default.sample where some_id = 123;

After you run the query, go to can verify from summary. You will see 3 hosts were scanned to get the results. [In this example, there are only 5 tablet servers]

ShantanuGope_3-1698690539696.png

Highlighted section shows how many backend servers it used. In this example it used 3.

We can modify same query by hitting the partition like below:

select * from default.sample where year = 2023;

Above query will result in the same set of data but will not scan all Kudu servers. In the above example it only scanned 1 servers.

ShantanuGope_4-1698690539707.pngSome key things to consider.

  • Do not cast or use any function to partition column .
  • Partition column needs to be used as it is to do partition pruning. For example if we change the above query to: 
    select * from default.sample where cast(year as string)  = “2023”;​

Above query instead of doing scan on one  tablet server it will do 3 tablet server scan.

ShantanuGope_5-1698690539626.png

 



934 Views
0 Kudos