Community Articles

VidyaSargur · ‎11-01-2023

Kudu: Kudu is an open-source distributed storage engine designed for big data [TB and PB]. Kudu is designed for random access and analytical queries on structured data. It supports scans, random access, and lookups.

Although it supports full table scans as an OLAP database and returns results super fast, you should always try to do partition pruning in queries in case you don’t need the entire data set. This will result in fewer scans to the backend server, which in turn will increase the capacity of our Kudu database to handle more queries.

Terminology:

Kudu backend servers, where data resides on disks, are called Tablet Servers.
Kudu tables are partitioned, and each partition is stored in a tablet within the tablet server.

Types of Partition in Kudu:

Hash
Generally a numeric or alphanumeric value
Range
Multilevel(Hash + Range)

To check table partitioning:

Open up our SQL Editor and execute.
>> Show create table database.tablename.
This command will give you table definition. Last few lines like below will show you the partition. In the following example there is a range partition by column year:

To check values of the range partition column:

*****as in this example table partition is range.****

Open up our SQL Editor and execute.
>> Show range partition database.tablename.
Output will be the partition value of _dp_ingesttimestamp column. Below is example of monthly partition.

How will the use of the Partition column in query help ?

In the following example query Eleven, though the row produced is 6, this will still do full table scan and filter out rows from the Kudu side.

select * from default.sample where some_id = 123;

After you run the query, go to can verify from summary. You will see 3 hosts were scanned to get the results. [In this example, there are only 5 tablet servers]

Highlighted section shows how many backend servers it used. In this example it used 3.

We can modify same query by hitting the partition like below:

select * from default.sample where year = 2023;

Above query will result in the same set of data but will not scan all Kudu servers. In the above example it only scanned 1 servers.

Some key things to consider.

Do not cast or use any function to partition column .
Partition column needs to be used as it is to do partition pruning. For example if we change the above query to:
```
select * from default.sample where cast(year as string)  = “2023”;
```

Above query instead of doing scan on one tablet server it will do 3 tablet server scan.

Cloudera Community

Community Articles

How to optimize IMPALA/KUDU queries

Apache Impala

Apache Kudu

Cloudera Data Platform (CDP)