11-01-2023
06:08 PM
Kudu: Kudu is an open-source distributed storage engine for big data at terabyte and petabyte scale. It is designed for random access and analytical queries on structured data, and supports full scans, random access, and point lookups.
Although Kudu supports full table scans as an OLAP database and returns results quickly, you should always aim for partition pruning in queries whenever you do not need the entire data set. Pruning results in fewer scans on the backend servers, which in turn increases the capacity of the Kudu cluster to handle more queries.
Terminology:
Kudu backend servers, where data resides on disks, are called Tablet Servers.
Kudu tables are partitioned, and each partition is stored in a tablet within the tablet server.
Types of Partition in Kudu:
Hash (generally on a numeric or alphanumeric column)
Range
Multilevel (Hash + Range)
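For reference, the three partitioning styles can be sketched in Impala DDL. The table and column names below are hypothetical, chosen only to illustrate the syntax:

```sql
-- Hash partitioning (hypothetical table):
CREATE TABLE sample_hash (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- Range partitioning (hypothetical table):
CREATE TABLE sample_range (
  year INT,
  name STRING,
  PRIMARY KEY (year)
)
PARTITION BY RANGE (year) (
  PARTITION VALUE = 2022,
  PARTITION VALUE = 2023
)
STORED AS KUDU;

-- Multilevel: hash and range combined (hypothetical table):
CREATE TABLE sample_multi (
  id BIGINT,
  year INT,
  name STRING,
  PRIMARY KEY (id, year)
)
PARTITION BY HASH (id) PARTITIONS 4,
             RANGE (year) (PARTITION VALUE = 2023)
STORED AS KUDU;
```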
To check table partitioning:
Open your SQL editor and execute:
>> SHOW CREATE TABLE database.tablename;
This command returns the table definition. The last few lines show the partitioning. In the following example there is a range partition on the column year:
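The tail of the output looks roughly like this (hypothetical schema; your columns and properties will differ):

```sql
-- Last lines of SHOW CREATE TABLE output for a range-partitioned table:
PARTITION BY RANGE (year) (...)
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='...', ...)
```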
To check values of the range partition column:
(This step applies here because the example table uses range partitioning.)
Open your SQL editor and execute:
>> SHOW RANGE PARTITIONS database.tablename;
The output lists the partition values of the _dp_ingesttimestamp column. Below is an example of monthly partitioning.
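The output for monthly partitions looks something like the sketch below (the boundary values are illustrative, not taken from a real table):

```sql
SHOW RANGE PARTITIONS database.tablename;
-- Example output, one row per monthly range partition:
-- "2023-09-01" <= VALUES < "2023-10-01"
-- "2023-10-01" <= VALUES < "2023-11-01"
-- "2023-11-01" <= VALUES < "2023-12-01"
```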
How does using the partition column in a query help?
The following example query produces only 6 rows, yet it still performs a full table scan, with rows filtered out on the Kudu side.
select * from default.sample where some_id = 123;
After you run the query, you can verify this from the query summary: 3 hosts were scanned to get the results. [In this example, there are only 5 tablet servers.]
The highlighted section of the summary shows how many backend servers were used; in this example, 3.
We can modify the same query to hit the partition, like below:
select * from default.sample where year = 2023;
The above query returns the same set of data but does not scan all the Kudu servers; in this example it scanned only 1 server.
Some key things to consider:
Do not cast or apply any function to the partition column.
The partition column must be used as-is for partition pruning to happen. For example, if we change the above query to:
select * from default.sample where cast(year as string) = '2023';
then instead of scanning one tablet server, the query will scan 3 tablet servers.
10-30-2023
02:46 PM
Kudu Command Line Copy: The copy table command can be used to copy one Kudu table to another. The two tables can be in the same cluster or in different clusters. They must have the same table schema but may have different partition schemas. Alternatively, the tool can create the new table using the same table and partition schema as the source table.
Example:
Full table copy command:
kudu table copy <master_addresses> <source_table_name> <dest_master_addresses> -dst_table=<table_name> -write_type=upsert -num_threads=3
Incremental copy command:
kudu table copy <master_addresses> <source_table_name> <dest_master_addresses> -dst_table=<table_name> -write_type=upsert -num_threads=3 -predicates='["AND", [">=", "some_value", 234]]'
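With concrete values filled in, an incremental copy might look like the sketch below. The master addresses, destination table name, and the predicate column and cutoff value are all hypothetical placeholders; the predicate restricts the copy to rows at or above the given timestamp value:

```shell
# Hypothetical incremental copy: only rows with _dp_ingesttimestamp >= 1698624000
# are copied from the source cluster (src-master) to the destination (dst-master).
kudu table copy src-master:7051 default.sample dst-master:7051 \
  -dst_table=default.sample_copy \
  -write_type=upsert \
  -num_threads=3 \
  -predicates='["AND", [">=", "_dp_ingesttimestamp", 1698624000]]'
```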
Spark Backup Utility:
Kudu supports both full and incremental table backups via a job implemented using Apache Spark. Additionally, it supports restoring tables from full and incremental backups via a restore job implemented using Apache Spark.
Example:
Backup:
spark-submit \
--driver-cores 1 \
--driver-memory 1G \
--executor-cores 3 \
--executor-memory 1G \
--master yarn \
--name KuduBackup_Job1 \
--class org.apache.kudu.backup.KuduBackup /opt/cloudera/parcels/CDH/lib/kudu/kudu-backup2_2.11.jar \
--kuduMasterAddresses xxx.xx.xxx.73 \
--rootPath hdfs:///user/root \
default.sample
Restore:
spark-submit \
--driver-cores 1 \
--driver-memory 1G \
--executor-cores 2 \
--executor-memory 1G \
--master yarn \
--name KuduRestore_Job1 \
--class org.apache.kudu.backup.KuduRestore /opt/cloudera/parcels/CDH/jars/kudu-backup2_2.11-1.15.0.7.1.7.0-551.jar \
--kuduMasterAddresses xxx.xx.xxx.136 \
--rootPath hdfs:///user/root \
--createTables false \
--newDatabaseName spark_copy default.sample
KUDU Spark Backup-Restore Incremental Scenarios

Scenario | Backup | Restore
Inserting new rows | New partition is created for the incremental rows | Incremental data is loaded
Updating a row value | New partition is created for the incremental/updated rows | Incremental/updated data is loaded
Changing a column data type | Not supported in Kudu | Not supported in Kudu
Adding a new column | New partition is created for the incremental/updated rows; no full load required | Add the column first, then run the restore
Deleting a column | New partition is created for the incremental/updated rows; no full load required | Delete the column first, then run the restore
Deleting a row | New partition is created for the deleted rows; no full load required | Rows are deleted by the restore utility
Command Line Copy Incremental Scenarios

Scenario | Notes
Inserting new rows | As long as rows carry a timestamp column, incremental data can be copied and loaded
Updating a row value | Incremental copy works only if the update also refreshes the timestamp value
Changing a column data type | Not supported in Kudu
Adding a new column | Add the column on the destination first, else the incremental copy will fail
Deleting a column | Delete the column on the destination first, then run the incremental table copy
Deleting a row | Rows must be deleted on the destination as soon as they are deleted in the source table