11-01-2023
06:08 PM
Kudu: Kudu is an open-source distributed storage engine for big data at terabyte and petabyte scale. It is designed for random access and analytical queries on structured data, and supports full scans, random access, and point lookups.
Although Kudu supports full table scans as an OLAP database and returns results quickly, you should always aim for partition pruning in queries whenever you do not need the entire data set. Pruning results in fewer scans on the backend servers, which in turn increases the capacity of the Kudu cluster to handle more queries.
Terminology:
Kudu backend servers, where data resides on disks, are called Tablet Servers.
Kudu tables are partitioned, and each partition is stored in a tablet within the tablet server.
Types of Partition in Kudu:
Hash (generally on a numeric or alphanumeric column)
Range
Multilevel (Hash + Range)
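For reference, the three partitioning styles can be sketched in Impala DDL. The table and column names below are hypothetical, chosen only to illustrate the syntax:

```sql
-- Hash partitioning (hypothetical table):
CREATE TABLE sample_hash (
  id BIGINT,
  name STRING,
  PRIMARY KEY (id)
)
PARTITION BY HASH (id) PARTITIONS 4
STORED AS KUDU;

-- Range partitioning (hypothetical table):
CREATE TABLE sample_range (
  year INT,
  name STRING,
  PRIMARY KEY (year)
)
PARTITION BY RANGE (year) (
  PARTITION VALUE = 2022,
  PARTITION VALUE = 2023
)
STORED AS KUDU;

-- Multilevel: hash and range combined (hypothetical table):
CREATE TABLE sample_multi (
  id BIGINT,
  year INT,
  name STRING,
  PRIMARY KEY (id, year)
)
PARTITION BY HASH (id) PARTITIONS 4,
             RANGE (year) (PARTITION VALUE = 2023)
STORED AS KUDU;
```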
To check table partitioning:
Open your SQL editor and execute:
>> SHOW CREATE TABLE database.tablename;
This command returns the table definition. The last few lines show the partitioning. In the following example there is a range partition on the column year:
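The tail of the output looks roughly like this (hypothetical schema; your columns and properties will differ):

```sql
-- Last lines of SHOW CREATE TABLE output for a range-partitioned table:
PARTITION BY RANGE (year) (...)
STORED AS KUDU
TBLPROPERTIES ('kudu.master_addresses'='...', ...)
```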
To check values of the range partition column:
(This step applies here because the example table uses range partitioning.)
Open your SQL editor and execute:
>> SHOW RANGE PARTITIONS database.tablename;
The output lists the partition values of the _dp_ingesttimestamp column. Below is an example of monthly partitioning.
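The output for monthly partitions looks something like the sketch below (the boundary values are illustrative, not taken from a real table):

```sql
SHOW RANGE PARTITIONS database.tablename;
-- Example output, one row per monthly range partition:
-- "2023-09-01" <= VALUES < "2023-10-01"
-- "2023-10-01" <= VALUES < "2023-11-01"
-- "2023-11-01" <= VALUES < "2023-12-01"
```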
How does using the partition column in a query help?
The following example query produces only 6 rows, yet it still performs a full table scan, with rows filtered out on the Kudu side.
select * from default.sample where some_id = 123;
After you run the query, you can verify this from the query summary: 3 hosts were scanned to get the results. [In this example, there are only 5 tablet servers.]
The highlighted section of the summary shows how many backend servers were used; in this example, 3.
We can modify the same query to hit the partition, like below:
select * from default.sample where year = 2023;
The above query returns the same set of data but does not scan all the Kudu servers; in this example it scanned only 1 server.
Some key things to consider:
Do not cast or apply any function to the partition column.
The partition column must be used as-is for partition pruning to happen. For example, if we change the above query to:
select * from default.sample where cast(year as string) = '2023';
then instead of scanning one tablet server, the query will scan 3 tablet servers.
10-30-2023
02:46 PM
Kudu Command Line Copy: The copy table command can be used to copy one Kudu table to another. The two tables can be in the same cluster or in different clusters. They must have the same table schema but may have different partition schemas. Alternatively, the tool can create the new table using the same table and partition schema as the source table.
Example:
Full table copy command:
kudu table copy <master_addresses> <source_table_name> <dest_master_addresses> -dst_table=<table_name> -write_type=upsert -num_threads=3
Incremental copy command:
kudu table copy <master_addresses> <source_table_name> <dest_master_addresses> -dst_table=<table_name> -write_type=upsert -num_threads=3 -predicates='["AND", [">=", "some_value", 234]]'
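With concrete values filled in, an incremental copy might look like the sketch below. The master addresses, destination table name, and the predicate column and cutoff value are all hypothetical placeholders; the predicate restricts the copy to rows at or above the given timestamp value:

```shell
# Hypothetical incremental copy: only rows with _dp_ingesttimestamp >= 1698624000
# are copied from the source cluster (src-master) to the destination (dst-master).
kudu table copy src-master:7051 default.sample dst-master:7051 \
  -dst_table=default.sample_copy \
  -write_type=upsert \
  -num_threads=3 \
  -predicates='["AND", [">=", "_dp_ingesttimestamp", 1698624000]]'
```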
Spark Backup Utility:
Kudu supports both full and incremental table backups via a job implemented using Apache Spark. Additionally, it supports restoring tables from full and incremental backups via a restore job implemented using Apache Spark.
Example:
Backup:
spark-submit \
--driver-cores 1 \
--driver-memory 1G \
--executor-cores 3 \
--executor-memory 1G \
--master yarn \
--name KuduBackup_Job1 \
--class org.apache.kudu.backup.KuduBackup /opt/cloudera/parcels/CDH/lib/kudu/kudu-backup2_2.11.jar \
--kuduMasterAddresses xxx.xx.xxx.73 \
--rootPath hdfs:///user/root \
default.sample
Restore:
spark-submit \
--driver-cores 1 \
--driver-memory 1G \
--executor-cores 2 \
--executor-memory 1G \
--master yarn \
--name KuduRestore_Job1 \
--class org.apache.kudu.backup.KuduRestore /opt/cloudera/parcels/CDH/jars/kudu-backup2_2.11-1.15.0.7.1.7.0-551.jar \
--kuduMasterAddresses xxx.xx.xxx.136 \
--rootPath hdfs:///user/root \
--createTables false \
--newDatabaseName spark_copy default.sample
KUDU Spark Backup-Restore Incremental Scenarios

Scenario | Backup | Restore
Inserting new rows | New partition is created for the incremental rows | Incremental data is loaded
Updating a row value | New partition is created for the incremental/updated rows | Incremental/updated data is loaded
Changing a column data type | Not supported in Kudu | Not supported in Kudu
Adding a new column | New partition is created for the incremental/updated rows; no full load required | Add the column first, then run the restore
Deleting a column | New partition is created for the incremental/updated rows; no full load required | Delete the column first, then run the restore
Deleting a row | New partition is created for the deleted rows; no full load required | Rows are deleted by the restore utility
Command Line Copy Incremental Scenarios

Scenario | Notes
Inserting new rows | As long as rows carry a timestamp column, incremental data can be copied and loaded
Updating a row value | Incremental copy works only if the update also refreshes the timestamp value
Changing a column data type | Not supported in Kudu
Adding a new column | Add the column on the destination first, else the incremental copy will fail
Deleting a column | Delete the column on the destination first, then run the incremental table copy
Deleting a row | Rows must be deleted on the destination as soon as they are deleted in the source table