Engineering Blogs

Announcements
Now Live: Explore expert insights and technical deep dives on the new Cloudera Community BlogsRead the Announcement

Cloudera Lakehouse Optimizer - Refining Optimizations through Apache Iceberg Table Properties

avatar
Cloudera Employee

In the world of Apache Iceberg, managing table maintenance efficiently is crucial for optimal performance and cost control. Cloudera Lakehouse Optimizer automatically manages your tables when you associate them with maintenance policies. These policies represent your maintenance intentions—defining which tables need attention and how they should be optimized.

The system includes ready-to-use template policies shipped with the product, such as the ClouderaAdaptive policy. Each policy has two components: Constants (.json) and Script (.jexl), along with two execution models:

  1. Schedule Based Policies:

    Run maintenance during defined time windows using familiar cron expressions; must be used for predictable workloads and planned maintenance windows. A schedule-based policy must have a valid cron expression defined.

  2. Event Based Policies:

    Automatically trigger maintenance when actual data changes occur. Event-based policies should have a cron defined as ‘event’ rather than a standard time-based cron expression.

Policy Interaction in Cloudera Lakehouse Optimizer.Policy Interaction in Cloudera Lakehouse Optimizer.

While traditional approaches often require complex configurations and external systems, there is a powerful yet elegant solution hiding in plain sight: Iceberg Table Properties. This feature allows you to configure sophisticated maintenance policies directly at the table level, providing unprecedented flexibility and control.

We'll explore two approaches to Iceberg table maintenance:

  1. Policy Association: How to link your tables to maintenance policies using simple table properties.
  2. Custom Configuration: How to fine-tune maintenance operations with granular table-level settings.

Imagine being able to configure your table's maintenance behavior as easily as setting any other table property. With Iceberg's flexible property system, this becomes reality. Whether you need aggressive compaction for high-write tables or conservative snapshot retention for audit tables, table properties provide the perfect mechanism for table-specific customization.

Cloudera Lakehouse Optimizer (CLO) gives the highest precedence to table properties for any maintenance job. This is how the ClouderaAdaptive policy (JEXL part) is scripted. When a user defines a property at the table level and references it in the script, those properties are accounted for during the evaluation of maintenance jobs.


Policy Association: Linking Tables to Maintenance Strategies

The dlm_policies Property: Your Gateway to Automated Maintenance

The foundation of property-driven maintenance lies in the dlm_policies table property. This special property creates the bridge between your tables and the policies that govern their maintenance.

Setting Up Policy Associations

The association is defined through an Iceberg table property aptly named dlm_policies by default. Its value must be a policy name or a comma-delimited sequence of policy names for tables requiring multiple maintenance strategies.

For instance, to subscribe a products table to the ClouderaAdaptive policy, you would use the following:

ALTER TABLE products SET TBLPROPERTIES('dlm_policies'='ClouderaAdaptive');

Multi-Policy Support

If you need comprehensive maintenance coverage, you can associate multiple policies with a single table.

Note: Ensure that policies do not conflict (e.g., do not have two policies that both perform compaction). You should leverage this only if you want different schedules for different actions. For example, if you want compaction to run via an event schedule, but expire snapshots on a scheduled basis and delete orphan files monthly, you can associate three separate policies to define these actions.

ALTER TABLE high_volume_orders SET TBLPROPERTIES('dlm_policies'='ClouderaAdaptive,DataCompaction,SnapshotCleanup');

Customizing the Property Name

The property name itself can be configured through the dlm.iceberg.policy.property configuration variable in Cloudera Manager. This allows organizations to align property naming with their existing conventions.

Example: If dlm.iceberg.policy.property is set to maintenance_policies instead of the default:

-- If configured to use 'maintenance_policies' instead
ALTER TABLE customer_data SET TBLPROPERTIES('maintenance_policies'='ClouderaAdaptive');

Activating Policy Associations

Once you've configured your table properties, simply refresh the policy associations for the system to recognize these changes. This is done through a specialized endpoint call to Cloudera Lakehouse Optimizer:

POST http://{host}/clo/api/v1/policies/IcebergPropertiesNamespace/tables/{namespace}.*/evaluation

Where:

  • {host} is your service host.
  • {namespace} is the Iceberg namespace containing your tables.

This endpoint refreshes the associations for all tables in the specified namespace, ensuring your property-based configurations are active.


Custom Table Properties: Fine-Grained Control

Beyond policy association, Iceberg table properties offer granular control over individual maintenance operations. This approach provides maximum flexibility, allowing you to customize every aspect of table maintenance (expire snapshots, compaction, etc).

Note: These properties must match the properties referenced in your JEXL scripts to be effective.

Small Files Management (Compaction)

If we want our compaction jobs on a table to have 512 MB as the target file size, we can set the table properties. Other properties can be configured similarly as shown below.

-- Set target file size for compacted files (512MB)
ALTER TABLE sales_data SET TBLPROPERTIES('write.target-file-size-bytes'='536870912');

-- Control compaction trigger threshold (compact only if 15% file reduction is possible)
ALTER TABLE sales_data SET TBLPROPERTIES('dlm.rewriteDataFiles.filesCountDrop'='0.15');

-- Set minimum input files required before compaction
ALTER TABLE sales_data SET TBLPROPERTIES('write.bin-packing.min-file-count'='5');

Advanced Compaction Features:

There are other properties that help improve the compaction process for large, frequently updated tables. For example, to ensure compaction jobs commit transactions as they run, set partialProgressEnabled to true. You can also achieve partition pruning by providing a 'where' clause so your most used partitions are kept in an optimal state. Clustering can be achieved using Z-Order.

-- Enable partial progress for large compaction jobs
ALTER TABLE transaction_history SET TBLPROPERTIES('dlm.rewriteDataFiles.partialProgressEnabled'='true');

-- Set maximum commits for partial progress
ALTER TABLE transaction_history SET TBLPROPERTIES('dlm.rewriteDataFiles.partialProgressMaxCommits'='10');

-- Configure Z-Order optimization for specific columns based on your query pattern
ALTER TABLE customer_events SET TBLPROPERTIES('dlm.rewriteDataFiles.zOrderColumns'='customer_id,event_date,region');

-- Set custom WHERE clause for selective compaction
ALTER TABLE audit_logs SET TBLPROPERTIES('dlm.rewriteDataFiles.where'='event_date >= current_date - interval 30 days');

Custom Sorting Configuration:

For tables requiring specific sort orders, you can configure custom sorting with the following:

-- Configure multi-column sorting: (column, ascending, nulls_first)
ALTER TABLE product_catalog SET TBLPROPERTIES('dlm.rewriteDataFiles.sort'='(category true false),(price false true),(product_id true false)');

Manifest File Management

Manifest files play a crucial role in query planning performance. Detailed configuration can be managed via the following properties:

-- Set maximum manifest file count threshold
ALTER TABLE inventory SET TBLPROPERTIES('dlm.rewriteManifest.fileCountMax'='100');

-- Configure target manifest file size (8MB)
ALTER TABLE inventory SET TBLPROPERTIES('write.manifest.target-size-bytes'='8388608');

-- Set small manifest file ratio threshold (50%)
ALTER TABLE inventory SET TBLPROPERTIES('dlm.rewriteManifest.smallFileRatioMax'='0.5');

Snapshot Management

Control how long table snapshots are retained and when they should be expired:

-- Retain minimum number of snapshots
ALTER TABLE financial_reports SET TBLPROPERTIES('dlm.expireSnapshot.retainLast'='50');

-- Set maximum duration between oldest and newest snapshots (5 days in milliseconds)
ALTER TABLE financial_reports SET TBLPROPERTIES('dlm.expireSnapshot.snapshotsDurationDeltaMax'='432000000');

Orphan File Cleanup

Configure automatic cleanup of files no longer referenced by any snapshots. Currently, we do not specify default table properties for this in ClouderaAdaptive, but you can script a new policy to include it.

ALTER TABLE financial_reports SET TBLPROPERTIES('dlm.deleteOrphanFiles.olderThan'='432000000');

Position Delete Files Maintenance

For tables with frequent deletes, configure position delete file rewriting. These properties don't ship with ClouderaAdaptive by default but can be scripted into a new policy.

-- Configure target file size for position delete files (64MB default)
ALTER TABLE high_delete_table SET TBLPROPERTIES('dlm.rewritePositionDelete.targetFileSize'='67108864');

-- Set minimum input files threshold before rewriting (default: 5)
ALTER TABLE high_delete_table SET TBLPROPERTIES('dlm.rewritePositionDelete.minInputFiles'='6');

-- Control concurrent rewrite operations (default: 5)
ALTER TABLE high_delete_table SET TBLPROPERTIES('dlm.rewritePositionDelete.maxConcurrentGroupRewrite'='10');

Practical Configuration Examples

1. High-Frequency Transaction Table

-- Create table optimized for high-frequency transactions
CREATE TABLE transaction_stream (
  transaction_id bigint,
  customer_id string,
  amount decimal(10,2),
  transaction_date timestamp,
  region string
) STORED BY ICEBERG;

-- Configure for aggressive maintenance
ALTER TABLE transaction_stream SET TBLPROPERTIES(
  'dlm_policies'='ClouderaAdaptive',
  'write.target-file-size-bytes'='268435456', -- 256MB for faster compaction
  'dlm.rewriteDataFiles.filesCountDrop'='0.15', -- More aggressive compaction (15% threshold)
  'write.bin-packing.min-file-count'='10', -- Lower threshold for small file count
  'dlm.rewriteDataFiles.zOrderColumns'='customer_id,transaction_date',
  'dlm.expireSnapshot.retainLast'='10' -- Keep fewer snapshots due to high volume
);

2. Analytical Reporting Table

-- Create table optimized for analytical queries
CREATE TABLE monthly_reports (
  report_id string,
  report_month date,
  department string,
  metrics map<string,double>
) STORED BY ICEBERG;

-- Configure for analytical workloads
ALTER TABLE monthly_reports SET TBLPROPERTIES(
  'dlm_policies'='ClouderaAdaptive',
  'write.target-file-size-bytes'='1073741824', -- 1GB files for analytical queries
  'dlm.rewriteDataFiles.filesCountDrop'='0.3', -- Less aggressive compaction
  'dlm.rewriteDataFiles.zOrderColumns'='department,report_month',
  'dlm.expireSnapshot.retainLast'='100', -- Keep more snapshots for historical analysis
  'dlm.rewriteManifest.fileCountMax'='50' -- Lower manifest threshold for fewer files
);

3. Audit and Compliance Table

-- Create table with strict retention requirements
CREATE TABLE audit_trail (
  audit_id string,
  user_id string,
  action string,
  timestamp timestamp,
  details string
) STORED BY ICEBERG;

-- Configure for compliance and audit requirements
ALTER TABLE audit_trail SET TBLPROPERTIES(
  'dlm_policies'='ClouderaAdaptive',
  'dlm.expireSnapshot.retainLast'='200', -- Retain many snapshots for audit
  'dlm.expireSnapshot.snapshotsDurationDeltaMax'='7776000000', -- 90 days retention
  'dlm.rewriteDataFiles.where'='timestamp >= current_date - interval 7 years', -- Only compact recent data
  'write.target-file-size-bytes'='134217728' -- Smaller files (128MB) for granular access
);

Configuration Best Practices

  • Start Conservative, Tune Gradually

    Begin with moderate settings and adjust based on observed performance and storage patterns. If certain table properties are working fine (realized from your query performance), do not modify them.

  • Consider Cost vs. Performance Trade-offs

    More aggressive maintenance provides better query performance but consumes more compute resources.

The following table summarizes common configuration properties and their trade-offs:

Maintenance PropertyConfiguration KeyOptimize for PerformanceOptimize for CostDefault Value
Rewrite Data Files
Target File Sizewrite.target-file-size-bytes
or
dlm.rewriteDataFiles.targetFileSize
256-512 MB
Larger files = fewer files to scan, faster queries
64-128 MB
Smaller files = less data rewritten per compaction
512 MB
Min Input Fileswrite.bin-packing.min-file-count
or
dlm.rewriteDataFiles.minInputFiles
3-5 files
Compact more frequently
10-20 files
Wait for more small files
5
Files Count Drop Ratiodlm.rewriteDataFiles.filesCountDrop0.10 (10%)
Aggressive trigger
0.25-0.30 (25-30%)
Conservative trigger
0.15
Max Concurrent Rewritesdlm.rewriteDataFiles.maxConcurrentRewriteFileGroups50 groups
Faster completion but uses more resources
5-10
Slower but reduced resource consumption
5
Partial Progressdlm.rewriteDataFiles.partialProgressEnabledtrue
Commit incrementally for large tables
false
Single atomic commit
false
Snapshot Management
Snapshot RetentionexpireOlderThan7-14 days
Keep more snapshots for time-travel queries
2-3 days
Minimal retention saves storage
5 days
Retain Last N SnapshotsretainLast10-20 snapshots
More historical versions available
2-3 snapshots
Keep only recent snapshots
5
Clean Expired FilescleanExpiredFilestrue
Remove orphans immediately for optimal performance
true (same)
Storage cleanup is universal
true
Manifest File Management
Manifest Target Sizecommit.manifest.target-size-bytes
or
dlm.rewriteManifest.targetFileSize
16-32 MB
Larger manifests = fewer files to read at planning
4-8 MB
Smaller manifests = less data rewritten
8 MB
Use Cachingdlm.rewriteManifest.useCachingtrue
Cache manifest data for faster rewrites
true (same)
Caching benefits both
true
Small Manifest Ratiodlm.rewriteManifest.smallManifestRatioMax0.3-0.4
Rewrite when 30-40% are small manifests
0.6-0.7
Wait until more manifests need consolidation
0.5
Orphan File Management
Older Than Thresholddlm.deleteOrphanFiles.olderThan1-2 days
Aggressive cleanup if you run delete orphan file operations monthly
14-30 days
if you run them frequently
3 days
Policy Schedule Considerations
SchedulecronEvery day
Frequent maintenance = better query performance
Weekly or monthly (weekly)
Less frequent = lower compute costs
There is no default kept for ClouderaAdaptive. We can only run adhoc evaluation of this policy.

 

  • Partition-Aware Configuration

For partitioned tables, consider partition-specific maintenance:

-- Configure maintenance to focus on recent partitions
ALTER TABLE partitioned_events SET TBLPROPERTIES('dlm.rewriteDataFiles.where'='event_date >= current_date - interval 30 days');

The Future of Table-Driven Maintenance

Table properties represent a paradigm shift toward more decentralized, flexible data management. By embedding maintenance configuration directly in table metadata, we achieve:

  • Portability: Configuration travels with the table.
  • Flexibility: Each table can have unique maintenance requirements.
  • Simplicity: Standard SQL operations for configuration management.
  • Auditability: Configuration changes are tracked in table history.

The combination of policy association and custom property configuration provides the best of both worlds: the convenience of pre-configured maintenance strategies and the flexibility to customize every aspect of table maintenance behavior. As your data grows and requirements evolve, table properties ensure your maintenance strategy can adapt without requiring infrastructure changes or complex reconfigurations.

Ready to implement property-driven maintenance for your Iceberg tables? Start with policy association using the dlm_policies property, then gradually introduce custom configurations as your requirements become more sophisticated. Your future self will thank you for the flexibility and control these simple properties provide.

Reviewed by - Adam Benlemlih, Dipankar Mazumdar, Henri Beistro,Vipin Rai.