Cloudera is happy to announce the availability of parcels and packages for Kudu 1.0.0.
After approximately a year of beta releases, Apache Kudu has reached version 1.0.0. This version number signifies that the development team feels that Kudu is stable enough for usage in production environments.This is the first non-beta release of the Apache Kudu project. (Although because Kudu is not currently integrated into CDH, it is not yet an officially supported CDH component.)
Kudu 1.0.0 delivers a number of new features, bug fixes, and optimizations. We are also releasing a refresh of the Impala Kudu parcel.
In addition, you can read more about Kudu 1.0.0 in the vision blog post we just published.
New Features in Kudu 1.0.0
Removal of multiversion concurrency control (MVCC) history is now supported. This is known as tablet history GC. This allows Kudu to reclaim disk space, where previously Kudu would keep a full history of all changes made to a given table since the beginning of time. Previously, the only way to reclaim disk space was to drop a table.
Kudu will still keep historical data, and the amount of history retained is controlled by setting the configuration flag --tablet_history_max_age_sec, which defaults to 15 minutes (expressed in seconds). The timestamp represented by the current time minus tablet_history_max_age_sec is known as the ancient history mark (AHM). When a compaction or flush occurs, Kudu will remove the history of changes made prior to the ancient history mark. This only affects historical data; currently-visible data will not be removed. A specialized maintenance manager background task to remove existing "cold"historical data that is not in a row affected by the normal compaction process will be added in a future release.
Most of Kudu’s command line tools have been consolidated under a new top-level kudu tool. This reduces the number of large binaries distributed with Kudu and also includes much-improved help output.
The Kudu Flume Sink now supports processing events containing Avro-encoded records, using the newAvroKuduOperationsProducer.
Administrative tools including kudu cluster ksck now support running against multi-master Kudu clusters.
The output of the ksck tool is now colorized and much easier to read.
The C++ client API now supports writing data in AUTO_FLUSH_BACKGROUND mode. This can provide higher throughput for ingest workloads.
The performance of comparison predicates on dictionary-encoded columns has been substantially optimized. Users are encouraged to use dictionary encoding on any string or binary columns with low cardinality, especially if these columns will be filtered with predicates.
The Java client is now able to prune partitions from scanners based on the provided predicates. For example, an equality predicate on a hash-partitioned column will now only access those tablets that could possibly contain matching data. This is expected to improve performance for the Spark integration as well as applications using the Java client API.
The performance of compaction selection in the tablet server has been substantially improved. This can increase the efficiency of the background maintenance threads and improve overall throughput of heavy write workloads.
The policy by which the tablet server retains write-ahead log (WAL) files has been improved so that it takes into account other replicas of the tablet. This should help mitigate the spurious eviction of tablet replicas on machines that temporarily lag behind the other replicas.
Wire protocol compatibility
Kudu 1.0.0 maintains client-server wire-compatibility with previous releases. Applications using the Kudu client libraries may be upgraded either before, at the same time, or after the Kudu servers.
Kudu 1.0.0 does not maintain server-server wire compatibility with previous releases. Therefore, rolling upgrades between earlier versions of Kudu and Kudu 1.0.0 are not supported.
Incompatible Changes in Kudu 1.0.0 Command line tools
The kudu-pbc-dump tool has been removed. The same functionality is now implemented as kudu pbc dump.
The kudu-ksck tool has been removed. The same functionality is now implemented as kudu cluster ksck.
The cfile-dump tool has been removed. The same functionality is now implemented as kudu fs cfile dump.
The log-dump tool has been removed. The same functionality is now implemented as kudu wal dumpand kudu local_replica dump wals.
The kudu-admin tool has been removed. The same functionality is now implemented within kudu tableand kudu tablet.
The kudu-fs_dump tool has been removed. The same functionality is now implemented as kudu fs dump.
The kudu-ts-cli tool has been removed. The same functionality is now implemented within kudu master, kudu remote_replica, and kudu tserver.
The kudu-fs_list tool has been removed and some similar useful functionality has been moved underkudu local_replica.
Some configuration flags are now marked as "unsafe" and "experimental". Such flags are disallowed by default. Users may access these flags by enabling the additional flags --unlock_unsafe_flags and --unlock_experimental_flags. Usage of such flags is not recommended, as the flags may be removed or modified with no deprecation period and without notice in future Kudu releases.
Client APIs (C++/Java/Python)
The TIMESTAMP column type has been renamed to UNIXTIME_MICROS in order to reduce confusion between Kudu’s timestamp support and the timestamps supported by other systems such as Apache Hive and Apache Impala (incubating). Existing tables will automatically be updated to use the new name for the type.
Clients upgrading to the new client libraries must move to the new name for the type. Clients using old client libraries will continue to operate using the old type name, even when connected to clusters that have been upgraded. Similarly, if clients are upgraded before servers, existing timestamp columns will be available using the new type name.
KuduSession methods in the C++ library are no longer advertised as thread-safe to have one set of semantics for both C++ and Java Kudu client libraries.
The KuduScanToken::TabletServers method in the C++ library has been removed. The same information can now be found in the KuduScanToken::tablet method.
Apache Flume Integration
The KuduEventProducer interface used to process Flume events into Kudu operations for the Kudu Flume Sink has changed, and has been renamed KuduOperationsProducer. The existing KuduEventProducers have been updated for the new interface, and have been renamed similarly.
Known Issues and Limitations of Kudu 1.0.0 Schema and Usage Limitations
Kudu is primarily designed for analytic use cases. You are likely to encounter issues if a single row contains multiple kilobytes of data.
The columns which make up the primary key must be listed first in the schema.
Key columns cannot be altered. You must drop and recreate a table to change its keys.
Key columns must not be null.
Columns with DOUBLE, FLOAT, or BOOL types are not allowed as part of a primary key definition.
Type and nullability of existing columns cannot be changed by altering the table.
A table's primary key cannot be changed.
Dropping a column does not immediately reclaim space. Compaction must run first. There is no way to run compaction manually, but dropping the table will reclaim the space immediately.
Tables must be manually pre-split into tablets using simple or compound primary keys. Automatic splitting is not yet possible. Range partitions may be added or dropped after a table has been created. See Schema Design for more information.
Data in existing tables cannot currently be automatically repartitioned. As a workaround, create a new table with the new partitioning and insert the contents of the old table.
Replication and Backup Limitations
Kudu does not currently include any built-in features for backup and restore. Users are encouraged to use tools such as Spark or Impala to export or import tables as necessary.
To use Kudu with Impala, you must install a special release of Impala called Impala_Kudu. Obtaining and installing a compatible Impala release is detailed in Installing and Using Apache Impala (incubating) With Apache Kudu.
To use Impala_Kudu alongside an existing Impala instance, you must install using parcels.
Updates, inserts, and deletes via Impala are non-transactional. If a query fails part of the way through, its partial effects will not be rolled back.
All queries will be distributed across all Impala hosts which host a replica of the target table(s), even if a predicate on a primary key could correctly restrict the query to a single tablet. This limits the maximum concurrency of short queries made via Impala.
No TIMESTAMP and DECIMAL type support. (The underlying Kudu type formerly known as TIMESTAMP has been renamed to UNIXTIME_MICROS; currently, there is no Impala-compatible TIMESTAMP type.)
The maximum parallelism of a single query is limited to the number of tablets in a table. For good analytic performance, aim for 10 or more tablets per host or use large tables.
Impala is only able to push down predicates involving =, <=, >=, or BETWEEN comparisons between any column and a literal value, and < and > for integer columns only. For example, for a table with an integer key ts, and a string key name, the predicate WHERE ts >= 12345 will convert into an efficient range scan, whereas where name > 'lipcon' will currently fetch all data from the table and evaluate the predicate within Impala.
Authentication and authorization features are not implemented.
Data encryption is not built in. Kudu has been reported to run correctly on systems using local block device encryption (e.g. dmcrypt).
Client and API Limitations
ALTER TABLE is not yet fully supported via the client APIs. More ALTER TABLE operations will become available in future releases.
Other Known Issues
The following are known bugs and issues with the current release of Kudu. They will be addressed in later releases. Note that this list is not exhaustive, and is meant to communicate only the most important known issues.
If the Kudu master is configured with the -log_fsync_all option, tablet servers and clients will experience frequent timeouts, and the cluster may become unusable.
If a tablet server has a very large number of tablets, it may take several minutes to start up. It is recommended to limit the number of tablets per server to 100 or fewer. Consider this limitation when pre-splitting your tables. If you notice slow start-up times, you can monitor the number of tablets per server in the web UI.
Due to a known bug in Linux kernels prior to 3.8, running Kudu on ext4 mount points may cause a subsequent fsck to fail with errors such as Logical start <N> does not match logical start <M> at next level . These errors are repairable using fsck -y, but may impact server restart time.
This affects RHEL/CentOS 6.8 and below. A fix is planned for RHEL/CentOS 6.9. RHEL 7.0 and higher are not affected. Ubuntu 14.04 and later are not affected. SLES 12 and later are not affected.
Issues Fixed in Kudu 1.0.0
For a complete list of new features, changes, bug fixes, and known issues, see the Kudu 1.0.0 Release Notes .
As always, your feedback is appreciated. For general Kudu questions, visit the community page. If you have any questions related to the Kudu packages provided by Cloudera, including installation or configuration using Cloudera Manager, visit the Cloudera Community Forum.