What's New @ Cloudera

araujo · ‎10-27-2024

Cloudera’s Data In Motion Team is pleased to announce the release of the Cloudera Streaming Analytics - Kubernetes Operator 1.1, an integral component of Cloudera Streaming - Kubernetes Operator. This release includes improvements to SQL Stream Builder as well as updates to Apache Flink 1.19.1.

Use Cases

Event-Driven Applications: Stateful applications that ingest events from one (or more) event streams and react to incoming events by triggering computations, state updates, or external actions.

Apache Flink excels in handling the concept of time and state for these applications and can scale to manage very large data volumes (up to several terabytes). It has a rich set of APIs, ranging from low-level controls to high-level functionality, like Flink SQL, enabling developers to choose the most suitable options for the implementation of advanced business logic.

However, Apache Flink’s outstanding feature for event-driven applications is its support for savepoints. A savepoint is a consistent state image that can be used as a starting point for compatible applications. Given a savepoint, an application can be updated or adapt its scale, or multiple versions of an application can be started for A/B testing.

Examples:
- Fraud detection
- Anomaly detection
- Rule-based alerting
- Business process monitoring
- Web application (social network)
Data Analytics Applications: With a sophisticated stream processing engine, analytics can also be performed in real-time. Streaming queries or applications ingest real-time event streams and continuously produce and update results as events are consumed. The results are written to an external database or maintained as internal state. A dashboard application can read the latest results from the external database or directly query the internal state of the application.

Apache Flink supports streaming as well as batch analytical applications.

Examples:
- Quality monitoring of telco networks
- Analysis of product updates & experiment evaluation in mobile applications
- Ad-hoc analysis of live data in consumer technology
- Large-scale graph analysis
Data Pipeline Applications: Streaming data pipelines serve a similar purpose as Extract-transform-load (ETL) jobs. They transform and enrich data and can move it from one storage system to another. However, they operate in a continuous streaming mode instead of being periodically triggered. Hence, they can read records from sources that continuously produce data and move it with low latency to their destination.

Examples:
- Real-time search index building in e-commerce
- Continuous ETL in e-commerce

Release Highlights

Rebase to Apache Flink 1.19.1: Streaming analytics deployments, including SQL Stream Builder, now support Apache Flink 1.19.1, including the Apache Flink improvements below. For more information on improvements and deprecations, please check the Apache Flink 1.19 release announcement.
- Custom Parallelism for Table/SQL Sources: The DataGen connector now supports setting of custom parallelism for performance tuning via the scan.parallelism option. Support for other connectors will come in future releases.
- Configure Different State Time to Live (TTLs) Using SQL Hint: Users have now a more flexible way to specify custom time-to-live (TTL) values for state of regular joins and group aggregations directly within their queries by utilizing the STATE_TTL hint.
- Named Parameters: Named parameters can now be used when calling a function or stored procedure in Flink SQL.
- Support for SESSION Window Table-Valued Functions (TVFs) in Streaming Mode: Users can now use SESSION Window table-valued functions (TVF) in streaming mode.
- Support for Changelog Inputs for Window TVF Aggregation: Window aggregation operators can now handle changelog streams (e.g., Change Data Capture [CDC] data sources, etc.).
- New UDF Type: AsyncScalarFunction: The new AsyncScalarFunction is a user-defined asynchronous ScalarFunction that allows for issuing concurrent function calls asynchronously.
- MiniBatch Optimization for Regular Joins: The new mini-batch optimization can be used for regular join to reduce intermediate results, especially in cascading join scenarios.
- Dynamic Source Parallelism Inference for Batch Jobs: Allows source connectors to dynamically infer the parallelism based on the actual amount of data to consume.
- Standard Yet Another Markup Language (YAML) for Apache Flink Configuration: Apache Flink has officially introduced full support for the standard YAML 1.2 syntax in the configuration file.
- Profiling JobManager/TaskManager on Apache Flink Web: Support for triggering profiling at the JobManager/TaskManager level.
- New Config Options for Administrator Java Virtual Machine (JVM) Options: A set of administrator JVM options are available to prepend the user-set JVM options with default values for platform-wide JVM tuning.
- Using Larger Checkpointing Interval When Source is Processing Backlog: Users can set the execution.checkpointing.interval-during-backlog to use a larger checkpoint interval to enhance the throughput while the job is processing backlog if the source is backlog-aware.
- CheckpointsCleaner Clean Individual Checkpoint States in Parallel: Now, when disposing of no longer needed checkpoints, every state handle/state file will be disposed of in parallel for better performance.
- Trigger Checkpoints through Command Line Client: The command line interface supports triggering a checkpoint manually.
- New Interfaces to SinkV2 That Are Consistent with Source API.
- New Committer Metrics to Track the Status of Committables.

Rebase to Apache Flink Kubernetes Operator 1.9.0: The highlights of this update are listed below. For more details on improvements please check the Apache Flink Kubernetes Operator 1.9.0 release announcement.
- Operator Optimizations: Reduce overall memory usage of the operator.
- High Availability: Fixed but causing unpredictable behaviour when changing watched namespaces in a HA setup.
- Autoscaler CPU And Memory Quotas: The user can now set CPU and memory quotas that the autoscaler will respect and won’t scale beyond that.
- Autoscaler Improvements: Several improvements for autoscaling are included in this release.

Support for Python User-Defined Functions (UDFs) in SQL Stream Builder: The current Javascript UDFs in SQL Stream Builder will not work in Java 17 and later versions due to the deprecation and removal of the Nashorn engine from the Java Development Kit (JDK). The addition of Python UDFs to SQL Stream Builder will allow customers to use Python to create new UDFs that will continue to be supported on future JDKs. Javascript UDFs are being deprecated in this release and will be removed in a future release. Cloudera recommends that customers start using Python UDFs for all new development and start migrating their JavaScript UDFs to Python UDFs to prepare for future upgrades.
Session cluster management in SQL Stream Builder: SQL Stream Builder now displays information about existing session clusters and allows for those clusters to be terminated from the UI.
Connectors and formats included by default in the SQL Stream Builder image: The following connectors and formats are now included in the SQL Stream Builder's Docker image by default, enabling users to use them without having to build their own customized images.
- Connectors:
  - Kafka
  - JDBC
  - CDC (MySQL, Oracle, Postgres, Db2, SqlServer)
  - Amazon S3
  - Azure Blob Storage
  - Google Cloud Storage
- Formats:
  - JSON
  - Avro
  - ORC
  - Parquet

Global logging configuration for Configuring logs for all SSB jobs: A new global settings view enables default logging configurations to be set by the administrator. These settings will be applied to all streaming jobs by default and can be overridden at the job level. This ensures that a consistent logging standard can be applied by default for all users and developers.

Please see the Release Notes for the complete list of fixes and improvements.

Getting to the new release

To upgrade to Cloudera Streaming Analytics Operator 1.1, check out this upgrade guide. Please note, if you are installing the operator for the first time use this installation overview.

Cloudera Community

What's New @ Cloudera

[RELEASED] Cloudera Streaming Analytics - Kubernetes Operator 1.1

Use Cases

Release Highlights

Getting to the new release

Public Resources