08-10-2022
01:55 PM
You can easily get started with SQL Stream Builder, powered by Apache Flink, and Apache Pulsar by using the Cloudera Stream Processing Community Edition (CSP CE) and the StreamNative Pulsar SQL connector. In this tutorial we will go over the steps needed to integrate the two.

Make sure you have a working CSP CE deployment by following the documentation: https://docs.cloudera.com/csp-ce/latest/index.html

Build the StreamNative Pulsar connector for Apache Flink

The connector source lives at https://github.com/streamnative/pulsar-flink/. Clone the repository to your machine and check out the tag for Flink 1.14:

git clone https://github.com/streamnative/pulsar-flink.git
cd pulsar-flink
git checkout tags/release-1.14.3.4

Build the connector with Maven:

mvn clean install -DskipTests

Add the connector in SQL Stream Builder (SSB)

1. Open SSB and click Connectors in the left menu.
2. Click Register Connector in the top right corner.
3. Name the connector "pulsar".
4. Select CSV, JSON, AVRO, and RAW as the supported formats.
5. Add the pulsar-flink-sql-connector_2.11-1.14.3.4.jar file. Depending on when you build it, the version in the file name could be slightly different. The jar is found in the cloned folder under /pulsar-flink/pulsar-flink-sql-connector/target.

You could also register the connector with its required properties, but in this tutorial we will skip that part.

Restart the SSB container to add the new jar to the classpath:

docker restart $(docker ps -a --format '{{.Names}}' --filter "name=ssb-sse.(\d)")

Create a table that uses Pulsar

To get started we will create a very simple table that reads RAW data as a single large string. Click Console, add the DDL below to the window, and click Execute. You will need to update the topic, admin-url, and service-url; in the admin-url and service-url, PULSAR_HOST should be the IP address of the machine where the Pulsar instance/container runs. It is also very important that 'generic' = 'true' is part of the properties, to ensure that SSB provides the correct table/topic names when executing a query. For more information see this ticket: https://github.com/streamnative/pulsar-flink/issues/528

CREATE TABLE pulsar_test (
`city` STRING
) WITH (
'connector' = 'pulsar',
'topic' = 'topic82547611',
'value.format' = 'raw',
'service-url' = 'pulsar://PULSAR_HOST:6650',
'admin-url' = 'http://PULSAR_HOST:8080',
'scan.startup.mode' = 'earliest',
'generic' = 'true'
);
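Since the connector was registered with the JSON format as well, you can declare a table with a typed schema in the same way. The sketch below is illustrative and not part of the original tutorial: the topic name and the two columns are hypothetical, and the messages on that topic are assumed to be JSON objects with matching fields.

CREATE TABLE pulsar_json_example (
`city` STRING,
`population` BIGINT
) WITH (
'connector' = 'pulsar',
'topic' = 'cities-json',  -- hypothetical topic carrying JSON messages
'value.format' = 'json',
'service-url' = 'pulsar://PULSAR_HOST:6650',
'admin-url' = 'http://PULSAR_HOST:8080',
'scan.startup.mode' = 'earliest',
'generic' = 'true'
);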
Start a Pulsar container

Check out the latest Pulsar documentation for more details on running a standalone Pulsar container: https://pulsar.apache.org/docs/getting-started-docker/

Run the docker command to get a standalone Pulsar deployment:

docker run -it -p 6650:6650 -p 8080:8080 --mount source=pulsardata,target=/pulsar/data --mount source=pulsarconf,target=/pulsar/conf apachepulsar/pulsar:2.10.1 bin/pulsar standalone

Push messages to Pulsar and start a query in SSB

It's possible to use SSB to write data to Pulsar with an INSERT SQL statement by running the command below in the SSB Console and clicking Execute. Tip: parts of the console code can be executed by highlighting just the code you wish to execute.

INSERT INTO pulsar_test
VALUES ('data 1'),('data 2'),('data 3');

Flink and SSB come with the ability to generate data with Faker. This table will randomly generate messages that we can use to insert into our Pulsar table:

CREATE TABLE fake_data (
city STRING
) WITH (
'connector' = 'faker',
'rows-per-second' = '1',
'fields.city.expression' = '#{Address.city}'
);
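Faker can generate much more than city names. As an illustrative aside (not part of the original tutorial), the table below adds a second, hypothetical temperature field using the Faker expression syntax supported by the flink-faker connector; the exact set of available expressions depends on the bundled Faker version. Note the doubled single quotes needed to escape arguments inside the SQL string.

CREATE TABLE fake_data_rich (
city STRING,
temperature INT
) WITH (
'connector' = 'faker',
'rows-per-second' = '1',
'fields.city.expression' = '#{Address.city}',
'fields.temperature.expression' = '#{number.numberBetween ''0'',''40''}'  -- hypothetical extra field
);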
The data that is generated can be inserted into our Pulsar table just like a normal SQL query:

INSERT INTO pulsar_test
SELECT * FROM fake_data;

Query data from Pulsar

The data can also be queried back out the same way, with a sample of the results displayed in the UI. Tip: when you run a query, only a small sample of the data is shown in the SSB UI by default. To have all data returned in the SSB UI console, click Job Settings -> Sample Behavior -> Sample All Messages.

Select the data from the test table:

SELECT * FROM pulsar_test;

PS: it's also possible to publish messages with the Pulsar client and read them back out. To see how to publish with the Pulsar client, see https://pulsar.apache.org/docs/getting-started-docker/#produce-a-message
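As one more illustrative step beyond the tutorial, the same table also supports continuous aggregations. The query below counts how many times each city has been seen so far and keeps updating as new messages arrive in Pulsar (it uses only the city column defined in pulsar_test above):

SELECT city, COUNT(*) AS city_count
FROM pulsar_test
GROUP BY city;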
06-23-2022
01:05 PM
The Community Edition of Cloudera Stream Processing (CSP) makes developing stream processors easy: it can be done right from your desktop or any other development node. Analysts, data scientists, and developers can evaluate new features, develop SQL-based stream processors locally using SQL Stream Builder powered by Flink, and build Kafka consumers/producers and Kafka Connect connectors, all locally before moving to production.

Download CSP CE: https://www.cloudera.com/downloads/cdf/csp-community-edition.html
CSP CE documentation: https://docs.cloudera.com/csp-ce/latest/index.html

Tutorials:
- Streaming Analytics: Documentation Tutorial
- Streams Messaging: Documentation Tutorial
- Using SQL Stream Builder with Apache Pulsar
06-23-2022
01:02 PM
Welcome to the Stream Processing Group!
It's our hope that here you will be able to learn more about stream processing by asking questions and reading materials that like-minded folks have shared. By using the Stream Processing Community Edition you can easily start experimenting with the technology on your local machine, helping you understand its value for solving your use cases. If you're already an existing Stream Processing user, the Community Edition is your gateway to trying new features and better understanding the value of upgrading your existing deployments to the latest versions.
Cloudera Stream Processing is powered by Apache Flink and Kafka and provides a complete, enterprise-grade stream management and stateful processing solution. The combination of Kafka as the storage streaming substrate, Flink as the core in-stream processing engine, and first-class support for industry-standard interfaces like SQL and REST allows developers, data analysts, and data scientists to easily build hybrid streaming data pipelines that power real-time data products, dashboards, business intelligence apps, microservices, and data science notebooks.
Check out our getting-started post linked below to download the Community Edition, work through some simple tutorials, find other connectors, or read some blogs to learn more.
Click Here To Get Started
05-19-2022
04:18 PM
With the launch of CDP Public Cloud 7.2.15, Cloudera Streaming Analytics for Data Hub deployments has gained a powerful new feature: support for Dead Letter Queues. SQL Stream Builder now has a deserialization tab that lets users configure how message errors are handled as part of their SQL queries. Kafka connections can be configured with several options for what to do when a message fails to process; rather than stopping the job, the message can be handled in one of several ways: fail the job, ignore the message, ignore and log the message, or ignore the message and write it to a Dead Letter Queue (another Kafka topic). Read more about the deserialization capabilities in the Streaming Analytics 7.2.15 Documentation.
05-19-2022
03:15 PM
With the launch of CDP Public Cloud 7.2.15, Cloudera Streams Messaging for Data Hub deployments has gotten some powerful new features! Streams Messaging now supports multi-availability zone deployments for high availability, OAuth2 for clients connecting to Kafka brokers and Schema Registry, a redesigned Streams Messaging Manager Kafka Connect UI, new Kafka Connect security features, Debezium CDC connectors, the ability to import Kafka data into Atlas, and much more.

Streams Messaging High Availability cluster definition and template

You can use the template and definitions to deploy highly available Streams Messaging clusters that leverage multiple availability zones and ensure that functionality is not degraded when a single availability zone has an outage. Three new cluster definitions are introduced for Streams Messaging:
- Streams Messaging High Availability for AWS
- Streams Messaging High Availability for Azure (Technical Preview)
- Streams Messaging High Availability for Google Cloud (Technical Preview)

OAuth2 authentication available for Kafka

OAuth2 authentication support is added for the Kafka service. You can now configure Kafka brokers to authenticate clients using OAuth2. For more information, see OAuth2 authentication.

Changes for Streams Messaging Manager Kafka Connect UI

The Streams Messaging Manager experience for deploying Kafka Connect connectors has changed from a string-based JSON object to a submission form. Other enhancements include:
- The ability to import existing JSON configurations and populate the form
- Users can import NiFi flow definitions for stateless connectors and enhance the form with the parameters defined in the NiFi flow definition
- Form fields can be marked as secret to protect specific configurations by storing them in the new Kafka Connect secret storage
- Validation messages are now shown at the specific field with the issue rather than as an error message about the entire JSON configuration

Secure Kafka Connect

Kafka Connect is now generally available in the Public Cloud and can be used in production environments. This is the result of multiple changes, improvements, and new features related to Kafka Connect security, including the following:
- Ranger support for Kafka Connect, allowing policies to be defined at the Connect cluster level or on the named connectors themselves, enabling secure multi-tenant experiences
- Secret storage to securely store sensitive configurations used with connectors, such as passwords or tokens
- The Connect REST API can be secured by enabling SPNEGO authentication
- Kafka Connect connectors can be configured to override the JAAS configuration and restrict the usage of the worker principal

Debezium connector support

The following change data capture (CDC) connectors are added to Kafka Connect:
- Debezium MySQL Source
- Debezium Postgres Source
- Debezium SQL Server Source
- Debezium Oracle Source
Each connector requires CDP-specific steps before it can be deployed. For more information, see Connectors.

Importing Kafka entities into Atlas

Kafka topics and clients can now be imported into Atlas as entities (metadata) using a new action available for the Kafka service in Cloudera Manager. The new action is available at Kafka service > Actions > Import Kafka Topics Into Atlas. The action serves as a replacement/alternative for the kafka-import.sh tool. For more information, see Importing Kafka entities into Atlas.
Learn more about all the features above, and more, in the Streams Messaging 7.2.15 Documentation for Cloudera Streams Messaging on Public Cloud Data Hub.
05-19-2022
11:25 AM
With the launch of Cloudera Streaming Analytics 1.7.0 Parcel, Cloudera Streaming Analytics for Private Cloud deployments has gotten some powerful new features! Reworked Streaming SQL Console The User Interface (UI) of SQL Stream Builder (SSB), the Streaming SQL Console has been reworked with new design elements. High Availability for SSB You can use SQL Stream Builder (SSB) with a Load Balancer to distribute tasks over resources in case of a single point of failure. SQL Server CDC Connector support MS SQL Server Change Data Capture (CDC) connector is added to the list of supported connectors. The SQL Server CDC connector can be used with Flink DDL on the Streaming SQL Console either directly from the SQL Editor or using the DDL template. Configuring retention time or row count for Materialized Views You can configure how data should be retained for Materialized Views based on time or row count. Learn more about new features in the Streaming Analytics 1.7.0 Documentation
03-16-2022
08:03 AM
Check out the blog detailing the latest features: https://blog.cloudera.com/new-features-in-cloudera-streams-messaging-for-cdp-public-cloud-7-2-14/
10-27-2021
02:32 PM
With the launch of Cloudera Public Cloud 7.2.12, Streams Messaging for Data Hub deployments has gotten some interesting new features! Starting with this release, Streams Messaging templates support scaling with automatic rebalancing, allowing you to grow or shrink your Apache Kafka cluster based on demand. Another notable item is that Streams Replication Manager (SRM) now supports multi-cluster monitoring patterns and aggregates replication metrics from multiple SRM deployments into a single viewable location in Streams Messaging Manager (SMM). Last but not least is the Apache Atlas and Schema Registry (SR) integration: you can now view Kafka topic schemas in Atlas, letting you navigate data lineage from consumers and producers and see the schema of the topic they use without having to navigate back to the SR UI. Learn more about these features in our CSM 7.2.12 release blog! https://blog.cloudera.com/new-features-in-cloudera-streams-messaging-public-cloud-7-2-12/