Community Articles

RyanCicak · ‎08-12-2024

In this article, I'll walk you through a Flink application I developed to process real-time data and write the output to HDFS within the Cloudera Data Platform (CDP). But first, let’s discuss why Flink is a powerful choice for streaming analytics and how Cloudera’s platform can help you make the most of it.

Why Flink?

Flink excels in scenarios where low-latency processing and real-time analytics are critical. Compared to Spark, Flink often shines in streaming use cases due to its advanced event-time processing and lower latency. However, Spark remains a robust choice for batch processing and when integrating with existing Spark-based pipelines.

This flexibility is where Cloudera stands out as the obvious choice for streaming analytics. Cloudera supports both Flink and Spark, giving you the power to choose the right tool for your specific use case. Beyond just tooling, Cloudera’s hybrid platform also allows you to reduce your cloud bill by running applications on-premise, while maintaining the flexibility to run those same applications in the cloud. This makes Cloudera an ideal choice for developers who need a platform that adapts to both on-premise and cloud environments seamlessly.

Application Overview

Now, let’s dive into the Flink application itself, designed for real-time data processing with three key stages:

Reading Data from a Socket: The application starts by connecting to a socket on localhost:10010, continuously streaming in text data line by line. In Flink, this represents the "source" stage of the application. Since the data is read from a single socket connection, the parallelism for this stage is inherently set to 1. This means that while you can configure parallelism when running your application, it won’t impact the source stage because reading from the socket is done only once.
Processing Data Using Time Windows: Once the data is ingested, it moves to the "transformation" stage. Here, the application splits the data into individual words, counts each one, and aggregates these counts over a 5-second time window. This stage takes full advantage of Flink's parallel processing capabilities, allowing you to scale the transformations by configuring parallelism as needed.
Writing Output to HDFS: Finally, the "target" stage involves writing the processed results to HDFS. One of the major benefits of running this application within Cloudera CDP is that Flink is integrated via the Cloudera Streaming Analytics (CSA) service. This integration means you don't need to worry about configuring connections to HDFS, even with Kerberos enabled out-of-the-box. CDP handles all these configurations for you, making it easier to securely write data to HDFS without additional setup.

How to Run This Application in Cloudera CDP

Running this Flink application in Cloudera CDP is straightforward. Here’s how you do it:

1. Set Up Your Maven Project:

Ensure your Maven project is configured correctly. Use the pom.xml provided earlier to manage dependencies and build the application.

Code can be found in GitHub.

2. Build the Application:

Use Maven to build your application into a single JAR:

mvn clean package

It’s important to note that in your pom.xml, the dependencies are marked as provided. This is crucial because Cloudera CDP already loads these dependencies out of the box. By marking them as provided, you ensure that they are not included in the JAR, avoiding any potential conflicts or unnecessary duplication.

3. Upload the JAR to Cloudera CDP:

Upload the generated JAR file to your HDFS or S3 storage in Cloudera CDP. Make sure to note the path where you upload the JAR.

4. Run the Flink Application:

Execute the following command to run your Flink application on YARN in Cloudera CDP:

flink run-application -t yarn-application -p 1 -ynm PortToHDFSFlinkApp PortToHDFSFlinkApp-1.0-SNAPSHOT.jar

Here’s a breakdown of the command:

-t yarn-application: Specifies that the application should run as a YARN application.
-p 1: Sets the parallelism to 1, ensuring that the source stage runs with a single parallel instance. This is critical since the socket connection is inherently single-threaded.
-ynm PortToHDFSFlinkApp: Names the application, making it easier to identify in the YARN resource manager.
-s hdfs:///path/to/savepoints/savepoint-xxxx: Specifies the path to the savepoint from which the job should resume. (optional)

5. Interact with the Application:

Once the application is launched within CDP, you can access the Flink UI to find the node where the source is running on port 10010. After identifying the correct node, you can interact with the application by logging into that node and using the following command:

nc -l 10010

This command will start a listener on port 10010, allowing you to type words directly into the terminal. Each word you type, followed by pressing enter/return, will be processed by the Flink application in real-time. This is a simple yet powerful way to test the application's functionality and observe how data flows from the source, through the transformation stage, and finally to the HDFS target.

6. Monitor the Job:

While the job is running, you can monitor its progress through the Flink dashboard available in Cloudera CDP. This dashboard provides valuable insights into the job’s performance, including task execution details and resource usage.

Conclusion

By leveraging Cloudera CDP’s integration of Flink through Cloudera Streaming Analytics, you can easily deploy and manage complex streaming applications without worrying about the underlying configurations—like connecting to HDFS in a Kerberized environment. This PaaS setup simplifies deployment, allowing you to focus on processing and analyzing your data efficiently.

With Cloudera’s support for both Flink and Spark, you get the best of both worlds in streaming analytics. Whether you’re leveraging Flink for real-time data processing or Spark for batch jobs, Cloudera guides you to the right tools for your needs and ensures you can implement them with ease.

Cloudera Community

Community Articles

Running a Flink Application in Cloudera CDP: A Step-by-Step Guide

Cloudera Streaming Analytics (CSA)

Why Flink?

Application Overview

How to Run This Application in Cloudera CDP

Conclusion