In part 1 of this series we discussed the growing relevance of streaming technologies and explained why existing Cloudera customers who currently use Apache Flume should consider moving to Cloudera DataFlow (CDF).

 

Cloudera DataFlow is an umbrella term for the streaming technologies from Cloudera. CDF is supported on CDH 5 / CDH 6 and HDP 2 / HDP 3, so nothing stops customers from adopting Cloudera DataFlow right now. Doing so puts them in a supported configuration for when they upgrade to the new Cloudera Data Platform (CDP).

 

CDF includes the technology to address a number of areas:

  • Edge Flow Management
  • Core Flow Management
  • Stream Processing
  • Streaming Analytics
  • Enterprise Services

A good summary of these components can be found in the blog post Introducing Cloudera DataFlow (CDF).

 

Cloudera DataFlow - Data-In-Motion Platform


 

If you are a traditional Cloudera customer using the Cloudera Distribution of Apache Kafka, there are a number of new and exciting management technologies available via CDF. For example, the Cloudera Streams Management component includes:

 

  • Cloudera Streams Messaging Manager, which provides a visual and interactive user interface for managing topics in Apache Kafka.
  • Cloudera Streams Replication Manager for managing replication between Kafka clusters based on MirrorMaker 2.

However, Apache Flume has been replaced in CDF by Apache NiFi and MiNiFi. Using Apache NiFi / MiNiFi instead of Apache Flume has a number of benefits:

  • It is simple to use, with an intuitive user interface. The drag-and-drop approach to designing data pipelines improves user productivity by replacing many lines of code and configuration files.
  • There are 290+ pre-built processors for data source connectivity, ingestion, transformation, and content routing.
  • NiFi integrates with NiFi Registry for version-controlling dataflows. This also supports the software development lifecycle (SDLC) when promoting flows from one environment to another, e.g. development to production.
  • Point-in-time capability - you can go back to a previous point in time, inspect the data as it was at that point, and replay it downstream.
  • Scale-out architecture - adding more nodes increases the network and disk bandwidth for ingestion and transformation.
  • Data lineage and provenance are built-in features of Apache NiFi, with graphical information and metrics that describe data on its journey from source to target.
  • Cloudera Edge Management (CEM) provides a management user interface for deploying and managing MiNiFi agents on edge devices.
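
The provenance and point-in-time replay ideas in the list above can be illustrated with a toy model. The sketch below is plain Python; it does not use NiFi's API and does not reflect how NiFi is implemented. `ProvenanceLog`, `ProvenanceEvent`, and the step names are hypothetical, invented only for illustration. The idea is simply that a content snapshot is recorded every time a record passes through a step, so the lineage can be listed afterwards and an earlier snapshot can be sent through downstream steps again.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class ProvenanceEvent:
    """One entry in the toy provenance log (hypothetical, not a NiFi type)."""
    event_id: int
    record_id: str
    step: str      # name of the pipeline step that produced this event
    content: str   # snapshot of the record content after that step


class ProvenanceLog:
    """Append-only log of content snapshots, keyed by record."""

    def __init__(self) -> None:
        self._events: List[ProvenanceEvent] = []
        self._ids = itertools.count(1)

    def record(self, record_id: str, step: str, content: str) -> ProvenanceEvent:
        event = ProvenanceEvent(next(self._ids), record_id, step, content)
        self._events.append(event)
        return event

    def lineage(self, record_id: str) -> List[ProvenanceEvent]:
        """All events for one record, in the order they happened."""
        return [e for e in self._events if e.record_id == record_id]

    def replay(self, event_id: int,
               downstream: List[Tuple[str, Callable[[str], str]]]) -> str:
        """Re-run downstream steps starting from the snapshot at event_id."""
        event = next(e for e in self._events if e.event_id == event_id)
        content = event.content
        for name, fn in downstream:
            content = fn(content)
            # Replayed steps are logged too, so the lineage stays complete.
            self.record(event.record_id, f"replay:{name}", content)
        return content


# Usage: push one record through three steps, then replay from the middle.
log = ProvenanceLog()
received = log.record("r1", "receive", " hello ")
trimmed = log.record("r1", "trim", received.content.strip())
log.record("r1", "upper", trimmed.content.upper())

steps = [e.step for e in log.lineage("r1")]

# Go back to the snapshot taken after "trim" and send it down a different path.
replayed = log.replay(trimmed.event_id, [("title", str.title)])
```

The log keeps content snapshots instead of only event names because replay needs the data as it was at that point; a log of names alone would describe lineage but could not support replay.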

 

Continuous data delivery, streaming applications and real-time analysis are becoming increasingly important and more widely adopted as part of a data architecture strategy. So is the need to comply with data protection regulations such as GDPR in the EU and CCPA in California. This is why technologies such as Apache NiFi, with graphical data pipelines and built-in support for data lineage and provenance, provide a strong framework for working towards regulatory compliance.

 


 

One reason customers adopt Cloudera technology is the portfolio of technology that we offer, all under a governed, secure and integrated data and analytics platform. This means that we can build and integrate different streaming applications to address a variety of use cases. For example, Cloudera supports Apache HBase and Apache Kudu as backend storage for real-time applications. In addition, with Cloudera Machine Learning we can build predictive models, then manage and deploy them into streaming applications. This is why we describe Cloudera as an end-to-end Edge2AI platform.
