Community Articles

DennisJaheruddi · 07-23-2020

The Cloudera Data Platform (CDP) comes with a wide variety of tools that move data. These tools are the same in any cloud as well as on-premises.

Though there is no formal decision tree, I will summarize the key considerations from my personal perspective. In short, it can be visualized like this:

Steps for finding the right tool to move your data

Data stays within Hive and SQL queries suffice? > Hive, otherwise
No complex operations (e.g. joins)? > NiFi, otherwise
Batch? > Spark, otherwise
Already have Kafka Streams/Spark Streaming in use? > Kafka Streams/Spark Streaming, otherwise
Flink
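
The steps above can also be read as a chain of checks. Below is a minimal Python sketch of that chain. The `Workload` class, its field names, and the `choose_tool` function are hypothetical names introduced only for illustration; they are not part of CDP or of any of the tools mentioned.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a data movement job (illustration only)."""
    stays_within_hive: bool         # data stays in Hive and SQL queries suffice
    needs_complex_operations: bool  # e.g. joins
    is_batch: bool                  # batch rather than streaming
    has_existing_streaming: bool    # Kafka Streams/Spark Streaming already in use


def choose_tool(w: Workload) -> str:
    """Walk the decision steps in order and return the first tool that matches."""
    if w.stays_within_hive:
        return "Hive"
    if not w.needs_complex_operations:
        return "NiFi"
    if w.is_batch:
        return "Spark"
    if w.has_existing_streaming:
        return "Kafka Streams/Spark Streaming"
    return "Flink"


if __name__ == "__main__":
    # A streaming job with joins and no existing streaming stack.
    job = Workload(
        stays_within_hive=False,
        needs_complex_operations=True,
        is_batch=False,
        has_existing_streaming=False,
    )
    print(choose_tool(job))
```

The order of the checks matters: each question is only asked when all earlier ones did not apply, which mirrors the "otherwise" at the end of each step.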

Some notes:

If both NiFi and a more complex solution would work, use NiFi.
Use Flink as your streaming engine, unless there is a good reason not to. It belongs to the latest generation of streaming engines.
I do not recommend using Flink for batch processing yet, but that will likely change soon.
I did not include tools like Impala, HBase/Phoenix, and Druid, as their main purpose is accessing data.
This is a basic decision tree. It should cover most situations, but do not hesitate to deviate if your situation calls for it.

Also see my related article: Choose the right place to store your data

Full Disclosure & Disclaimer:

I am an employee of Cloudera, but this article is not part of the formal documentation of the Cloudera Data Platform. It is based purely on my own experience advising people on their choice of tooling.

How to choose the right tool to move data with the Cloudera Data Platform
