Created on 07-23-2020 01:49 PM - edited 07-28-2020 11:48 AM
The Cloudera Data Platform (CDP) comes with a wide variety of tools that move data. These tools are the same in any cloud as well as on-premises.
Though there is no formal decision tree, I will summarize the key considerations from my personal perspective. In short, it can be visualized like this:
Steps for finding the right tool to move your data
Does the data stay within Hive, and do SQL queries suffice? > Hive, otherwise
No complex operations (e.g. joins) needed? > NiFi, otherwise
Batch processing? > Spark, otherwise
Already have Kafka Streams/Spark Streaming in use? > Kafka Streams/Spark Streaming, otherwise
Flink
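The steps above can also be written down as a small function. This is only a minimal sketch of the decision tree as I read it, not an official Cloudera utility: the `Workload` class, its field names, and `choose_tool` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a data movement job (illustrative fields only)."""
    stays_within_hive: bool
    sql_suffices: bool
    needs_complex_operations: bool  # e.g. joins
    is_batch: bool
    uses_kafka_streams: bool = False
    uses_spark_streaming: bool = False


def choose_tool(w: Workload) -> str:
    """Walk the decision tree top to bottom and return the first tool that fits."""
    # Step 1: data stays in Hive and SQL is enough.
    if w.stays_within_hive and w.sql_suffices:
        return "Hive"
    # Step 2: no complex operations such as joins.
    if not w.needs_complex_operations:
        return "NiFi"
    # Step 3: complex operations on batch data.
    if w.is_batch:
        return "Spark"
    # Step 4: streaming, and a streaming engine is already in use.
    if w.uses_kafka_streams:
        return "Kafka Streams"
    if w.uses_spark_streaming:
        return "Spark Streaming"
    # Step 5: default streaming engine.
    return "Flink"


if __name__ == "__main__":
    job = Workload(
        stays_within_hive=False,
        sql_suffices=False,
        needs_complex_operations=True,
        is_batch=False,
    )
    print(choose_tool(job))
```

The order of the checks matters: each question is only asked when all earlier ones were answered with "no", which is what the "otherwise" in each step expresses.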
Some notes:
If both NiFi and a more complex solution would do the job, use NiFi
Use Flink as your streaming engine, unless there is a good reason not to. It is the latest generation of streaming engines.
I do not yet recommend Flink for batch processing, but that will likely change soon
I did not include tools like Impala, HBase/Phoenix, and Druid, as their main purpose is accessing data rather than moving it
This is a basic decision tree. It should cover most situations, but do not hesitate to deviate if your situation calls for it
I am an employee of Cloudera, but this is not part of the formal documentation of the Cloudera Data Platform. It is purely based on my own experience of advising people on their choice of tooling.