The Cloudera Data Platform (CDP) comes with a wide variety of tools for moving data. These tools are the same in any cloud as well as on-premises.

Though there is no formal decision tree, I will summarize the key considerations from my personal perspective. In short, the choice comes down to the following steps:

Steps for finding the right tool to move your data

  1. Does the data stay within Hive, and do SQL queries suffice? > Hive, otherwise
  2. No complex operations (e.g. joins) needed? > NiFi, otherwise
  3. Batch processing? > Spark, otherwise
  4. Already have Kafka Streams/Spark Streaming in use? > Kafka Streams/Spark Streaming, otherwise
  5. Flink
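
The steps above can be read as a chain of checks where the first match wins. As a rough sketch only: the function name `choose_tool` and its parameter names are made up for illustration and are not part of CDP or any Cloudera API.

```python
from typing import Optional


def choose_tool(
    hive_sql_suffices: bool,
    needs_complex_operations: bool,
    is_batch: bool,
    existing_streaming_engine: Optional[str] = None,
) -> str:
    """Walk the five steps in order and return the first tool that matches.

    All names here are hypothetical; they only mirror the questions in the list.
    """
    # Step 1: data stays within Hive and SQL queries are enough.
    if hive_sql_suffices:
        return "Hive"
    # Step 2: no complex operations (e.g. joins) are needed.
    if not needs_complex_operations:
        return "NiFi"
    # Step 3: complex operations on batch data.
    if is_batch:
        return "Spark"
    # Step 4: streaming, and one of these engines is already in use.
    if existing_streaming_engine in ("Kafka Streams", "Spark Streaming"):
        return existing_streaming_engine
    # Step 5: streaming with no existing engine.
    return "Flink"


# Example: a streaming job with joins and no streaming engine in place yet.
print(choose_tool(hive_sql_suffices=False,
                  needs_complex_operations=True,
                  is_batch=False))
```

The order of the checks matters: a job that could run in either Hive or Spark ends up in Hive, because the simpler option is tested first.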

Some notes:

  • If both NiFi and a more complex solution would work, use NiFi
  • Use Flink as your streaming engine, unless there is a good reason not to. It is the latest generation of streaming engines.
  • I do not yet recommend using Flink for batch processing, but that will likely change soon
  • I did not include tools like Impala, HBase/Phoenix, and Druid, as their main purpose is accessing data
  • This is a basic decision tree. It should cover most situations, but do not hesitate to deviate if your situation calls for it

Also see my related article: Choose the right place to store your data

Full Disclosure & Disclaimer:

I am an employee of Cloudera, but this is not part of the formal documentation of the Cloudera Data Platform. It is purely based on my own experience of advising people on their choice of tooling.