What is the difference between Nifi and Kettle?

 
Accepted Solutions

New Contributor

Context: I just came to Hortonworks after 6 years at Pentaho, the controller of the Kettle project. I agree with everything already said, but will add:

- Kettle is a batch-oriented extract-transform-load (ETL) tool, primarily used for loading data warehouses/marts. It competes with tools such as Informatica, Talend, and DataStage. But Kettle is the only traditional ETL tool that runs inside Hadoop as native MR or YARN.

- you can do micro-batches (e.g. run a transform every few seconds/minutes), but it's not really intended for streaming

- the only true streaming capability is to use the Java Message Service (JMS) consumer and producer steps to connect to a JMS-compliant data bus

- it has open-source "big data" connectors for reading from and writing to HDFS, MongoDB, Cassandra, and Splunk

- it is multi-threaded and performance is generally very good (e.g. one performance test processed 12K rows/second/core). It tends to scale up linearly with added cores. It also scales out linearly through J2EE appserver clustering.

- Kettle is a Java runtime engine, and can run natively inside Hadoop as a MapReduce or YARN job

- it has pretty nice workflow capabilities that Pentaho touts as an alternative to using Oozie. But it also has an Oozie workflow job step.

I'm new to Nifi so can't really contrast the two right now, but hopefully this information is useful.
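To make the micro-batch point above a bit more concrete, here is a minimal Python sketch of the idea: rows are collected into small fixed-size groups and a transform runs once per group, rather than once per event as it arrives. This is not Kettle code; the names `micro_batches` and `run_transform` and the row layout are made up for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size rows from the source."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


def run_transform(batch: List[dict]) -> List[dict]:
    """Hypothetical transform step: upper-case the 'name' field of each row."""
    return [{**row, "name": row["name"].upper()} for row in batch]


if __name__ == "__main__":
    # Hypothetical source rows; a real job would read from a file or database.
    source = ({"id": i, "name": f"row{i}"} for i in range(5))
    for batch in micro_batches(source, batch_size=2):
        print(run_transform(batch))
```

In a scheduled job the outer loop would be triggered every few seconds or minutes, so each row waits for the batch it belongs to. That wait is why micro-batching only approximates streaming.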

4 REPLIES

Well, it's a super-loaded question, but I'll try to highlight the most important differences and give some food for thought:

  • Kettle is an ETL tool; it came from the ETL world and largely remains there. Pentaho is also making a shift to introduce many BI and reporting features.
  • NiFi is a dataflow management platform. This term is somewhat new to the IT crowd, but I'm sure over time it will become as ubiquitous as ETL, BI, etc. It has some aspects of ETL, streaming, batch, and workflow, but occupies a niche of its own. Imagine if you wanted ingest to become a first-class citizen in your IT landscape.
  • NiFi has an interactive model, where one 'molds' the flow as the data continues to flow. E.g. there's no requirement to compile and deploy changes to the flow 'somewhere'.
  • NiFi's Provenance feature is the biggest differentiator. Think super-charged lineage, where complete data history is captured on top of lineage (not sampling, but full payloads, changes, etc.). This enables powerful search and replay capabilities down the line, too.
  • NiFi has native clustering, a remote site-to-site protocol, backpressure, flow control, and a full REST API, among other highlights - look them up 🙂
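For anyone unfamiliar with backpressure, the general idea can be sketched in a few lines of Python: a queue between two components has a threshold, and once it is reached the upstream side has to wait instead of piling up more data. This is only an illustration of the concept, not NiFi's implementation; `Connection`, `simulate`, and the tick-based scheduling are assumptions made for the example.

```python
from collections import deque


class Connection:
    """A bounded queue between an upstream producer and a downstream consumer."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self._queue = deque()

    def __len__(self) -> int:
        return len(self._queue)

    def is_full(self) -> bool:
        return len(self._queue) >= self.threshold

    def offer(self, item) -> bool:
        """Enqueue the item, or return False if the threshold is reached."""
        if self.is_full():
            return False
        self._queue.append(item)
        return True

    def poll(self):
        """Dequeue the oldest item; the caller must check the length first."""
        return self._queue.popleft()


def simulate(events, threshold: int, drain_every: int):
    """The producer offers one event per tick; the consumer takes one event
    every drain_every ticks. Returns (delivered events, blocked tick count)."""
    conn = Connection(threshold)
    pending = deque(events)
    delivered = []
    blocked_ticks = 0
    tick = 0
    while pending or len(conn) > 0:
        tick += 1
        if pending:
            if conn.offer(pending[0]):
                pending.popleft()
            else:
                blocked_ticks += 1  # backpressure: upstream waits, nothing is dropped
        if tick % drain_every == 0 and len(conn) > 0:
            delivered.append(conn.poll())
    return delivered, blocked_ticks


if __name__ == "__main__":
    delivered, blocked = simulate(range(10), threshold=2, drain_every=3)
    print(delivered, blocked)
```

The point of the design is that a slow consumer slows the producer down rather than causing data loss or unbounded memory growth.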

Guru

Kettle is primarily an ETL tool designed to load static data from one source into another. NiFi is certainly capable of similar kinds of tasks, but its main focus is dealing with really fast flows of real-time events. NiFi can run as a really small single-instance JVM suitable to act as a data collection agent for an endpoint, and it can also scale through clustering to handle very large volumes of data from lots of endpoints. Once a cluster is up and running, changes can be made dynamically, without a redeploy or even much of a disruption to the data flows. For example, suppose an endpoint in the field is sending out events in a JSON format, but the application back at the data center now expects a JSON object that has more fields than before and is listening on a different IP and port in a different data center. NiFi can capture the event in the field and then transform and direct the event to the correct listener in the required format without coding, redeployment, or even much of a disruption to the data flow. The best part is that the entire flow is tracked and every modification or action on the event is visible and searchable. This makes it easy to account for and troubleshoot any issues that occur in transit.
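The example above (add fields to a JSON event, send it to a new listener, and keep a record of every action) can be sketched in Python to show the moving parts. In NiFi this would be configured in the UI rather than coded, so treat this purely as an illustration; the function `enrich_and_route`, the field names, the address `10.0.0.5:9090`, and the action labels are all hypothetical, and the sketch only records the routing decision without opening a network connection.

```python
import json
from datetime import datetime, timezone


def enrich_and_route(raw_event: str, extra_fields: dict, destination: tuple):
    """Add missing fields to a JSON event and record each action taken.

    Returns the new JSON payload and a history list, loosely in the spirit
    of a provenance trail.
    """
    history = []

    def record(action: str, detail: dict) -> None:
        history.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "detail": detail,
        })

    event = json.loads(raw_event)
    record("RECEIVE", {"fields": sorted(event)})

    # Only add fields the event does not already carry; never overwrite.
    added = {key: value for key, value in extra_fields.items() if key not in event}
    event.update(added)
    record("MODIFY", {"added": sorted(added)})

    host, port = destination
    record("ROUTE", {"host": host, "port": port})
    return json.dumps(event, sort_keys=True), history


if __name__ == "__main__":
    payload, trail = enrich_and_route(
        '{"id": 1, "temp": 21.5}',
        {"site": "field-7", "schema": 2},  # hypothetical new required fields
        ("10.0.0.5", 9090),                # hypothetical new listener
    )
    print(payload)
    for entry in trail:
        print(entry["action"], entry["detail"])
```

Because every step appends to the history, it is possible to answer afterwards what happened to a given event and where it was sent, which is the kind of question the tracking described above is meant to make easy.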

New Contributor

So can NiFi be said to be uniquely placed in the systems integration space, as something very similar to Microsoft BizTalk?