NiFi ETL: Principles or decision points for ETL on NiFi versus more traditional tools (or Pig batch)

Guru

Processors like ExecuteScript, ReplaceText, and TransformXml provide powerful means of transforming content, and thus enable NiFi to act as an ETL tool. What principles, guidelines, or decision points are there for evaluating when NiFi is the best tool for ETL and when it is not? Concerns include capabilities, performance, auditing, and so on.

1 ACCEPTED SOLUTION

Master Guru

As you mentioned, NiFi does offer many capabilities that can be used to perform ETL. In general, though, NiFi is more of a general-purpose dataflow tool. NiFi clusters usually scale to dozens of nodes, so if your "transform" needs to run on hundreds (or thousands) of nodes in parallel, it may be better to use NiFi to bring the data to another processing framework like Storm or Spark. The same applies to the "extract" part: NiFi can extract from relational databases with the ExecuteSQL and QueryDatabaseTable processors, which covers many use cases, and for more extreme use cases something like Sqoop can leverage a much larger Hadoop cluster to perform the extraction. So, as always, there is no single correct answer; it depends a lot on the use case.
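For the handoff pattern above, a common arrangement is to have NiFi publish records to Kafka (e.g. via PublishKafka) and let Spark do the heavy lifting. A minimal PySpark sketch of that handoff, assuming exactly that setup; the broker address, topic name, and HDFS paths are hypothetical placeholders, and the Kafka source requires the spark-sql-kafka package on the Spark classpath:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nifi-handoff").getOrCreate()

# Consume the records that NiFi publishes to Kafka; broker address
# and topic name are hypothetical placeholders.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker1:9092")
       .option("subscribe", "nifi-records")
       .load())

# Kafka delivers values as bytes; cast them to strings before handing
# the data to whatever large-scale transform runs on the Spark cluster.
records = raw.selectExpr("CAST(value AS STRING) AS record")

(records.writeStream
 .format("parquet")
 .option("path", "hdfs:///data/landed")
 .option("checkpointLocation", "hdfs:///checkpoints/nifi-handoff")
 .start()
 .awaitTermination())
```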


2 REPLIES

"ETL" which is embarassingly parallel (all processing logic can execute completely based purely on the contents of the incoming record itself) is in NiFi's sweet spot.

ETL that requires lookups across billions of records, or that must perform "group by" operations, fits better in traditional Hadoop solutions like Hive, Pig, or Spark.
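For contrast, here is the kind of cross-record aggregation NiFi is not built for, as a minimal PySpark sketch; the input path and column names are hypothetical, and the assumption is that NiFi has already landed the raw files in HDFS:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-totals").getOrCreate()

# NiFi lands the raw records; Spark performs the "group by" across
# all of them, which requires a shuffle over the whole data set.
orders = spark.read.json("hdfs:///data/raw/orders")

daily = (orders
         .groupBy("customer_id", F.to_date("order_ts").alias("order_date"))
         .agg(F.sum("amount").alias("daily_total")))

daily.write.mode("overwrite").parquet("hdfs:///data/curated/daily_totals")
```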