Created 07-27-2016 01:27 AM
Processors like ExecuteScript, ReplaceText, and TransformXML provide powerful means to transform content and thus enable NiFi as an ETL tool. What principles, guidelines, or decision points should be evaluated to determine when NiFi is the best tool for ETL and when it is not? Concerns include capabilities, performance, auditing, and so on.
Created 07-27-2016 02:48 AM
As you mentioned, NiFi does offer many capabilities that can be used to perform ETL functionality. In general, though, NiFi is more of a general-purpose dataflow tool. NiFi clusters usually scale to dozens of nodes, so if your "transform" needs to run on hundreds (or thousands) of nodes in parallel, it may be better to use NiFi to bring data to another processing framework like Storm, Spark, etc. Similarly for the "extract" part... NiFi can extract from relational databases with the ExecuteSQL and QueryDatabaseTable processors, which covers many use cases, and for more extreme use cases something like Sqoop can leverage a much larger Hadoop cluster to perform the extraction. So, as always, there is no single correct answer, and it depends a lot on the use case.
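To make the "extract" point concrete, here is a rough Python sketch of the incremental-fetch pattern that QueryDatabaseTable implements: track a "maximum value" watermark column and only pull rows past it on each run. The sqlite3 database and `orders` schema below are hypothetical stand-ins for a real source RDBMS, not anything from NiFi itself.

```python
import sqlite3

# Hypothetical source table standing in for a real RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)",
                 [(10.0,), (20.0,), (30.0,)])

def fetch_new_rows(conn, last_seen_id):
    """Fetch only rows whose id exceeds the stored watermark,
    mimicking QueryDatabaseTable's 'Maximum-value Column' tracking."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    # Advance the watermark only if new rows arrived.
    new_watermark = rows[-1][0] if rows else last_seen_id
    return rows, new_watermark

rows, watermark = fetch_new_rows(conn, 0)        # first run: all three rows
rows2, watermark2 = fetch_new_rows(conn, watermark)  # second run: nothing new
```

In NiFi the watermark is persisted in processor state so the flow survives restarts; here it is just a returned value for illustration.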
Created 07-27-2016 05:22 AM
"ETL" which is embarassingly parallel (all processing logic can execute completely based purely on the contents of the incoming record itself) is in NiFi's sweet spot.
ETL that requires lookups across billions of records, or that must perform "group by" operations, fits better in traditional Hadoop solutions like Hive, Pig, or Spark.
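By contrast, a "group by" cannot be computed from one record at a time: it needs state accumulated across every record sharing a key, which at scale means shuffling data so matching keys land together, exactly the work Hive, Pig, and Spark are built for. A toy sketch of that cross-record dependency, with a hypothetical `customer`/`amount` schema:

```python
from collections import defaultdict

def group_by_sum(records):
    """Aggregate amounts per customer. The totals dict is shared state
    across all records, so the full keyed dataset must be visible (or
    shuffled to one place per key) before any total is final."""
    totals = defaultdict(float)
    for record in records:
        totals[record["customer"]] += record["amount"]
    return dict(totals)

records = [
    {"customer": "a", "amount": 10.0},
    {"customer": "b", "amount": 5.0},
    {"customer": "a", "amount": 2.5},
]
result = group_by_sum(records)
```

On a handful of NiFi nodes this is fine for small batches; for billions of records the shuffle-and-reduce step is where a dedicated processing framework earns its keep.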