Created 09-03-2016 01:34 AM
Sqoop is faster than NiFi at pulling data from relational databases because it parallelizes the transfer across multiple connections, whereas NiFi does not (https://community.hortonworks.com/questions/25228/can-i-use-nifi-to-replace-sqoop.html). NiFi is easy to develop with and has data lineage and other governance and monitoring capabilities out of the box, which makes a case for using it to ingest relational data into HDFS, at least for one-time offloads or smallish table sizes (e.g. for data science work). Are there any benchmark results out there that describe how long NiFi takes to offload relational tables of given sizes? Benchmarks are of course specific to implementations (e.g. CPU cores), but some numbers would be informative.
Created 09-04-2016 07:09 PM
Hi @gkeys
I think NiFi and Sqoop are two different tools serving two different use cases and cannot be directly compared, at least not yet.
Sqoop is bundled with bulk-loading adapters developed by database vendors, sometimes jointly with Hortonworks. The purpose of Sqoop is bulk loading of data to and from an RDBMS, and it uses fast connectors designed for that job. Sqoop's performance is therefore really a measure of the bulk-loading tool it is using. Since these are specialized bulk-loading tools designed for batch jobs, Sqoop really shines in these use cases.
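To make the parallelism concrete, a typical Sqoop import splits the table across several mappers, each opening its own database connection. A sketch of such an invocation (the connection string, credentials, table, and paths here are hypothetical placeholders; adjust to your environment):

```shell
# Import the "orders" table with 4 parallel mappers, splitting the
# work by ranges of the order_id column. Each mapper pulls its own
# slice of the table over a separate connection.
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl_user -P \
  --table orders \
  --split-by order_id \
  --num-mappers 4 \
  --target-dir /data/raw/orders
```

The `--num-mappers` / `--split-by` pair is what gives Sqoop its throughput edge over a single-connection JDBC fetch: the load is sharded at the source.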
NiFi, on the other hand, is a system designed to move data within the organization, bring data in from outside sources, and facilitate data movement between data centers. The data NiFi helps move is usually live data from applications, logs, devices, and other sources producing event data. Given how rich NiFi's feature set is, you can also use it to fetch data from a lot of other sources, including databases and files. For reading data from databases, NiFi uses a JDBC adapter, which lets you move x number of records at a time from a database. The bottleneck here is the JDBC adapter itself.
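The "x records at a time" pattern is plain cursor-based fetching over a single connection, which is why it becomes the bottleneck. A minimal sketch of that pattern in Python, using an in-memory sqlite3 table as a stand-in for a JDBC source (the table, column names, and batch size are illustrative only):

```python
import sqlite3

# A small in-memory table standing in for a remote RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"row-{i}",) for i in range(10)],
)

BATCH_SIZE = 4  # "x records at a time"
cur = conn.execute("SELECT id, payload FROM events")

batches = []
while True:
    # One connection, one cursor: total throughput is bounded by this loop,
    # no matter how fast the downstream system can absorb the data.
    rows = cur.fetchmany(BATCH_SIZE)
    if not rows:
        break
    batches.append(rows)

print([len(b) for b in batches])  # batch sizes: [4, 4, 2]
```

Contrast this with Sqoop's approach of sharding the table across parallel connections: here every row funnels through a single fetch loop, so the driver and the database round-trip dominate.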
When we measure NiFi's performance, we are not including the cost of fetching data from the source. What we are measuring is how fast NiFi can move data along once it has it. That performance is documented here, and it's about 50 MB/s of read/write on a typical server. Can a JDBC source deliver data at this rate? Honestly, I doubt it, but that has nothing to do with NiFi. It's a function of the driver, the database, and a lot of other variables, just as in any other JDBC program.