Created 03-30-2016 07:12 PM
What factors should inform whether I use NiFi or Sqoop for ingesting my data?
Created 03-30-2016 07:50 PM
Apache NiFi is a tool for building dataflow pipelines, making it just the right tool for Internet of Things (IoT), Internet of Everything (IoE), or any data-in-motion use case. Using its built-in connectors (known as processors in the NiFi world), it can get/put data from/to HDFS, Hive, RDBMS, Kafka, etc. out of the box. It also has a really cool and user-friendly interface that can be used to build a dataflow in minutes by dragging and dropping processors.
Sqoop, on the other hand, is designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. So depending on the application you are building, you have a choice of tools at hand.
Created 03-30-2016 07:59 PM
For transferring data out of a relational database, Sqoop is still the way to go. The main reason is that NiFi will not parallelize the loading of the data automatically. When you create an ExecuteSQL processor, it will start pulling data from the DB on the thread that the processor is running on. The only way to get parallelism at this point is to create multiple processors, each pulling a partition of the target data. Sqoop will take the SQL statement and create workers that each pull their own partition of the data, automatically providing parallelism and ensuring the fastest, most efficient load of RDBMS data. Don't get me wrong, NiFi is great for just about any simple event processing use case that can be conceived, but for a pure large-scale RDBMS ingest use case, Sqoop is the way to go for now.
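To make that concrete, here is a minimal sketch of a parallel Sqoop import (the connection string, table, and paths are hypothetical):

    # --split-by names the column whose MIN/MAX range Sqoop divides
    # evenly among the --num-mappers parallel workers
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --split-by order_id \
      --num-mappers 4 \
      --target-dir /data/raw/orders

Lowering --num-mappers is also the usual way to cap the load the import places on the source database.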
Created 04-03-2017 09:15 PM
@Vadim Vaks It may be worth noting that the more recent GenerateTableFetch processor provides additional flexibility for parallelizing the retrieval of data from an RDBMS.
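For context, GenerateTableFetch does not pull any rows itself; it emits a series of page-sized SQL statements that downstream ExecuteSQL processors (possibly spread across a cluster) can run concurrently. With a hypothetical orders table, a partition size of 50000, and order_id as the maximum-value column, the generated statements look roughly like:

    SELECT * FROM orders WHERE order_id <= 250000 ORDER BY order_id LIMIT 50000 OFFSET 0
    SELECT * FROM orders WHERE order_id <= 250000 ORDER BY order_id LIMIT 50000 OFFSET 50000
    ...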
Created 03-31-2016 04:39 AM
@Randy Gelhausen @Geoffrey Shelton Okot @Vadim
In my opinion, NiFi has an advantage over Sqoop because development time is significantly reduced. If you are not planning on bringing in TBs of data and your incremental data volume is not huge, you can use NiFi to bring in the data. Bringing data in parallel using Sqoop is a double-edged sword: data gets loaded faster, but it puts load on the source data store, which may not be acceptable. In an enterprise environment there are limits on the number of connections open to a database, and parallelism in NiFi can be obtained by working on independent datasets or on partitions of the target data. To me, NiFi's shortcomings are more on the operational side, with a lack of robust options for notification in case of failure.
Created 03-31-2016 12:03 PM
There is certainly no reason why you could not use NiFi for one-off movement of small amounts of data or for repeated enrichment of events from an RDBMS. But just as you would not use a hand axe to chop down a redwood, you generally would not use an event/message stream processing tool to handle a large static ETL job such as a database export. It's simply a matter of using the tool for the purpose it was designed for.
Created 04-01-2016 12:07 PM
Neither Sqoop nor NiFi is as minimally invasive as, for instance, Oracle GoldenGate, which reads Oracle's REDO logs for changes rather than firing queries on a schedule. In many cases Sqoop or NiFi are fine, but when you need to make sure that the DB is not overloaded by many consecutive requests, it's worth looking at non-OSS technologies. Some of the open-source alternatives seem to re-invent the elliptic wheel over and over again.
Sqoop at least allows you to specify a table (if you want the whole thing), whereas in NiFi you need to specify the query, which is effectively the same, just more typing. Sqoop's split-by logic just takes the MIN/MAX of a column, which may not be very efficient (for instance, if you have no index on that column). I'm not very familiar with the performance of NiFi for relational DBs and the load on those, because I haven't found NiFi to be too useful. For transporting the data it's perhaps fine, but any manipulation is likely going to be very, very tedious. That's what ETL tools were designed for, and most of them nowadays have connectors for HDFS and the like.
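Concretely (the table and column are hypothetical), with --split-by order_id and four mappers, Sqoop first issues a boundary query and then carves the range into even slices:

    -- boundary query, run once up front
    SELECT MIN(order_id), MAX(order_id) FROM orders
    -- if that returns 1 and 1000000, each mapper then runs roughly
    SELECT * FROM orders WHERE order_id >= 1 AND order_id < 250001
    SELECT * FROM orders WHERE order_id >= 250001 AND order_id < 500001
    ...

Without an index on order_id, both the boundary query and the range scans can degenerate into full table scans, and skewed values leave some mappers with far more rows than others.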
Disclaimer: I do not work for, I am not affiliated with, and I am definitely not paid by Oracle.
Created 04-10-2016 03:00 AM
@Ian Hellström You don't need to specify the exact SQL query in NiFi either. The query can be generated using NiFi Expression Language, driven from any input file or table. NiFi is not the best tool for performing ETL-style operations, but for ingesting data into HDFS it is effective.
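As a sketch (the attribute names are hypothetical), a templated query in an ExecuteSQL processor could use Expression Language like this, with the attributes populated upstream from a config file or control table:

    SELECT * FROM ${source.table}
    WHERE ${incremental.column} > '${last.load.timestamp}'

Each flowfile carrying different attribute values then produces a different concrete query, so one flow can service many tables.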
Created 02-27-2018 09:03 PM
Wanted to mention my post on the subject.