Support Questions
Find answers, ask questions, and share your expertise
Alert: Welcome to the Unified Cloudera Community. Former HCC members be sure to read and learn how to activate your account here.

NiFi multiple RDBMS tables

NiFi multiple RDBMS tables


What is the best approach to extract data from relational tables assuming I have 100s of tables using NiFi? I understand I can use sqoop for this, but what is the best design approach to combine these two or use NiFi alone?


Re: NiFi multiple RDBMS tables

For a pure NiFi solution, you can use the ListDatabaseTables processor, which will emit a flow file for each table found using the specified search patterns (see the processor documentation for details). Then you can feed that into a ReplaceText where you create the SQL statements you want to execute ("SELECT * from ${db.table.fullname}" for example). A standalone NiFi instance is less efficient for this use case, compared to Sqoop which spreads the work out over a Hadoop cluster. Instead if you have a NiFi cluster, you can route the ReplaceText to a Remote Process Group that points to an Input Port in the same cluster. This has the effect of load-balancing the flow files across instances of downstream processors, thereby parallelizing the flow (somewhat similar to the Sqoop approach). The Input Port could be connected to an ExecuteSQL processor to actually fetch the data, and the rest of your downstream processing would follow.

Another thing to consider is the scale. Sqoop operates towards the scale of your Hadoop cluster, and NiFi operates towards the scale of the NiFi cluster. If you are working with huge amounts of data and need to massively parallelize the database fetches, Sqoop might be the better choice. If instead you have a NiFi cluster large enough for your parallelization needs, the above solution should work well. Not to mention all the other benefits you get with NiFi such as provenance, replay, a great UI for flow design, etc.