
100 change data capture processors for ETL into DWH

Expert Contributor

Hi All, Wish you a healthy life!!

I am trying to load a DWH server (MySQL) with incoming data from 100 MySQL databases using NiFi.

I am using NiFi's CaptureChangeMySQL processor (which acts as a binlog client) to capture changes and apply them incrementally.

 

My PoC was successful: it loads a single DWH table from 2 different databases.

(DB1.table1 -> DWH.table and DB2.table1 -> DWH.table)

Now, to do this in real time, I have 100 different databases; each has its own copy of the table, and all of them need to be loaded into a single table in the DWH.

Do I need to create 100 different processors and configure each one to connect to a different database?

Or is there an alternative way to deal with 100 databases?

If there is an alternative, can someone help me understand the solution, ideally with a sample XML (template) file? Much appreciated.

I am also looking for your suggestions on how to do the same CDC for 15 other tables across the 100 databases.


6 REPLIES


@SandyClouds Since CaptureChangeMySQL does not accept incoming FlowFiles, you would have to create a separate CaptureChangeMySQL flow for each database. Creating 100s of flows by hand would be exhausting, so in a normal NiFi environment I would automate the creation of these flows.

 

In some of the CDC demos I am working on now, I am using CDP Public Cloud DataFlow. I am able to parameterize my flows and create flow definitions that are very dynamic, then deploy them without touching NiFi or the NiFi API, changing only the parameters. This would be an interesting approach to solving for 100s of data flows.
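For example, the flow definition references parameters instead of hard-coded connection details, so only the parameter values change per deployment. A rough sketch against the CaptureChangeMySQL properties (property names as listed in the 1.x processor documentation, and the #{...} parameter names are just ones I made up for illustration):

MySQL Hosts = #{cdc.mysql.host}
Username = #{cdc.mysql.user}
Password = #{cdc.mysql.password}
Database/Schema Name Pattern = #{cdc.mysql.database}
Table Name Pattern = table1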

 

Another approach I use to begin CDC conversations without actually using "CDC" is ExecuteSQL, which does accept incoming FlowFiles. I use it to grab an entire database and all of its tables, put all the data in Kafka, and all the schemas in Schema Registry. This lets me start building deeper CDC logic around the same data, including using the same Kafka topic per table for updates from other systems, and ultimately deciding how to handle deletes.

 

That said, the complications of creating and operating these types of flows only surface once you begin capturing data. To me, CDC is a journey, not a destination, and you can only start down that path by putting the above concepts in place, whether that is CaptureChangeMySQL or ExecuteSQL with deeper logic for handling changes to the source database.

 

Expert Contributor

Hi @steven-matison, thanks very much for your answer and insights.

Regarding the ExecuteSQL solution: I had previously tried QueryDatabaseTable to fetch incremental records, and everything was fine until I realized our upstream does hard deletes, which are really hard to sync or keep track of. I guess ExecuteSQL would land me in the same place. CDC is working perfectly on my 2 test DBs.

 

Coming to CDP Public Cloud: our startup is looking for open-source solutions and doing this on-premises.

Do we have any option to parameterize CDC flows in NiFi itself? (I am also a newbie to NiFi.)

Once I build the flow for a single table from the 100 databases, I need to do the same (a similar flow) for 12 other tables.

 

Once again Thanks much for your response ✌🏻

 


@SandyClouds Of course you can. You will need to use the Toolkit CLI:

 

https://nifi.apache.org/docs/nifi-docs/html/toolkit-guide.html

 

Of course you could do all of this manually, but creating 100 flows one at a time would be a nightmare to create and manage.
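To make that concrete, here is a rough sketch of how the Toolkit CLI could be scripted to stamp out one flow per source database. It assumes the PoC flow is saved as a versioned flow in NiFi Registry and that its CaptureChangeMySQL properties reference parameters such as #{cdc.mysql.host}. All names, IDs, and the dbs.txt file below are placeholders, and exact CLI command/option names vary by NiFi version, so check ./bin/cli.sh help against the toolkit guide linked above.

#!/usr/bin/env bash
# Sketch only: stamps out one CDC process group per source database.
# dbs.txt holds one "name host port" triple per line for the 100 source databases.
NIFI_URL=http://nifi-host:8080   # placeholder: your NiFi API address
BUCKET_ID=...                    # placeholder: NiFi Registry bucket holding the PoC flow
FLOW_ID=...                      # placeholder: versioned flow id of the PoC flow

while read -r NAME HOST PORT; do
  # 1) Parameter context with this database's connection details
  CTX_ID=$(./bin/cli.sh nifi create-param-context -u "$NIFI_URL" -pcn "cdc-$NAME")
  ./bin/cli.sh nifi set-param -u "$NIFI_URL" -pcid "$CTX_ID" -pn cdc.mysql.host -pv "$HOST:$PORT"
  ./bin/cli.sh nifi set-param -u "$NIFI_URL" -pcid "$CTX_ID" -pn cdc.mysql.user -pv nifi_cdc

  # 2) Import the versioned PoC flow from NiFi Registry as a new process group
  PG_ID=$(./bin/cli.sh nifi pg-import -u "$NIFI_URL" -b "$BUCKET_ID" -f "$FLOW_ID" -fv 1)

  # 3) Bind the parameter context to the new group and start it
  ./bin/cli.sh nifi pg-set-param-context -u "$NIFI_URL" -pgid "$PG_ID" -pcid "$CTX_ID"
  ./bin/cli.sh nifi pg-start -u "$NIFI_URL" -pgid "$PG_ID"
done < dbs.txt

The same loop can then be pointed at a different versioned flow for each of the other tables.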

Expert Contributor

Thanks @steven-matison, I am exploring the CLI now.

As an extension to my scenario above: how do I know how many NiFi instances I need?

My sole purpose is to connect to 100 different MySQL databases and apply the CDC changes to the 101st DB (the DWH). The volume of new inserts/updates/deletes is minimal (around 5k per day across all 100 instances combined), and there is not much transformation; just fetch and push as-is. @steven-matison Currently I have NiFi installed on the same server where the DWH MySQL runs (it is a big machine with 64 GB RAM and 400 GB storage).

Expert Contributor

Thank you @steven-matison, this is really helpful!

If you allow me, I have a different question regarding NiFi's CaptureChangeMySQL.

Should the username configured in this processor be one with the highest privileges / the root user?

It is not mentioned anywhere in the help section (https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-cdc-mysql-nar/1.5.0/org.apach... ).

 

When I provide credentials for a user who has GRANT ALL PRIVILEGES ON *.*, the processor is able to access the CDC binlogs.

But I get permission denied when trying with a user who has only read permissions on *.*.

 

Since the processor does not write anything, I thought read permissions alone would be enough, but that does not seem to be the case. Can you please shed some light on this?
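From what I can tell so far, MySQL gates binlog access behind the replication privileges rather than plain SELECT, which would explain the difference above. The grants I am planning to test look like this (user name, host pattern, and password are placeholders; the SELECT grant is only needed if other parts of the flow also query table data or metadata):

# run on each source MySQL instance as an administrative user
mysql -u root -p <<'SQL'
CREATE USER 'nifi_cdc'@'%' IDENTIFIED BY 'change-me';
-- binlog (CDC) reads need the replication privileges, not just SELECT
GRANT REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'nifi_cdc'@'%';
-- optional: only if other processors in the flow query table data/metadata
GRANT SELECT ON *.* TO 'nifi_cdc'@'%';
FLUSH PRIVILEGES;
SQL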