Created on 07-09-2021 05:48 AM
		
					
Hello,

We are currently trying to fetch changes from SQL Server change tracking tables and push them into a Kafka topic with NiFi.
The flow is basically working, but we would like some feedback on how to make it more reliable and better optimized.

Here is the SQL used to fetch changes from MY_TABLE starting at offset 37979520:
   
   
 SELECT 
	CT.SYS_CHANGE_VERSION,
	CASE CT.SYS_CHANGE_OPERATION 
		WHEN 'D' THEN CT.T_PK 
		ELSE NULL 
	END as DeletedPK,
	CHANGE_TRACKING_CURRENT_VERSION() as LastTransactionId,
	T.*
FROM CHANGETABLE(CHANGES MY_TABLE, 37979520) CT 
LEFT JOIN MY_TABLE T on CT.T_PK = T.T_PK 
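
For reference, this query assumes change tracking is already enabled on the database and on MY_TABLE, with something like the following (the database name is just a placeholder):

ALTER DATABASE MyDatabase
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE MY_TABLE
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = OFF);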
   
   
Here is the NiFi implementation:

(bigger picture in attachment)

To avoid threading issues, all processors are scheduled on the primary node only, with a single concurrent task.
   
 Here are the steps 
   
 1 - GenerateFlowFile 
To start the process, I generate a FlowFile with a custom property "increment.value" set to the cache key.
   
2 - FetchDistributedMapCache
I fetch the distributed cache entry with identifier ${increment.value} and put the result in the "stored.state" attribute.
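
Roughly, the processor is configured like this (the cache service is simply our DistributedMapCacheClientService instance):

FetchDistributedMapCache
	Cache Entry Identifier       : ${increment.value}
	Distributed Cache Service    : DistributedMapCacheClientService
	Put Cache Value In Attribute : stored.state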
   
 3 - UpdateAttribute 
To handle the first run, I update "stored.state" with the following expression:
   
 ${stored.state:isNull():ifElse(0, ${stored.state})} 
   
=> If the cache is empty, I start from 0.
   
 4 - ExecuteSQL 
SQL statement to fetch data from the current offset:
   
   
 SELECT 
	CT.SYS_CHANGE_VERSION,
	CASE CT.SYS_CHANGE_OPERATION 
		WHEN 'D' THEN CT.T_PK 
		ELSE NULL 
	END as DeletedPK,
	CHANGE_TRACKING_CURRENT_VERSION() as LastTransactionId,
	T.*
FROM CHANGETABLE(CHANGES MY_TABLE, ${stored.state}) CT 
LEFT JOIN MY_TABLE T on CT.T_PK = T.T_PK 
   
   
5 - QueryRecord

"Include Zero Record FlowFiles" is set to false to avoid having to deal with empty incoming datasets.
The selectLastTransactionId relationship fetches the last offset with:

select max(LastTransactionId) as last_transaction_id from flowfile

The selectData relationship forwards the non-empty FlowFiles to Kafka:

select * from flowfile
   
   
6A - EvaluateJsonPath

I extract $.[0].last_transaction_id to forward the next offset to the PutDistributedMapCache processor.
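
Assuming a JSON record writer on the selectLastTransactionId relationship, the FlowFile content looks roughly like this (the version number is only illustrative), which is what the $.[0].last_transaction_id path reads:

[ { "last_transaction_id" : 37979620 } ]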
   
 6B - PublishKafkaRecord 
   
 7A - PutDistributedMapCache 
   
 I update the cache with the new offset. 
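
Since PutDistributedMapCache stores the FlowFile content as the cache value, EvaluateJsonPath presumably writes the extracted version into the FlowFile content. Roughly, the configuration is:

PutDistributedMapCache
	Cache Entry Identifier    : ${increment.value}
	Distributed Cache Service : DistributedMapCacheClientService
	Cache update strategy     : Replace if present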
   
 ------------------------ 
   
Here are some questions:

- Is there a better way to safely start the process? A GenerateFlowFile with a long Timer Driven Run Schedule is not ideal.

- Despite "Include Zero Record FlowFiles" being set to false on the QueryRecord, I still see activity for empty datasets coming out of the ExecuteSQL processor, which triggers many exceptions at the PutDistributedMapCache level. How is this possible?

- How can we make this flow more reliable? Is there a better way to orchestrate such a flow?

Many thanks!
      
						
					
Labels: Apache Kafka, Apache NiFi