Support Questions

Find answers, ask questions, and share your expertise

Nifi QueryDatabaseTable processor leads to data loss

avatar
Explorer

While running the querydatabase processor in NiFi some records are getting lost.

I am try to pull the data from Oracle DB and the maximum-column-value is a created Date of the date datatype. There is no timestamp attached to that(23-JUN-18) . While triggering the job every 5 mins to achieve near-real time ingestion some records are getting lost. I am not sure whether this is happening due to time stamp not present in the table OR whether this can be a synchronization issue like while querying the DB if new records got inserted.

1 ACCEPTED SOLUTION

avatar
Master Guru

In general, QueryDatabaseTable (QDT) must be scheduled at a rate commensurate with the values in the Maximum Value Column. Using a Maximum Value Column of type Date (with no timestamp) means you should only check for new values (i.e. schedule the QDT processor) each day, not every 5 minutes. Otherwise how would the processor know to grab rows with the current day that have come in since the last time it checked? It only looks at the date and (hypothetically) would either grab all the data with that value for the Date, leading to duplicate data, or (as it actually behaves) would look for data with a value of the next day, which it won't find until the following day, also leading to data loss.

Possible workarounds are to schedule QDT to run once a day, or add/alter a column to include timestamp and schedule QDT accordingly, or use a different column (possibly via a DB view) that is always increasing for each new row as the Maximum Value Column.

Hopefully someday there will also be an Oracle CDC processor (likely leveraging LogMiner) to fetch changes in near-real-time, there may be a way to use QDT and Oracle/LogMiner tables to emulate this capability, if the changes are available in a table with the same qualities (i.e. a Maximum Value Column that is always increasing w.r.t. to the Run Schedule of QDT).

View solution in original post

2 REPLIES 2

avatar
Master Guru

In general, QueryDatabaseTable (QDT) must be scheduled at a rate commensurate with the values in the Maximum Value Column. Using a Maximum Value Column of type Date (with no timestamp) means you should only check for new values (i.e. schedule the QDT processor) each day, not every 5 minutes. Otherwise how would the processor know to grab rows with the current day that have come in since the last time it checked? It only looks at the date and (hypothetically) would either grab all the data with that value for the Date, leading to duplicate data, or (as it actually behaves) would look for data with a value of the next day, which it won't find until the following day, also leading to data loss.

Possible workarounds are to schedule QDT to run once a day, or add/alter a column to include timestamp and schedule QDT accordingly, or use a different column (possibly via a DB view) that is always increasing for each new row as the Maximum Value Column.

Hopefully someday there will also be an Oracle CDC processor (likely leveraging LogMiner) to fetch changes in near-real-time, there may be a way to use QDT and Oracle/LogMiner tables to emulate this capability, if the changes are available in a table with the same qualities (i.e. a Maximum Value Column that is always increasing w.r.t. to the Run Schedule of QDT).

avatar
Explorer

Thanks Matt.. your explanation does make sense.