Created on 03-26-2024 03:40 AM - edited 03-26-2024 03:41 AM
Hello,
I am in the process of transferring data from an Oracle database to HDFS in Parquet format. The workflow, shown in the attached screenshot, uses the GenerateTableFetch processor to ingest the data in pages. ExecuteSQL then runs the generated queries, UpdateAttribute adds an attribute, and QueryRecord introduces a new column in the flow files.
The source table contains approximately 22 million records. In GenerateTableFetch, the 'Date' column is set as the Maximum-value Column and the Partition Size is configured as one million rows. With this setup, all 22 million records were transferred and stored in HDFS.
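For context, my understanding is that with a Maximum-value Column and a Partition Size configured, GenerateTableFetch emits paged queries roughly of the shape sketched below; SRC_TABLE, DATE_COL, and the timestamp literals are placeholders rather than our real names and values, and the exact row-limiting syntax depends on the configured Database Type and NiFi version:

```sql
-- Rough sketch (not the exact output) of the paged queries GenerateTableFetch produces.
-- SRC_TABLE, DATE_COL, and the timestamp literals are placeholders.

-- Page 1 of the current run, bounded by the maximum value captured at generation time:
SELECT *
FROM   src_table
WHERE  date_col <= TIMESTAMP '2024-03-25 23:59:59'
ORDER  BY date_col
OFFSET 0 ROWS FETCH NEXT 1000000 ROWS ONLY;

-- Page 2 of the same run:
SELECT *
FROM   src_table
WHERE  date_col <= TIMESTAMP '2024-03-25 23:59:59'
ORDER  BY date_col
OFFSET 1000000 ROWS FETCH NEXT 1000000 ROWS ONLY;

-- A later run fetches only rows strictly greater than the stored maximum:
SELECT *
FROM   src_table
WHERE  date_col >  TIMESTAMP '2024-03-25 23:59:59'
       AND date_col <= TIMESTAMP '2024-03-26 01:00:00'
ORDER  BY date_col
OFFSET 0 ROWS FETCH NEXT 1000000 ROWS ONLY;
```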
However, data quality checks revealed missing rows and duplicate records in the HDFS copy, while the source database holds the correct, duplicate-free record count.
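For illustration, discrepancies like these can be surfaced with queries along the following lines, sketched here with placeholder names (TARGET_PARQUET for an external Hive/Impala table defined over the HDFS directory, ID for the table's primary key):

```sql
-- Sketch of checks against the landed Parquet data; TARGET_PARQUET and ID are placeholders.

-- Keys that landed more than once:
SELECT id, COUNT(*) AS copies
FROM   target_parquet
GROUP  BY id
HAVING COUNT(*) > 1;

-- Overall counts to compare against the source (Oracle reports ~22 million rows):
SELECT COUNT(*)            AS total_rows,
       COUNT(DISTINCT id)  AS distinct_keys
FROM   target_parquet;
```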
Could you assist in pinpointing the potential reasons for these discrepancies?
Created 03-26-2024 07:33 AM
Welcome to the community @arbenosm.
I tried to find some resources for you to look over while waiting for an expert to respond, but didn't see any exact matches. I did see a couple of mentions stating that you should choose a timestamp column in your Oracle table that accurately reflects updates, so that you only fetch data that has changed since the last successful run. Hopefully that is helpful. Otherwise, maybe @steven-matison or @MattWho may have some ideas.
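For example, a quick check like the sketch below (SRC_TABLE and DATE_COL are placeholders for your actual table and candidate column; Oracle 12c+ row-limiting syntax) would show whether that column contains a lot of tied values:

```sql
-- How heavily duplicated are values in the candidate Maximum-value Column?
-- SRC_TABLE and DATE_COL are placeholders.
SELECT date_col,
       COUNT(*) AS rows_sharing_value
FROM   src_table
GROUP  BY date_col
HAVING COUNT(*) > 1
ORDER  BY rows_sharing_value DESC
FETCH  FIRST 20 ROWS ONLY;
```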
Created 03-26-2024 02:41 PM
The 'Date' column in our table is indeed of the TIMESTAMP data type.