Created on 03-26-2024 03:40 AM - edited 03-26-2024 03:41 AM
Hello,
I am in the process of transferring data from an Oracle database to HDFS in Parquet format. The workflow, shown in the attached screenshot, uses the GenerateTableFetch processor to ingest the data in pages. ExecuteSQL then runs the generated queries, UpdateAttribute adds an attribute, and QueryRecord introduces a new column in the flow files.
The source table contains approximately 22 million records. In GenerateTableFetch, the 'Date' column is set as the Maximum-value Column and the Partition Size is configured as one million rows. With this setup, all 22 million records were transferred and stored in HDFS.
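For context, my understanding is that with a Maximum-value Column and a Partition Size configured, GenerateTableFetch emits paged queries roughly of the shape sketched below; SRC_TABLE, DATE_COL, and the timestamp literals are placeholders rather than our real names and values, and the exact row-limiting syntax depends on the configured Database Type and NiFi version:

```sql
-- Rough sketch (not the exact output) of the paged queries GenerateTableFetch produces.
-- SRC_TABLE, DATE_COL, and the timestamp literals are placeholders.

-- Page 1 of the current run, bounded by the maximum value captured at generation time:
SELECT *
FROM   src_table
WHERE  date_col <= TIMESTAMP '2024-03-25 23:59:59'
ORDER  BY date_col
OFFSET 0 ROWS FETCH NEXT 1000000 ROWS ONLY;

-- Page 2 of the same run:
SELECT *
FROM   src_table
WHERE  date_col <= TIMESTAMP '2024-03-25 23:59:59'
ORDER  BY date_col
OFFSET 1000000 ROWS FETCH NEXT 1000000 ROWS ONLY;

-- A later run fetches only rows strictly greater than the stored maximum:
SELECT *
FROM   src_table
WHERE  date_col >  TIMESTAMP '2024-03-25 23:59:59'
       AND date_col <= TIMESTAMP '2024-03-26 01:00:00'
ORDER  BY date_col
OFFSET 0 ROWS FETCH NEXT 1000000 ROWS ONLY;
```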
However, data quality checks revealed missing rows and duplicate records in the HDFS copy, while the source database holds the correct, duplicate-free record count.
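For illustration, discrepancies like these can be surfaced with queries along the following lines, sketched here with placeholder names (TARGET_PARQUET for an external Hive/Impala table defined over the HDFS directory, ID for the table's primary key):

```sql
-- Sketch of checks against the landed Parquet data; TARGET_PARQUET and ID are placeholders.

-- Keys that landed more than once:
SELECT id, COUNT(*) AS copies
FROM   target_parquet
GROUP  BY id
HAVING COUNT(*) > 1;

-- Overall counts to compare against the source (Oracle reports ~22 million rows):
SELECT COUNT(*)            AS total_rows,
       COUNT(DISTINCT id)  AS distinct_keys
FROM   target_parquet;
```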
Could you assist in pinpointing the potential reasons for these discrepancies?
Created 03-26-2024 07:33 AM
Welcome to the community @arbenosm.
I tried to find some resources for you to look over while waiting for an expert to respond, but didn't see any exact matches. I did see a couple of mentions stating that you should choose a timestamp column in your Oracle table that accurately reflects updates, so that you only fetch data that has changed since the last successful run. Hopefully that is helpful. Otherwise, maybe @steven-matison or @MattWho may have some ideas.
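For example, a quick check like the sketch below (SRC_TABLE and DATE_COL are placeholders for your actual table and candidate column; Oracle 12c+ row-limiting syntax) would show whether that column contains a lot of tied values:

```sql
-- How heavily duplicated are values in the candidate Maximum-value Column?
-- SRC_TABLE and DATE_COL are placeholders.
SELECT date_col,
       COUNT(*) AS rows_sharing_value
FROM   src_table
GROUP  BY date_col
HAVING COUNT(*) > 1
ORDER  BY rows_sharing_value DESC
FETCH  FIRST 20 ROWS ONLY;
```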
Created 03-26-2024 02:41 PM
The 'Date' column in our table is indeed of the TIMESTAMP data type.