Support Questions

KhASQ · 06-24-2021

Hello

I want to build a batch-based ETL from an RDBMS (SQL Server) using Apache Spark. My Spark cluster runs as part of the Cloudera application.
My question is: where should I store the ETL job watermark, for example the maximum TIMESTAMP, so that the next batch run fetches only the records with a later timestamp?

Should I use a Hive table, or is there a better approach to storing this data so that it can be used by subsequent jobs?

RangaReddy · 06-28-2021

Hi @KhASQ

For watermarking, you can use any framework or database; update the stored value only once the job has completed successfully. If you are using Kafka, you can store the Kafka-related watermark in Kafka itself. If you are using something other than Kafka, choose any RDBMS or an HBase table.
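As a rough illustration of the RDBMS option, here is a minimal Python sketch of the read-watermark, process, then update-on-success pattern. It is not a Spark job: `sqlite3` stands in for whatever database you choose, a plain list of tuples stands in for the SQL Server table, and the table name `etl_watermark`, the job name, and all function names are hypothetical.

```python
import sqlite3

# Assumption: timestamps are fixed-format strings ("YYYY-MM-DD HH:MM:SS"),
# so plain string comparison orders them correctly.
DEFAULT_TS = "1970-01-01 00:00:00"


def init_store(conn):
    # Hypothetical watermark table: one row per ETL job.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_watermark ("
        "job_name TEXT PRIMARY KEY, last_ts TEXT NOT NULL)"
    )


def read_watermark(conn, job_name):
    row = conn.execute(
        "SELECT last_ts FROM etl_watermark WHERE job_name = ?", (job_name,)
    ).fetchone()
    return row[0] if row else DEFAULT_TS


def write_watermark(conn, job_name, new_ts):
    # "with conn" commits on success and rolls back on error.
    # INSERT OR REPLACE is SQLite syntax; other databases use MERGE or similar.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO etl_watermark (job_name, last_ts) "
            "VALUES (?, ?)",
            (job_name, new_ts),
        )


def run_batch(conn, job_name, source_rows):
    """source_rows: list of (id, updated_at) tuples standing in for the source table."""
    last_ts = read_watermark(conn, job_name)
    # Select only records newer than the stored watermark.
    batch = [row for row in source_rows if row[1] > last_ts]
    # ... load the batch into the target here; if this step fails,
    # the watermark below is never advanced and the batch is retried next run.
    if batch:
        write_watermark(conn, job_name, max(row[1] for row in batch))
    return batch
```

In an actual Spark job, the filter on the timestamp column would typically be pushed into the JDBC read (for example as a subquery passed to the `dbtable` option), and the watermark would be written only after the target write has succeeded.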

Cloudera Community


Where to store Apache Spark batch ETL job properties