Hello
I want to build a batch-based ETL pipeline from an RDBMS (SQL Server) using Apache Spark. My Spark cluster runs as part of a Cloudera deployment.
My question is: where should I store the ETL job's watermark (for example, the maximum TIMESTAMP seen in the last batch), so that the next run only picks up records with a newer timestamp?
Should I use a Hive table, or is there a better way to store this value so it can be read by subsequent jobs?
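For context, this is roughly the pattern I have in mind, a minimal sketch assuming the watermark lives in a one-row Hive table. All names here (`etl_meta.watermarks`, `orders_load`, `dbo.orders`, `modified_at`, `staging.orders`, the JDBC connection details) are placeholders I made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

object IncrementalLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sqlserver-incremental-load")
      .enableHiveSupport() // use the cluster's Hive metastore
      .getOrCreate()

    // 1. Read the watermark left behind by the previous run.
    //    Assumes a one-row table (job_name STRING, last_ts TIMESTAMP).
    val lastTs = spark.sql(
      "SELECT last_ts FROM etl_meta.watermarks WHERE job_name = 'orders_load'"
    ).head().getTimestamp(0)

    // 2. Pull only rows newer than the watermark from SQL Server over JDBC.
    //    The `query` option needs Spark 2.4+; on older versions wrap the
    //    SELECT in a subquery and pass it via `dbtable` instead.
    //    Interpolating the timestamp into the SQL string is fine for a
    //    sketch but should be parameterized properly in real code.
    val increment = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=sales")
      .option("user", "etl_user")
      .option("password", sys.env("SQLSERVER_PWD"))
      .option("query", s"SELECT * FROM dbo.orders WHERE modified_at > '$lastTs'")
      .load()

    // 3. Land the increment, then persist the new watermark for the next run.
    increment.write.mode("append").saveAsTable("staging.orders")

    val newTs = increment.agg(max("modified_at")).head().getTimestamp(0)
    if (newTs != null) {
      // Overwrites the whole one-row table; with multiple jobs sharing the
      // table this would need a partition or a merge instead.
      spark.sql(
        s"INSERT OVERWRITE TABLE etl_meta.watermarks " +
        s"SELECT 'orders_load', CAST('$newTs' AS TIMESTAMP)"
      )
    }
    spark.stop()
  }
}
```

This works, but updating a one-row Hive table on every run feels heavyweight, which is why I'm asking whether there is a more idiomatic place to keep this kind of job state.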