
Where to store Apache Spark batch ETL job properties



I want to build a batch-based ETL pipeline that extracts from an RDBMS (SQL Server) using Apache Spark. My Spark cluster runs as part of a Cloudera deployment.
My question is: where should I store the ETL job's watermark (for example, the maximum TIMESTAMP processed), so that the next batch run fetches only the records with a newer timestamp?

Should I use a Hive table, or is there a better way to store this value so it can be read by subsequent jobs?
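To make the pattern in the question concrete, here is a minimal runnable sketch of the watermark idea. It uses an in-memory sqlite3 database as a stand-in for SQL Server, and all table and column names (`orders`, `last_modified`) are illustrative assumptions, not from the original post:

```python
import sqlite3

# Stand-in for the source RDBMS; in the real job this would be SQL Server
# accessed via Spark's JDBC reader. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, last_modified TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2024-01-01T10:00:00"),
                  (2, "2024-01-02T09:30:00"),
                  (3, "2024-01-03T08:15:00")])

# Watermark saved by the previous batch run.
watermark = "2024-01-01T10:00:00"

# The next batch fetches only rows newer than the stored watermark.
rows = conn.execute(
    "SELECT id, last_modified FROM orders WHERE last_modified > ?",
    (watermark,)).fetchall()
print(rows)  # only ids 2 and 3

# After a successful run, advance the watermark to the new maximum,
# so the following batch continues from here.
watermark = max(ts for _, ts in rows)
print(watermark)
```

In Spark itself, the equivalent filter would typically be pushed down to SQL Server through a JDBC subquery rather than executed client-side.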


Master Collaborator

Hi @KhASQ 


For watermarking, use any framework or database to persist the value once the job completes successfully. If you are consuming from Kafka, you can store the Kafka offsets (its form of watermark) in Kafka itself. Otherwise, store the watermark in an RDBMS table or an HBase table.
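The key point in this advice is updating the watermark only after the job succeeds, so a failed run re-reads the same window instead of skipping records. A minimal sketch of such a control table, again using sqlite3 as a stand-in for the RDBMS (the `etl_watermarks` table and `orders_etl` job name are assumptions for illustration):

```python
import sqlite3

# Hypothetical control table; in production this could live in SQL Server,
# a Hive table, or HBase, as suggested above.
meta = sqlite3.connect(":memory:")
meta.execute(
    "CREATE TABLE etl_watermarks (job_name TEXT PRIMARY KEY, max_ts TEXT)")

def load_watermark(job, default="1970-01-01T00:00:00"):
    """Return the stored watermark for a job, or a default on the first run."""
    row = meta.execute(
        "SELECT max_ts FROM etl_watermarks WHERE job_name = ?",
        (job,)).fetchone()
    return row[0] if row else default

def save_watermark(job, new_ts):
    """Upsert the watermark; call this only after the batch succeeds."""
    meta.execute(
        "INSERT INTO etl_watermarks (job_name, max_ts) VALUES (?, ?) "
        "ON CONFLICT(job_name) DO UPDATE SET max_ts = excluded.max_ts",
        (job, new_ts))
    meta.commit()

print(load_watermark("orders_etl"))   # default on the very first run
save_watermark("orders_etl", "2024-01-03T08:15:00")
print(load_watermark("orders_etl"))
```

Keeping the watermark in a transactional store like this makes the "read watermark, extract delta, commit new watermark" cycle easy to reason about across batch runs.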