
Where to store Apache Spark batch ETL job properties

Explorer

Hello


I want to build a batch-based ETL pipeline from an RDBMS (SQL Server) using Apache Spark. My Spark cluster runs as part of a Cloudera deployment.
My question is: where should I store the ETL job watermark (for example, the maximum TIMESTAMP processed so far), so that the next batch run picks up only the records with a newer timestamp?

Should I use a Hive table, or is there a better approach for storing this value so it can be read by subsequent jobs?
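
To make the question concrete, here is roughly the pattern I have in mind, as a minimal PySpark sketch. The etl_watermarks and orders tables, the orders_etl job name, and the connection details are all placeholders I made up:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the watermark left by the previous run (here: a Hive table).
last_ts = (spark.table("etl_watermarks")
           .filter(F.col("job_name") == "orders_etl")
           .agg(F.max("last_ts"))
           .first()[0])
if last_ts is None:
    last_ts = "1900-01-01 00:00:00"  # first run: load everything

# Pull only rows newer than the watermark from SQL Server over JDBC.
# Interpolating last_ts into the SQL is fine for a sketch; real code
# should validate or parameterize the predicate.
incremental = (spark.read.format("jdbc")
               .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
               .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
               .option("dbtable",
                       f"(SELECT * FROM dbo.orders "
                       f"WHERE updated_at > '{last_ts}') AS t")
               .option("user", "<user>")
               .option("password", "<password>")
               .load())
```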

1 REPLY

Master Collaborator

Hi @KhASQ 

 

For watermarking, use any framework or database where you can update the value once the job completes successfully. If your source is Kafka, you can keep the watermark in Kafka itself (committed consumer offsets). Otherwise, persist it in an RDBMS or an HBase table.
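
As a rough sketch (not a definitive recipe), the update step against an RDBMS could look like the following. It continues from the sketch in the question, so `incremental` is the DataFrame just processed; the dbo.etl_watermarks table and connection details are again only examples:

```python
from pyspark.sql import functions as F

# Highest timestamp seen in this batch.
new_ts = incremental.agg(F.max("updated_at")).first()[0]

if new_ts is not None:
    wm = spark.createDataFrame([("orders_etl", new_ts)],
                               ["job_name", "last_ts"])
    (wm.write.format("jdbc")
       .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")
       .option("dbtable", "dbo.etl_watermarks")
       .option("user", "<user>")
       .option("password", "<password>")
       # truncate + overwrite keeps the table but replaces its rows;
       # safe here only because this control table holds one job's row.
       .option("truncate", "true")
       .mode("overwrite")
       .save())
```

The key point is to update the watermark only after the batch output has been written successfully; if the job fails, the next run simply retries from the old watermark.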