If a YARN job fails because the cluster crashes, how do we find out the job's status? What is the best way to prepare for this?
What disaster recovery (DR) best practices should we follow to maintain the integrity of Oozie, Spark, Hive, and HBase jobs?
We want to know whether each job completed successfully or failed, along with any other information available:
Were the writes consistent?
Was any data lost? Can, and should, each job be rerun?
We don't want to lose data, corrupt data, or rerun jobs that don't need to be rerun. Rerunning some jobs may be a problem because not all of them are idempotent.
Data ingest jobs can't be rerun, but some analytics jobs can be.
Anything cloud-specific would be helpful as well.