Summary
-------
A data processing task (a Spark job) has 3 sub-stages: stage1, stage2, stage3. Each stage processes the data by reading rules dynamically from a metadata store. Input data is fed to stage1, the output of stage1 is the input to stage2, and the output of stage2 is the input to stage3. The rules might change, and in that case we need to recompute the processing. But since each stage is time-consuming, if the metadata for stage2 changes, we want to trigger the computation from stage2 and not from stage1.
Metadata
--------
Rules are created by users and stored as metadata. There are separate rules for stage1, stage2 and stage3. The metadata might reside in an Oracle table.
Data
----
Data is ingested into Hadoop. On completion of the data load, the load details are inserted into a tracking table with status = 'COMPLETED'.
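To make the tracking-table idea concrete, here is a minimal sketch of the lookup the processing engine would do. Everything named here is an assumption for illustration: the table name `load_tracking`, the columns `load_id` and `data_path`, and the use of an in-memory SQLite database as a stand-in for wherever the tracking table really lives.

```python
import sqlite3


def pending_loads(conn):
    """Return (load_id, data_path) for every load marked COMPLETED."""
    cursor = conn.execute(
        "SELECT load_id, data_path FROM load_tracking "
        "WHERE status = 'COMPLETED' ORDER BY load_id"
    )
    return cursor.fetchall()


# Hypothetical tracking table, populated with two example loads.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE load_tracking ("
    "load_id INTEGER PRIMARY KEY, data_path TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO load_tracking VALUES (?, ?, ?)",
    [
        (1, "/data/in/load_1", "COMPLETED"),
        (2, "/data/in/load_2", "RUNNING"),
    ],
)
conn.commit()

# Only the finished load is picked up for processing.
ready = pending_loads(conn)
```

In a real setup the same query would run against the actual tracking table, and the returned paths would be handed to the Spark job as its input.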
Processing
----------
The data processing is a Spark job with 3 sub-stages: stage1, stage2, stage3. The processing engine reads the tracking table and, once status = 'COMPLETED', reads the relevant data listed in the tracking table and starts the processing. On a rule change, this processing engine needs to be triggered automatically from the appropriate stage. E.g. if the stage2 metadata changes, the computation should be triggered and only stage2 and stage3 should be processed.
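One possible way to decide where to restart is to keep a fingerprint of each stage's rules from the last successful run and compare it with the current rules. The sketch below is only an illustration under stated assumptions: the rules are represented as JSON-serialisable dictionaries, the names `fingerprint`, `first_changed_stage` and `stages_to_run` are hypothetical, and it assumes the output of each stage is persisted (for example written to HDFS) so that stage2 can read the stored stage1 output instead of recomputing it.

```python
import hashlib
import json

STAGES = ["stage1", "stage2", "stage3"]


def fingerprint(rules):
    """Stable hash of one stage's rules (any JSON-serialisable structure)."""
    payload = json.dumps(rules, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def first_changed_stage(current_rules, last_fingerprints):
    """Return the earliest stage whose rules differ from the last run, or None."""
    for stage in STAGES:
        if fingerprint(current_rules[stage]) != last_fingerprints.get(stage):
            return stage
    return None


def stages_to_run(current_rules, last_fingerprints):
    """Return the changed stage and every stage downstream of it."""
    start = first_changed_stage(current_rules, last_fingerprints)
    if start is None:
        return []
    return STAGES[STAGES.index(start):]


# Hypothetical rules as they were at the last successful run.
rules_v1 = {
    "stage1": {"drop_nulls": True},
    "stage2": {"threshold": 10},
    "stage3": {"group_by": "region"},
}
last_run = {stage: fingerprint(rules) for stage, rules in rules_v1.items()}

# A user edits only the stage2 rule.
rules_v2 = dict(rules_v1, stage2={"threshold": 20})
plan = stages_to_run(rules_v2, last_run)
```

With the example data, `plan` contains stage2 and stage3 only. A change to stage1 would return all three stages, because everything downstream of a changed stage has to be recomputed. The fingerprints would need to be stored alongside the persisted stage outputs after each successful run.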