Summary
-------
A data processing task (Spark job) with 3 sub-stages: stage1, stage2, stage3, which process the data by reading rules dynamically from a metadata store. Input data is fed to stage1, the output of which is input to stage2, and the output of stage2 is input to stage3. The rules might change, and in that case we need to recompute the processing. But since each stage is time consuming, if the metadata for stage2 changes we want to trigger the computation from stage2, not from stage1.

Metadata
--------
Rules are created by users and stored as metadata. There are separate rules for stage1, stage2 and stage3. The metadata might reside in an Oracle table.

Data
----
Data is ingested into Hadoop. On completion of the data load, the load details are inserted into a tracking table with status = 'COMPLETED'.

Processing
----------
The data processing stage is a Spark job with 3 sub-stages: stage1, stage2, stage3. The processing engine reads the tracking table, and once status = 'COMPLETED' it reads the relevant data as referenced in the tracking table and starts the processing. Now, on a rule change, this processing engine needs to be triggered automatically from the appropriate stage. E.g. if the stage2 metadata changes, the computation should be triggered and only stage2 and stage3 should be processed.

How can this automatic triggering be achieved?
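One way to decide where to restart is to track, per stage, when its rules were last modified and when it last ran, then start from the earliest stage whose rules are newer than its last run. The sketch below illustrates that selection logic only; the timestamp sources (`rule_updated_at`, `last_run_at`) are hypothetical stand-ins for whatever the Oracle metadata store and a run-history table would provide.

```python
from datetime import datetime

STAGES = ["stage1", "stage2", "stage3"]

def first_stale_stage(rule_updated_at, last_run_at):
    """Return the earliest stage whose rules changed after its last run,
    or None if everything is up to date. Because each stage feeds the
    next, every downstream stage must be recomputed as well."""
    for stage in STAGES:
        if rule_updated_at[stage] > last_run_at[stage]:
            return stage
    return None

# Example: stage2 rules changed after the last run,
# so stage2 and stage3 must be recomputed.
rule_updated_at = {
    "stage1": datetime(2023, 1, 1),
    "stage2": datetime(2023, 1, 5),
    "stage3": datetime(2023, 1, 1),
}
last_run_at = {s: datetime(2023, 1, 3) for s in STAGES}

start = first_stale_stage(rule_updated_at, last_run_at)
stages_to_run = STAGES[STAGES.index(start):] if start else []
```

A scheduled poller (e.g. an Oozie coordinator) could run this check periodically, or the metadata store could bump a version/timestamp column on rule edits that the check compares against.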
Requirement: Trigger a Spark job from the UI by a user action (say, a submit button click). Once the Spark job is finished, a summary of the status has to be displayed in the UI.

Design approach:

1. Once the user initiates a job run by clicking the submit button in the UI, we will insert a row into an Impala queue table using Impala JDBC. The simplified structure of the queue table is as follows:

   JOB_RUN_QUEUE (REQUEST_ID, STATUS, INPUT_PARAM_1, INPUT_PARAM_2, SUMMARY)

   The initial request will have STATUS = 'SUBMIT'.

2. Oozie will be configured to orchestrate the request handling and Spark job execution. Once Oozie finds an entry in the queue table JOB_RUN_QUEUE with STATUS = 'SUBMIT', it will pull the arguments from the queue table and trigger the Spark job, updating the status in the queue table to 'IN PROGRESS'. Upon successful completion it will update the summary and status in the queue table; on failure it will update the status to 'FAILURE'.

3. The UI will read the data from the queue table and display it.

Questions:

1. Is there an alternative and better design approach?

2. Do I need a queue mechanism for the initial request, or can I leverage some built-in functionality?
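The submit/poll/finalize lifecycle described above can be sketched end to end. This is a minimal illustration only: an in-memory SQLite database stands in for the Impala table (the real system would use Impala JDBC), and `run_spark_job` is a hypothetical placeholder for the actual spark-submit launched by Oozie. Table and column names follow the question.

```python
import sqlite3

# SQLite stands in for Impala here so the flow is runnable locally.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE JOB_RUN_QUEUE (
    REQUEST_ID    INTEGER PRIMARY KEY,
    STATUS        TEXT,
    INPUT_PARAM_1 TEXT,
    INPUT_PARAM_2 TEXT,
    SUMMARY       TEXT)""")

# Step 1 -- UI submit: insert the request with STATUS='SUBMIT'.
conn.execute("INSERT INTO JOB_RUN_QUEUE VALUES (1, 'SUBMIT', 'p1', 'p2', NULL)")

def run_spark_job(p1, p2):
    # Hypothetical stand-in for the real Spark job invocation.
    return f"processed {p1}/{p2}"

# Step 2 -- orchestrator poll: pick up a SUBMIT row, mark it IN PROGRESS,
# run the job, then record SUCCESS + summary or FAILURE.
def poll_once(conn):
    row = conn.execute(
        "SELECT REQUEST_ID, INPUT_PARAM_1, INPUT_PARAM_2 "
        "FROM JOB_RUN_QUEUE WHERE STATUS = 'SUBMIT'").fetchone()
    if row is None:
        return
    req_id, p1, p2 = row
    conn.execute("UPDATE JOB_RUN_QUEUE SET STATUS = 'IN PROGRESS' "
                 "WHERE REQUEST_ID = ?", (req_id,))
    try:
        summary = run_spark_job(p1, p2)
        conn.execute("UPDATE JOB_RUN_QUEUE SET STATUS = 'SUCCESS', SUMMARY = ? "
                     "WHERE REQUEST_ID = ?", (summary, req_id))
    except Exception:
        conn.execute("UPDATE JOB_RUN_QUEUE SET STATUS = 'FAILURE' "
                     "WHERE REQUEST_ID = ?", (req_id,))

poll_once(conn)

# Step 3 -- UI read-back: fetch status and summary for display.
status, summary = conn.execute(
    "SELECT STATUS, SUMMARY FROM JOB_RUN_QUEUE WHERE REQUEST_ID = 1").fetchone()
```

One design note this makes visible: the status column acts as a simple state machine (SUBMIT → IN PROGRESS → SUCCESS/FAILURE), so the UI only ever needs to poll one table, and a crashed run is detectable as a row stuck in IN PROGRESS.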