We are having trouble monitoring the workflows and jobs in our cluster effectively. We run many Cascading-based workflows (which typically run for many hours), and these workflows spawn multiple jobs. When one of these workflows or jobs fails (due to a data error or a code error), there are only 2 ways to catch it:

1. Constantly monitoring the Hue console for any failed workflows or jobs.
   a. Challenges here: somebody has to keep looking at Hue, which gets tedious and is prone to misses for jobs that run for 5 to 6 hours or more.
2. Depending on the status email (success/failure) that is sent out at the end of the workflow.
   a. Challenges here: with a lot of success emails being sent out, people tend to miss the one-off failure email.

This is proving to be a challenge, as we sometimes find out that a job has failed only a week later. Is there a way to have a custom dashboard view in Hue which shows only the failed jobs or workflows (say, for the past week)? Or are there any other ways to achieve effective monitoring (either through Hue or otherwise)?
Thanks for your reply. We see the following challenges with this:
1. The usability of this is pretty bad. Even if we click on "Failed", it shows the failed jobs for some time and then auto-refreshes back to 'all' jobs.
2. Somebody has to keep coming to this view and keep watching it.
3. This view also slows down as the number of jobs grows. If we have 100 workflows running daily, this view will hold several hundred jobs and hence becomes very slow. Because of this, we have been forced to set a limit on the number of rows shown in this view.
Is there a simple way of having a custom view on the workflow/jobs dashboard?
Any findings on this? I have the same problem monitoring many application jobs. Any ideas on creating a custom dashboard for failed jobs or workflows?
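One option outside Hue, offered as a rough sketch rather than a tested solution: if the workflows are submitted through Oozie (which is what Hue's workflow dashboard sits on top of), a small script could poll the Oozie web services API for workflows in FAILED or KILLED status and keep only those that ended in the past week. The host name `oozie-host`, the row limit of 200, the helper function names, and the assumption that `endTime` comes back as an RFC 1123 style string (for example `Fri, 02 Jan 2009 00:00:00 GMT`) are all assumptions here and should be checked against your own cluster and Oozie version.

```python
import json
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical Oozie server address; replace with your own.
OOZIE_URL = "http://oozie-host:11000/oozie"


def build_failed_jobs_url(base_url, statuses=("FAILED", "KILLED"), length=200):
    """Build the jobs-listing URL, filtered to workflow jobs in the given statuses."""
    job_filter = ";".join(f"status={s}" for s in statuses)
    query = urlencode({"jobtype": "wf", "filter": job_filter, "len": length})
    return f"{base_url}/v1/jobs?{query}"


def recent_failures(payload, now, days=7):
    """Return (id, appName, status, endTime) for workflows that ended within `days` of `now`."""
    cutoff = now - timedelta(days=days)
    failures = []
    for wf in payload.get("workflows", []):
        end_time = wf.get("endTime")
        if not end_time:
            continue  # no end time recorded, skip
        ended = parsedate_to_datetime(end_time)
        if ended.tzinfo is None:
            # Assumption: treat timestamps without a zone as UTC.
            ended = ended.replace(tzinfo=timezone.utc)
        if ended >= cutoff:
            failures.append((wf["id"], wf.get("appName"), wf["status"], ended))
    return failures


def fetch_recent_failures(base_url=OOZIE_URL, days=7):
    """Query Oozie once and return the failures from the last `days` days."""
    with urlopen(build_failed_jobs_url(base_url), timeout=30) as resp:
        payload = json.load(resp)
    return recent_failures(payload, datetime.now(timezone.utc), days)
```

Running `fetch_recent_failures()` from a scheduled job and sending a single email only when the returned list is non-empty would avoid both watching the dashboard and digging through success emails. A secured (Kerberos) cluster would need authentication added to the request, which this sketch does not cover.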