Created 04-03-2024 07:33 AM
We recently upgraded to CDE 1.20.3. After the upgrade we encountered issues related to resource allocation. I suspect that YuniKorn is not functioning properly, because jobs stay in a waiting state even when there are resources available to allocate to the Spark jobs.
Is there a YuniKorn UI available from CDE that we can use to easily monitor the YuniKorn pods?
Created 04-05-2024 12:26 AM
Hello @Ging
Thanks for engaging Cloudera Community. Based on the post, your team is seeing resource allocation issues, with jobs remaining in a Waiting state even though resources are available, and you believe the issue is linked to YuniKorn because it appeared after the upgrade to CDE v1.20.3.
To answer your question, please find the requested details below:
(I) YuniKorn UI: YuniKorn UI is available via CDE UI > Administration > CDE Service "Service Details" > Resource Scheduler
(II) For any suspected issue with YuniKorn, always capture the YuniKorn StateDump [1] while the issue is being observed. This is extremely important: the StateDump reflects the scheduler's state in real time, so a dump captured when the issue isn't occurring doesn't help.
(III) Collect the YuniKorn Scheduler & YuniKorn Admission Controller pod logs in the YuniKorn namespace while the issue is happening.
(IV) After capturing all of the above, restart the YuniKorn Scheduler and check whether the issue persists. If it does, engage Cloudera Support with the StateDump, pod logs & job event log showing the job stuck in Waiting.
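For steps (II) and (III), here is a minimal Python sketch of how the capture could be scripted. It is not a Cloudera-provided tool. It assumes the scheduler's REST port has already been forwarded to localhost:9080 (the upstream YuniKorn docs use `kubectl port-forward svc/yunikorn-service 9080:9080 -n yunikorn`; the service and namespace names in a CDE cluster may differ) and uses the `/ws/v1/fullstatedump` endpoint described in [1]. The helper names and the pod name placeholders are hypothetical.

```python
import urllib.request
from datetime import datetime, timezone

# Assumptions (not from this thread): the scheduler REST port is forwarded
# to localhost:9080 and the namespace is literally "yunikorn".
BASE_URL = "http://localhost:9080"
NAMESPACE = "yunikorn"


def state_dump_url(base_url: str = BASE_URL) -> str:
    """Return the full state dump endpoint described in reference [1]."""
    return base_url.rstrip("/") + "/ws/v1/fullstatedump"


def dump_filename(now: datetime) -> str:
    """Timestamped file name so the dump can be matched to the stuck job."""
    return "yunikorn-statedump-" + now.strftime("%Y%m%dT%H%M%SZ") + ".json"


def logs_command(pod: str, namespace: str = NAMESPACE) -> list[str]:
    """Build a kubectl command that prints timestamped logs for one pod."""
    return ["kubectl", "logs", "-n", namespace, pod, "--timestamps"]


def capture_state_dump(base_url: str = BASE_URL) -> str:
    """Fetch the state dump (run this while jobs are stuck) and save it."""
    req = urllib.request.Request(
        state_dump_url(base_url), headers={"accept": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()
    path = dump_filename(datetime.now(timezone.utc))
    with open(path, "wb") as fh:
        fh.write(body)
    return path


def main() -> None:
    print("saved", capture_state_dump())
    # Placeholder pod names; list the real ones with
    # `kubectl get pods -n yunikorn`.
    for pod in ("<scheduler-pod>", "<admission-controller-pod>"):
        print(" ".join(logs_command(pod)))
```

Call `main()` while the jobs are in Waiting, then run the printed `kubectl logs` commands with the real pod names and redirect the output to files for the support case.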
Hope the above answers your ask.
- Smarak
[1] https://yunikorn.apache.org/docs/1.3.0/user_guide/troubleshooting/#obtain-full-state-dump
Created 04-14-2024 11:47 PM
Hello @Ging
We hope the above post answers your question. We shall mark the post as Resolved now. If you have any further questions, feel free to comment & we shall get back to you.
- Smarak