Created 03-23-2022 01:40 PM
I need help understanding the root cause of a job failure in Cloudera Data Science Workbench.
What does it mean when a job is stuck in the "Scheduling" phase and terminates as "Failed" after the timeout period? The log file, which usually shows the Python code followed by the output from print statements and/or runtime errors, contains only a single line: "Engine exited with status 34". I was able to re-run the job successfully after a while, so it may have been a temporary glitch, but I would still like to understand the root cause. Any insight into what happens during job scheduling is much appreciated.
Created 03-23-2022 05:00 PM
@MadhuNP ,
It's hard to say without looking at the logs. If the job failed, the log should show some exception messages.
If not, more investigation would be required to get to the root cause, IMO.
André
Created 03-24-2022 09:49 AM
Log file just had a single line stating "Engine exited with status 34".
Created on 04-08-2022 12:46 AM - edited 04-08-2022 01:01 AM
Hello @MadhuNP
Thanks for using the Cloudera Community. We see your team is working with our Support team on this issue. Based on the outcome of that engagement, we shall update this post accordingly.
Regards, Smarak
Created 01-24-2023 10:47 PM
Hello, @smdas ,
do you know of any cause or solution for this issue? We are facing the same problem.
Thank you for your answer.
Regards, Ellen
Created 01-24-2023 11:30 PM
Hello @Ellengogo
I am not able to find the exact case reference, as this post is a few months old. Typically, such issues are caused by a resource constraint: the engine pod (created when a Workbench session or job is started) gets terminated because the cluster cannot satisfy its resource request. If the issue is persistent, a Support case would be ideal, as we would need to review the Kubernetes output pertaining to the engine pod along with the resource profile and other related artefacts.
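As a rough illustration of the kind of Kubernetes review mentioned above, here is a minimal sketch that filters the JSON from `kubectl get events -o json` (run on the CDSW master host) for event reasons that point at resource starvation. The sample event data and the function name are my own for illustration; the event reasons `FailedScheduling`, `Evicted`, and `OOMKilling` are standard Kubernetes event reasons, but the exact events you see will depend on your deployment.

```python
import json

# Hypothetical sample standing in for `kubectl get events -o json` output.
SAMPLE_EVENTS = json.dumps({
    "items": [
        {"reason": "FailedScheduling",
         "message": "0/3 nodes are available: 3 Insufficient memory.",
         "involvedObject": {"name": "engine-abc123"}},
        {"reason": "Scheduled",
         "message": "Successfully assigned engine-def456 to node-2",
         "involvedObject": {"name": "engine-def456"}},
    ]
})

def resource_pressure_events(events_json):
    """Return '<pod>: <message>' strings for events that suggest
    the engine pod was starved of, or evicted for, resources."""
    suspicious = {"FailedScheduling", "Evicted", "OOMKilling"}
    return [
        "{}: {}".format(e["involvedObject"]["name"], e["message"])
        for e in json.loads(events_json)["items"]
        if e["reason"] in suspicious
    ]

for line in resource_pressure_events(SAMPLE_EVENTS):
    print(line)
```

A `FailedScheduling` event with an "Insufficient cpu/memory" message at the time the job was stuck in "Scheduling" would line up with the resource-constraint explanation above.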
Regards, Smarak
Created 02-01-2023 09:19 AM
I agree with Smarak; the error code typically means that there were not enough resources available to start the job. You could use the Grafana dashboard (available in the Admin page) to look at the cluster resources and load around the time you had this issue. Is it happening consistently?
For Jobs, I usually see this at the start or end of the month, when a lot of people schedule periodic jobs and they all trigger at the same time.
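One workaround for the "everything triggers at once" pattern described above is to add a small random delay at the top of each job's script, so the jobs do not all request engine resources at the same instant. A minimal sketch (the function name and the 300-second default are my own choices, not a CDSW API):

```python
import random
import time

def jittered_start(max_jitter_s=300.0, seed=None):
    """Sleep for a random delay in [0, max_jitter_s) and return it.

    Jobs that all fire at the same wall-clock time (e.g. midnight on
    the 1st) will then request their engines at slightly different
    moments, smoothing the load on the cluster.
    """
    delay = random.Random(seed).uniform(0.0, max_jitter_s)
    time.sleep(delay)
    return delay

# Call this at the very top of the job's script, before any real work:
# jittered_start(max_jitter_s=180.0)
```

This does not fix an underlying capacity shortage, but it can reduce the number of simultaneous scheduling requests at peak times.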