Support Questions
Find answers, ask questions, and share your expertise

Question - Is there any way to skip checking for jobHistory server to avoid Oozie job failures

Solved


Super Guru

Is there any way to skip the check against the JobHistory server to avoid Oozie job failures? Sometimes when the JobHistory server is down or being restarted, an Oozie job fails with an "Unknown job id" error (it fails to connect to the JobHistory server, which is expected).

Any idea how we can work around this error? Is there a timeout parameter?

1 ACCEPTED SOLUTION

Accepted Solutions

Re: Question - Is there any way to skip checking for jobHistory server to avoid Oozie job failures

Super Guru

I did a bit of research, looked into the code, and found that there is currently no timeout parameter at the Oozie level. I have raised an internal enhancement request for this.

##Snippet from JavaActionExecutor.java##

try {
    Element actionXml = XmlUtils.parseXml(action.getConf());
    FileSystem actionFs = context.getAppFileSystem();
    JobConf jobConf = createBaseHadoopConf(context, actionXml);
    jobClient = createJobClient(context, jobConf);
    RunningJob runningJob = getRunningJob(context, action, jobClient);
    if (runningJob == null) {
        context.setExecutionData(FAILED, null);
        throw new ActionExecutorException(ActionExecutorException.ErrorType.FAILED, "JA017",
            "Unknown hadoop job [{0}] associated with action [{1}]. Failing this action!",
            action.getExternalId(), action.getId());
    }
    // ... rest of the try block elided ...

protected RunningJob getRunningJob(Context context, WorkflowAction action, JobClient jobClient) throws Exception {
    RunningJob runningJob = jobClient.getJob(JobID.forName(action.getExternalId()));
    return runningJob;
}

##Snippet from MapReduce code (JobClient.java)##

public RunningJob getJob(JobID jobid) throws IOException {
    JobStatus status = jobSubmitClient.getJobStatus(jobid);
    JobProfile profile = jobSubmitClient.getJobProfile(jobid);
    if (status != null && profile != null) {
        return new NetworkedJob(status, profile, jobSubmitClient);
    } else {
        return null;
    }
}

##Snippet from JobSubmissionProtocol.java (MapReduce code)##

/**
 * Grab a handle to a job that is already known to the JobTracker.
 * @return Status of the job, or null if not found.
 */
public JobStatus getJobStatus(JobID jobid) throws IOException;

So the failure comes from `getJob()` returning null when the JobHistory server cannot be reached, which Oozie treats as an unknown job. I got the answer to my question! :)
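Until Oozie grows such a retry/timeout option, the idea behind the enhancement request can be sketched as a generic wrapper: retry a lookup that returns null while the service behind it is restarting. This is a hypothetical helper, not Oozie code; the class and method names (`NullRetry`, `retryUntilNonNull`) are made up for illustration.

```java
import java.util.function.Supplier;

/**
 * Hypothetical sketch of a retry wrapper around a lookup (e.g. jobClient.getJob)
 * that may return null while a backing service such as the JobHistory server
 * is being restarted. Not part of Oozie.
 */
public class NullRetry {

    public static <T> T retryUntilNonNull(Supplier<T> lookup,
                                          int maxAttempts,
                                          long sleepMillis) throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T result = lookup.get();          // e.g. jobClient.getJob(jobId)
            if (result != null) {
                return result;                // service answered; job found
            }
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMillis);    // back off while the service restarts
            }
        }
        return null;                          // still unknown after all attempts
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a service that only answers on the third call.
        final int[] calls = {0};
        String result = retryUntilNonNull(
                () -> ++calls[0] < 3 ? null : "job_123", 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints: job_123 after 3 attempts
    }
}
```

With a wrapper like this, a JA017 failure would only be raised after the retries are exhausted, instead of on the first null from `getJob()` during a JobHistory server restart.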



Re: Question - Is there any way to skip checking for jobHistory server to avoid Oozie job failures

Maybe you should try troubleshooting your JobHistory server instead; it's not supposed to be down. A busy JobHistory server needs a reasonable amount of memory, say 8 GB or more.
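As a rough sketch of the sizing advice above: in a plain Hadoop install the JobHistory server heap is typically set in mapred-env.sh via `HADOOP_JOB_HISTORYSERVER_HEAPSIZE` (value in MB). The exact variable and file can vary by Hadoop version and distribution; on an Ambari-managed cluster you would set this through the MapReduce2 service configs instead of editing the file directly.

```shell
# mapred-env.sh -- JobHistory server heap size, in MB.
# Variable name/location may differ across Hadoop versions and distributions;
# 8192 (8 GB) here follows the sizing suggested in the reply above.
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=8192
```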

Re: Question - Is there any way to skip checking for jobHistory server to avoid Oozie job failures

Super Guru

@Predrag Minovic - Yes, that's correct, but if the JHS is being restarted for some reason and Oozie tries to connect to it, jobs will fail. I'm looking for a timeout parameter that can hold jobs until the JHS is back.

