Created on 05-08-2016 01:45 PM - edited 08-19-2019 01:47 AM
Hi guys,
My cluster is running HDP 2.3.2.0, and we analyse data with Hive on Tez.
Jobs usually run fine, but sometimes a job never finishes. The task attempt log keeps reporting a progress of 1.0, like this: "org.apache.hadoop.mapred.TaskAttemptListenerImpl: Progress of TaskAttempt attempt_1462372008131_4318_m_000000_0 is : 1.0", and it never changes to another state unless I intervene.
The job progress on the "All Applications" UI shows 95%.
So I have to kill the Tez job that never finishes to avoid holding up the next jobs. How can I fix this? I have no idea what causes the problem.
My scenario: the Hive on Tez jobs are executed by Oozie coordinators, with 8 workflow actions running concurrently at one time.
Please help. Thank you very much.
Some details about our cluster: the NodeManager, DataNode, RegionServer, and ResourceManager are all started on the same node.
The attachment has more detailed info about my issue and the container log. Thanks again.
Created 06-12-2016 07:07 AM
I finally found a way to solve this issue.
It is caused by how the YARN scheduler assigns cluster resources when working with Oozie.
When Oozie executes a workflow, the Oozie launcher occupies 1 container, and that launcher then starts the Tez application for my Oozie hive2 action.
But my cluster has fewer than 40 nodes; it's a small cluster, and there are a lot of Oozie workflow tasks.
As a result, there are not enough containers left to complete the Tez application, and the Oozie launcher does not release its container until the Tez application completes, so the cluster runs out of resources.
My solution was to set up 2 queues in YARN: one is the `default` queue, and the other is `oozie`, used only for the Oozie launchers. I then set the `default` queue name on my Oozie hive2 action.
Thanks again for all of your replies, they inspired me.
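To make the starvation described above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not how YARN actually schedules: the container counts, the `app_needs` value, and the `run_workflows` helper are all hypothetical and chosen for illustration. It assumes each workflow's launcher holds 1 container for the whole run and that each Tez application needs at least 2 more containers to make progress.

```python
def run_workflows(total_containers, workflows, launcher_cap=None, app_needs=2):
    """Count workflows that complete in a toy model of launcher/app contention.

    launcher_cap=None models a single shared queue (launchers may take every
    container). A number models a separate launcher queue with a hard limit.
    All figures are illustrative assumptions, not measured cluster values.
    """
    cap = total_containers if launcher_cap is None else launcher_cap
    pending, finished = workflows, 0
    while pending > 0:
        # Launchers are granted first and hold their container until the app ends.
        launchers = min(pending, cap, total_containers)
        free = total_containers - launchers
        # An app can only run if a launcher started it and enough containers remain.
        can_run = min(launchers, free // app_needs)
        if can_run == 0:
            break  # launchers hold the containers and no app can proceed
        finished += can_run
        pending -= can_run
    return finished


# 8 concurrent workflows on a hypothetical 8-container cluster.
print(run_workflows(8, 8))                  # shared queue: 0 workflows finish
print(run_workflows(8, 8, launcher_cap=2))  # capped launcher queue: 8 finish
```

In the shared-queue case the 8 launchers take all 8 containers and wait forever on applications that cannot get resources. Capping the launchers leaves room for the applications, which is the effect the separate `oozie` queue is meant to achieve.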
Created 05-09-2016 11:40 PM
It's not the ultimate solution, but you can reduce tez.session.am.dag.submit.timeout.secs to keep jobs from being held up for a long time.
Created 05-10-2016 03:50 AM
Hi, thanks for the reply, I appreciate it.
Our settings are tez.session.am.dag.submit.timeout.secs=600 and tez.session.client.timeout.secs=-1.
Is there something wrong with them?
Our jobs usually finish very quickly because the data is not large, but once in a while a job's progress stays stuck at 1.0.
Thanks again.
Created 05-10-2016 09:15 PM
You may want to set tez.session.client.timeout.secs to something like 180 rather than -1, because -1 means it waits forever.
You may also want to look into the application log to investigate why that job gets stuck.
Created 05-10-2016 09:29 PM
I think you are running out of disk space. Can you check that?
Created 06-12-2016 07:05 AM
Thanks for the reply. I found the final solution and will post it here.
It's about how the YARN scheduler works with Oozie.
When Oozie executes a workflow, the Oozie launcher occupies 1 container, and that launcher then starts the Tez application for my Oozie hive2 action.
But my cluster has fewer than 40 nodes; it's a small cluster, and there are a lot of Oozie workflow tasks.
As a result, there are not enough containers left to complete the Tez application, and the Oozie launcher does not release its container until the Tez application completes, so the cluster runs out of resources.
My solution was to set up 2 queues in YARN: one is the `default` queue, and the other is `oozie`, used only for the Oozie launchers. I then set the `default` queue name on my Oozie hive2 action.
Thanks again for all of your replies, they inspired me.
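For anyone wanting to try the two-queue setup, here is a rough sketch of the properties involved, written as a small Python helper that prints them. It assumes the Capacity Scheduler is in use (the thread does not say which scheduler is configured), and the 80/20 split, the hard cap on the `oozie` queue, and the `capacity_scheduler_properties` helper are hypothetical choices for illustration. How the Tez queue reaches a hive2 action also depends on the setup, so treat `tez.queue.name` here as an assumption.

```python
# Hypothetical split: the thread names the queues but gives no capacities.
QUEUE_CAPACITY = {"default": 80, "oozie": 20}


def capacity_scheduler_properties(capacities, capped_queue="oozie"):
    """Build Capacity Scheduler properties for child queues of root."""
    if sum(capacities.values()) != 100:
        raise ValueError("child queue capacities must sum to 100")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(capacities)}
    for name, percent in capacities.items():
        props[f"yarn.scheduler.capacity.root.{name}.capacity"] = str(percent)
    # Assumption: cap the launcher queue so it cannot grow into the other share.
    props[f"yarn.scheduler.capacity.root.{capped_queue}.maximum-capacity"] = str(
        capacities[capped_queue]
    )
    return props


# Launcher jobs go to `oozie`; the Tez application itself goes to `default`.
ACTION_PROPERTIES = {
    "oozie.launcher.mapred.job.queue.name": "oozie",
    "tez.queue.name": "default",  # assumption: may need to be set on the Hive side
}

for key, value in {**capacity_scheduler_properties(QUEUE_CAPACITY), **ACTION_PROPERTIES}.items():
    print(f"{key}={value}")
```

The scheduler properties would go into capacity-scheduler.xml and the launcher queue into the workflow action's configuration; the exact percentages should be tuned to how many launchers run at once.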
Created 07-22-2016 03:43 AM
No, it's not about disk space; the YARN scheduler is the key.