
Tez: vertex failed due to its own failure; DAG did not succeed due to VERTEX_FAILURE.

New Contributor

Hello,

I am using HDP 2.6. When I run a query on Hive tables, the Tez job fails with: killed/failed due to OWN_TASK_FAILURE.

The error does not have much information, and I am not able to resolve this issue. It looks like a similar issue has been experienced by many people, but I could not find a solution in the HWX community.

It fails even for basic queries like:

select max(column1) from table;

select count(*) from table;

insert into table (data merge) with a large number of records.

Please help out; any input is appreciated.

The error is something like:

--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      2          2        0        0       7       0
Reducer 2            RUNNING      1          0        1        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/02  [=================>>---------] 66%   ELAPSED TIME: 11.22 s
--------------------------------------------------------------------------------
Status: Failed
Vertex re-running, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00
Vertex failed, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00, diagnostics=[Vertex vertex_1507606976106_0043_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1507606976106_0043_2_00 [Map 1] failed as task task_1507606976106_0043_2_00_000001 failed after vertex succeeded.]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00Vertex failed, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00, diagnostics=[Vertex vertex_1507606976106_0043_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1507606976106_0043_2_00 [Map 1] failed as task task_1507606976106_0043_2_00_000001 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
expr: syntax error
6 REPLIES

Rising Star

The YARN application log should be able to provide more insight into the error. You can gather the YARN log using the command below:

yarn logs -applicationId application_1507606976106_0043
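If the job has already finished and log aggregation is enabled, something along these lines should capture the full log and let you search it for the failing task attempt (the application and task IDs below are taken from your output; adjust as needed):

# Dump the aggregated application log to a local file
yarn logs -applicationId application_1507606976106_0043 > app_0043.log

# Search for the failing task attempt and any nearby exceptions
grep -n "_0043_2_00_000001" app_0043.log
grep -nEi "error|exception" app_0043.log | head -50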

New Contributor

I could not find much from the logs.

 [DEV  ~]$ yarn logs -applicationId application_1507606976106_0043
17/10/12 17:26:47 INFO client.AHSProxy: Connecting to Application History server at ........
17/10/12 17:26:47 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
17/10/12 17:26:47 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
/app-logs/u0456/logs/application_1507606976106_0043 does not have any log files.
Can not find the logs for the application: application_1507606976106_0043 with the appOwner: u0456

Master Mentor

@D G

Would you be able to find the task attempt that actually failed? That task attempt can show you which machine and YARN container it ran on. Sometimes the logs don't show the error because it was logged to stderr. In that case, the stderr from the container's YARN logs may show the error.
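Roughly something like this (a sketch only; the container ID and node address below are placeholders you would read out of the log itself):

# Locate the failed task attempt; the surrounding lines usually name the container and host
yarn logs -applicationId application_1507606976106_0043 | grep -B2 -A5 "_0043_2_00_000001"

# Then pull just that container's log, which includes its stderr
# (container ID and node address are placeholders)
yarn logs -applicationId application_1507606976106_0043 \
  -containerId container_e01_1507606976106_0043_01_000002 \
  -nodeAddress worker-node.example.com:45454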

Could you set these variables and re-run the query:

set hive.execution.engine=tez;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=405306368;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=134217728;
set hive.merge.smallfiles.avgsize=44739242;
set mapreduce.job.reduce.slowstart.completedmaps=0.8;
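If you are testing from Beeline, one way to try these without changing hive-site.xml is to save the set statements above in a file and apply them per session, roughly like this (the JDBC URL and table name are placeholders):

# Save the set statements above as tez_settings.hql, then re-run the failing query
# with them applied for just this session (JDBC URL and table are placeholders)
beeline -u "jdbc:hive2://your-hiveserver2-host:10000/default" \
  -i tez_settings.hql \
  -e "select count(*) from your_table;"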

Please let me know if that helped.

Expert Contributor

Hi @D G, have you fixed this problem?


Expert Contributor

Hi,

I am having the same problem after upgrading from HDP 2.6.2 to HDP 3.1, although I have a lot of resources in the cluster. When I run a query (select count(*) from table), it succeeds if the table is small (3k records), but if the table is larger (50k records) I get the same vertex failure error. I checked the YARN application log for the failed query and get the error below:



2019-05-14 11:58:14,823 [INFO] [TezChild] |tez.ReduceRecordProcessor|: Starting Output: out_Reducer 2
2019-05-14 11:58:14,828 [INFO] [TezChild] |compress.CodecPool|: Got brand-new decompressor [.snappy]
2019-05-14 11:58:18,466 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Routing events from heartbeat response to task, currentTaskAttemptId=attempt_1557754551780_0137_1_01_000000_0, eventCount=1 fromEventId=1 nextFromEventId=2
2019-05-14 11:58:18,488 [INFO] [Fetcher_B {Map_1} #1] |HttpConnection.url|: for url=http://myhost_name.com:13562/mapOutput?job=job_1557754551780_0137&dag=1&reduce=0&map=attempt_1557754551780_0137_1_00_000000_0_10002 sent hash and receievd reply 0 ms
2019-05-14 11:58:18,491 [INFO] [Fetcher_B {Map_1} #1] |shuffle.Fetcher|: Failed to read data to memory for InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1557754551780_0137_1_00_000000_0_10002, spillType=0, spillId=-1]. len=28, decomp=14. ExceptionMessage=Not a valid ifile header
2019-05-14 11:58:18,492 [WARN] [Fetcher_B {Map_1} #1] |shuffle.Fetcher|: Failed to shuffle output of InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1557754551780_0137_1_00_000000_0_10002, spillType=0, spillId=-1] from myhost_name.com
java.io.IOException: Not a valid ifile header
	at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.verifyHeaderMagic(IFile.java:859)
	at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.isCompressedFlagEnabled(IFile.java:866)
	at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.readToMemory(IFile.java:616)
	at org.apache.tez.runtime.library.common.shuffle.ShuffleUtils.shuffleToMemory(ShuffleUtils.java:121)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:950)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:599)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:486)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:284)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:76)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


Both queries were working fine before the upgrade. The only change I made after the upgrade was increasing the heap size of the DataNodes. I also followed @Geoffrey Shelton Okot's configuration, but I still get the same error.

Thanks.