Created 10-12-2017 07:59 PM
Hello,
I am using HDP 2.6. When I run a query on Hive tables, the Tez job fails with: killed/failed due to OWN_TASK_FAILURE.
The error does not carry much information, and I am not able to resolve this issue. A similar issue seems to have been hit by many people, but I could not find a solution in the HWX community.
It fails even for basic queries like:
select max(column1) from table;
select count(*) from table;
insert into table (data merge) with a large number of records.
Please help out; any input is appreciated.
The error is something like:
--------------------------------------------------------------------------------
        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      2          2        0        0       7       0
Reducer 2            RUNNING      1          0        1        0       0       0
--------------------------------------------------------------------------------
VERTICES: 01/02  [=================>>---------] 66%  ELAPSED TIME: 11.22 s
--------------------------------------------------------------------------------
Status: Failed
Vertex re-running, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00
Vertex failed, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00, diagnostics=[Vertex vertex_1507606976106_0043_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1507606976106_0043_2_00 [Map 1] failed as task task_1507606976106_0043_2_00_000001 failed after vertex succeeded.]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex re-running, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00
Vertex failed, vertexName=Map 1, vertexId=vertex_1507606976106_0043_2_00, diagnostics=[Vertex vertex_1507606976106_0043_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1507606976106_0043_2_00 [Map 1] failed as task task_1507606976106_0043_2_00_000001 failed after vertex succeeded.]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
expr: syntax error
Created 10-12-2017 09:23 PM
The YARN application log should provide more insight into the error. You can gather the YARN logs using the command below:
yarn logs -applicationId application_1507606976106_0043
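Once that returns output, a quick follow-up (just a sketch; the output file name is arbitrary and this assumes log aggregation is enabled):
# dump the aggregated logs to a file and scan for the first errors
yarn logs -applicationId application_1507606976106_0043 > app_0043.log
grep -inE "error|exception" app_0043.log | head -20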
Created 10-12-2017 09:32 PM
I could not find much from the logs.
[DEV ~]$ yarn logs -applicationId application_1507606976106_0043
17/10/12 17:26:47 INFO client.AHSProxy: Connecting to Application History server at ........
17/10/12 17:26:47 INFO client.RequestHedgingRMFailoverProxyProvider: Looking for the active RM in [rm1, rm2]...
17/10/12 17:26:47 INFO client.RequestHedgingRMFailoverProxyProvider: Found active RM [rm2]
/app-logs/u0456/logs/application_1507606976106_0043 does not have any log files.
Can not find the logs for the application: application_1507606976106_0043 with the appOwner: u0456
Created 10-12-2017 09:46 PM
@D G
Would you be able to find the task attempt that actually failed? That task attempt will show you which machine and YARN container it ran on. Sometimes the logs don't have the error because it was logged to stderr. In that case, the stderr from the container's YARN logs may show the error.
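For example (a rough sketch only; the container ID and node address below are placeholders, not taken from your job), you can pull a single container's log on HDP 2.6 like this:
# substitute the failed attempt's container ID and its NodeManager host:port
yarn logs -applicationId application_1507606976106_0043 \
  -containerId container_1507606976106_0043_01_000002 \
  -nodeAddress <nm-host>:45454 > failed_container.log
# the stderr section of that output is usually where the real stack trace lands
grep -inE "error|exception" failed_container.log | head -20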
Could you set these variables and re-run the query:
set hive.execution.engine=tez;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=405306368;
set hive.vectorized.execution.enabled=true;
set hive.vectorized.execution.reduce.enabled=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=134217728;
set hive.merge.smallfiles.avgsize=44739242;
set mapreduce.job.reduce.slowstart.completedmaps=0.8;
Please let me know if that helped
Created 05-14-2019 01:06 PM
Hi @D G, have you fixed this problem?
Created 05-15-2019 02:59 PM
Hi,
I am having the same problem after upgrading from HDP 2.6.2 to HDP 3.1, although I have a lot of resources in the cluster. When I run a query (select count(*) from table), if the table is small (3k records) it runs successfully; if the table is larger (50k records) I get the same vertex failure error. I checked the YARN application log for the failed query and I get the error below:
2019-05-14 11:58:14,823 [INFO] [TezChild] |tez.ReduceRecordProcessor|: Starting Output: out_Reducer 2
2019-05-14 11:58:14,828 [INFO] [TezChild] |compress.CodecPool|: Got brand-new decompressor [.snappy]
2019-05-14 11:58:18,466 [INFO] [TaskHeartbeatThread] |task.TaskReporter|: Routing events from heartbeat response to task, currentTaskAttemptId=attempt_1557754551780_0137_1_01_000000_0, eventCount=1 fromEventId=1 nextFromEventId=2
2019-05-14 11:58:18,488 [INFO] [Fetcher_B {Map_1} #1] |HttpConnection.url|: for url=http://myhost_name.com:13562/mapOutput?job=job_1557754551780_0137&dag=1&reduce=0&map=attempt_1557754551780_0137_1_00_000000_0_10002 sent hash and receievd reply 0 ms
2019-05-14 11:58:18,491 [INFO] [Fetcher_B {Map_1} #1] |shuffle.Fetcher|: Failed to read data to memory for InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1557754551780_0137_1_00_000000_0_10002, spillType=0, spillId=-1]. len=28, decomp=14. ExceptionMessage=Not a valid ifile header
2019-05-14 11:58:18,492 [WARN] [Fetcher_B {Map_1} #1] |shuffle.Fetcher|: Failed to shuffle output of InputAttemptIdentifier [inputIdentifier=0, attemptNumber=0, pathComponent=attempt_1557754551780_0137_1_00_000000_0_10002, spillType=0, spillId=-1] from myhost_name.com
java.io.IOException: Not a valid ifile header
	at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.verifyHeaderMagic(IFile.java:859)
	at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.isCompressedFlagEnabled(IFile.java:866)
	at org.apache.tez.runtime.library.common.sort.impl.IFile$Reader.readToMemory(IFile.java:616)
	at org.apache.tez.runtime.library.common.shuffle.ShuffleUtils.shuffleToMemory(ShuffleUtils.java:121)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.fetchInputs(Fetcher.java:950)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:599)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.doHttpFetch(Fetcher.java:486)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:284)
	at org.apache.tez.runtime.library.common.shuffle.Fetcher.callInternal(Fetcher.java:76)
	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Both queries were working fine before the upgrade. The only change I made after the upgrade was increasing the heap size of the DataNodes. I also followed @Geoffrey Shelton Okot's configuration, but I still get the same error.
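For anyone comparing setups: the fetch above fails with a Snappy decompressor and "Not a valid ifile header", so the Tez intermediate compression settings look relevant. A rough way to check them, assuming the standard HDP client config path (adjust if your layout differs), and compare against the pre-upgrade values:
# show the tez.runtime.compress* and tez.runtime.shuffle* properties and their values
grep -A1 "tez.runtime.compress" /etc/tez/conf/tez-site.xml
grep -A1 "tez.runtime.shuffle" /etc/tez/conf/tez-site.xml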
thanks