Hive - Tez, vertex failed error during reduce phase - HDP 3.1.0.0-78

Expert Contributor

Hi,

I am running HDP 3.1 (3.1.0.0-78) with 10 data nodes, and the Hive execution engine is Tez. When I run a query, I get this error:

ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex failed, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00, diagnostics=[Vertex vertex_1557754551780_1091_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1557754551780_1091_2_00 [Map 1] failed as task task_1557754551780_1091_2_00_000001 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
INFO  : Completed executing command(queryId=hive_20190516161715_09090e6d-e513-4fcc-9c96-0b48e9b43822); Time taken: 17.935 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex failed, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00, diagnostics=[Vertex vertex_1557754551780_1091_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1557754551780_1091_2_00 [Map 1] failed as task task_1557754551780_1091_2_00_000001 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)


When I traced the logs (the application ID, for example, is application_1557754551780_1091), I checked the path where the map output is written (/var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003), and the files below were created with these permissions:

-rw-------. 1 hive hadoop 28 May 16 16:17 file.out
-rw-r-----. 1 hive hadoop 32 May 16 16:17 file.out.index
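
(For anyone who wants to reproduce this check, a minimal sketch, assuming the same local-dirs root as in the listing above; adjust if your yarn.nodemanager.local-dirs differs:)

# List the Tez map output files for the failed application on this NodeManager
# (the local-dirs root is an assumption; check yarn.nodemanager.local-dirs)
APP_ID=application_1557754551780_1091
find /var/lib/hadoop/yarn/local/usercache/hive/appcache/${APP_ID}/output \
     -name 'file.out*' -exec ls -l {} \;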


I also found this error in the NodeManager logs:

2019-05-16 16:19:05,801 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,818 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,821 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,822 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,824 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,826 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
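
(If you want to check your own NodeManagers for the same symptom, a minimal sketch, assuming the usual HDP log location under /var/log/hadoop-yarn/yarn; adjust to your configured YARN log dir:)

# Look for ShuffleHandler "not found" messages in the NodeManager log
# (the log path is an assumption; check the YARN log dir configured in Ambari)
grep "not found" /var/log/hadoop-yarn/yarn/*nodemanager*.log | grep ShuffleHandler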


This means that file.out is not readable by the yarn user, which causes the whole task to fail.
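
A quick way to confirm this is to try reading the file as the yarn user directly on the NodeManager; a minimal sketch, using the paths from the listing above:

# Run as root on the NodeManager: with mode 600 (rw-------) the first read fails
# with "Permission denied", while the group-readable file.out.index can be read
# (assuming yarn is in the hadoop group, as is standard in HDP)
sudo -u yarn cat /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out
sudo -u yarn cat /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out.index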

I also checked the parent directory permissions and the umask for all users (0022), which means the files inside the output directory should be readable by other users in the same group:

drwx--x---. 3 hive hadoop 16 May 16 16:16 filecache
drwxr-s---. 3 hive hadoop 60 May 16 16:16 output
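
(For reference, one way to check the effective umask for the service users involved, a minimal sketch run on a NodeManager:)

# Check the effective umask for the hive and yarn users
sudo -u hive bash -lc 'umask'
sudo -u yarn bash -lc 'umask'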


I reran the whole scenario on a different cluster (HDP version 3.0.1.0-187): there, file.out has the same permissions as file.out.index, and the queries run fine without any problems. I also switched to the yarn user and used vi to confirm that it is able to read the contents of file.out, which it was.

-rw-r-----. 1 hive hadoop 28 May 16 16:17 file.out
-rw-r-----. 1 hive hadoop 32 May 16 16:17 file.out.index

When I shut down all the NodeManagers so that only one is up and running, all the queries run fine, even though file.out is still created with the same permissions; I guess that is because everything is running on the same node.


N.B.: we upgraded from HDP 2.6.2 to HDP 3.1.0.0-78.

10 REPLIES


Hi, we are facing more or less the same issue on HDP 3.1.0.0-78, on a cluster with 11 nodes.


Maybe we can talk / chat and work out a solution. I contacted you on LinkedIn 🙂

Expert Contributor

Yeah, sure, I will happily work with you to get this fixed.


Expert Contributor

Glad to work with you and your team to get this issue fixed.


Contributor

I had the same issue, and we are using HDP 3.1.0.0-78.

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/release-notes/content/patch_tez.html

TEZ-3894 seems to already be applied to HDP 3.1 (I also checked the source code a little, and yes, it looks like it has already been applied).
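
(In case it helps others double-check which Tez build is actually deployed, a minimal sketch, assuming the standard HDP layout under /usr/hdp; the symlink name may differ on your nodes:)

# Show the Tez jars shipped with the active HDP stack on this node
ls -l /usr/hdp/current/tez-client/tez-*.jar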

But I still have this issue...

I can avoid this issue by changing fs.permissions.umask-mode from "077" to "022" in an HS2 session:

0: jdbc:hive2://XXXX > set fs.permissions.umask-mode=022;
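
(For context, a minimal sketch of applying the same workaround for a whole session from the command line; the HS2 URL and table name below are placeholders:)

# Set the umask override in the same beeline session as the failing query
beeline -u "jdbc:hive2://<hs2-host>:10000/default" -e "
set fs.permissions.umask-mode=022;
select count(*) from my_table;
"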


So I guess this issue may not be completely fixed by TEZ-3894 (with HDP 3.1.0.0-78)...


@Maurice Knopp We recently saw that TEZ-3894 only fixes the issue partially. If your job ends up spinning up multiple mappers, then you are likely to hit a variant of TEZ-3894, although on the surface it appears to be the same.
For a permanent fix, you may want to get a patch for https://issues.apache.org/jira/browse/TEZ-4057


Thanks for letting me know!

Is there any estimate/timeline for when HDP 3.1 will allow upgrading the shipped Tez version (0.9.1) to a newer release? I don't want to upgrade/patch one component myself, because I am afraid I will lose the upgradeability of the entire HDP stack when future releases surface...


@Maurice Knopp We do not have any planned dates yet. However, if you are an Enterprise Support customer, you can ask for a hotfix and you will be provided with a patch jar, which is very easy to replace on all machines running Tez.