Support Questions


Hive on Tez - vertex failed error during reduce phase - HDP 3.1.0.0-78

avatar
Expert Contributor

Hi,

I am running HDP 3.1 (3.1.0.0-78) with 10 data nodes, and the Hive execution engine is Tez. When I run a query I get this error:

ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex failed, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00, diagnostics=[Vertex vertex_1557754551780_1091_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1557754551780_1091_2_00 [Map 1] failed as task task_1557754551780_1091_2_00_000001 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
INFO  : Completed executing command(queryId=hive_20190516161715_09090e6d-e513-4fcc-9c96-0b48e9b43822); Time taken: 17.935 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex failed, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00, diagnostics=[Vertex vertex_1557754551780_1091_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1557754551780_1091_2_00 [Map 1] failed as task task_1557754551780_1091_2_00_000001 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)


When I traced the logs (for this example the application id is application_1557754551780_1091), I checked the path where the map output is written (/var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003). The files below are created with these permissions:

-rw-------. 1 hive hadoop 28 May 16 16:17 file.out
-rw-r-----. 1 hive hadoop 32 May 16 16:17 file.out.index


Also, in the NodeManager logs I found this error:

2019-05-16 16:19:05,801 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,818 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,821 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,822 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,824 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,826 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found


This means that file.out won't be readable by the yarn user, which causes the whole task to fail.
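
For reference, a minimal shell sketch for pulling these ShuffleHandler messages out of the NodeManager log; the log directory and file pattern are assumptions based on a typical HDP layout, so adjust them to your environment:

APP_ID="application_1557754551780_1091"      # the failing application from above
NM_LOG_DIR="/var/log/hadoop-yarn/yarn"       # assumed default NodeManager log dir on HDP; verify on your nodes

# Print every shuffle "not found" line that belongs to this application
grep -H "ShuffleHandler" "${NM_LOG_DIR}"/*nodemanager*.log* | grep "${APP_ID}" | grep "not found"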

I also checked the parent directory permissions, and I checked the umask for all users (0022), which means the files inside the output directory should be readable by other users in the same group:

drwx--x---. 3 hive hadoop 16 May 16 16:16 filecache
drwxr-s---. 3 hive hadoop 60 May 16 16:16 output
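
As a quick illustration of why the umask matters here, a minimal sketch in a scratch directory (not one of the paths above): with umask 0022 a new file comes out group-readable (644), while with umask 0077 it comes out owner-only (600), which matches the unreadable file.out shown earlier.

TMP_DIR=$(mktemp -d)                                  # scratch directory for the demo
(umask 0022 && touch "${TMP_DIR}/created_with_022")   # -> -rw-r--r--, group can read
(umask 0077 && touch "${TMP_DIR}/created_with_077")   # -> -rw-------, only the owner can read
ls -l "${TMP_DIR}"
rm -rf "${TMP_DIR}"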


I reran the whole scenario on a different cluster (HDP version 3.0.1.0-187), and there file.out has the same permissions as file.out.index, and the queries run fine without any problems. I also switched to the yarn user and used vi to make sure the yarn user is able to read the content of file.out, and it was:

-rw-r-----. 1 hive hadoop 28 May 16 16:17 file.out
-rw-r-----. 1 hive hadoop 32 May 16 16:17 file.out.index
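
The same read check can also be scripted instead of switching users interactively; a minimal sketch, assuming sudo access on the NodeManager host and reusing the attempt path from the logs above (adjust it to your own attempt id):

# Try to read the map output exactly as the yarn user would
sudo -u yarn head -c 64 /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out > /dev/null \
  && echo "file.out is readable by yarn" \
  || echo "file.out is NOT readable by yarn"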

When I shut down all the NodeManagers except one, all the queries run fine. file.out is still created with the same permissions, but I guess that is because everything is running on the same node.


N.B.: we upgraded from HDP 2.6.2 to HDP 3.1.0.0-78.


10 REPLIES

avatar

Hi, we are facing more or less exactly the same issue on HDP 3.1.0.0-78 on a cluster with 11 nodes.


Maybe we can talk / chat and work out a solution. I contacted you on LinkedIn 🙂

avatar
Expert Contributor

Yeah, sure, I will happily work with you to get this fixed.

avatar

So far, it seems that our issues were solved by setting the HDFS setting "fs.permissions.umask-mode" to the value "022". In our HDP 2.7 installation, this was the case out of the box. HDP 3.1 seems to have a default value of 077, which doesn't work for us and yields the error mentioned above.


We've done some intensive testing right now, and the value 022 seems to work and has solved our problems, as far as I can tell. It would be great if you could verify or falsify the issue on your installation as well.


Let me know if I can help you with anything!
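
To see what a cluster is currently using before changing anything, a minimal sketch; hdfs getconf reads the client-side configuration, and the assumption here is an Ambari-managed HDP cluster where the property lives in core-site.xml:

# Show the effective umask Hadoop/Tez applies to files it creates
hdfs getconf -confKey fs.permissions.umask-mode

# If this prints 077, the fix described above is to change the property to 022
# in core-site.xml (e.g. via Ambari: HDFS > Configs) and restart the affected services.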

avatar
Expert Contributor

Glad to work with you and your team to get this issue fixed.

avatar
Expert Contributor

Also, for more documentation about how we found the solution: in this Tez JIRA ticket, https://issues.apache.org/jira/browse/TEZ-3894, it's mentioned that Tez takes the permissions of its intermediate files from "fs.permissions.umask-mode". In our dev environment it was set to 022 but to 077 in prod, and it was the same for you as well, so that's how we figured this out. It was also difficult to track down because file.out.index was created with the correct permissions but file.out was not, which made the result of the map unreadable by the yarn user.

avatar
Contributor

I had the same issue, and we are using HDP 3.1.0.0-78.

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/release-notes/content/patch_tez.html

TEZ-3894 seems to already be applied to HDP 3.1. (I've also checked the source code a little, and yes, it looks like it's already applied.)

But I still have this issue...

I can avoid this issue by changing fs.permissions.umask-mode from "077" to "022" in an HS2 session:

0: jdbc:hive2://XXXX > set fs.permissions.umask-mode=022;


So I guess this issue may not be completely fixed by TEZ-3894 (in HDP 3.1.0.0-78)...
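
For a non-interactive variant of the same session-level workaround, a minimal sketch; the JDBC URL, user, and query are placeholders rather than values from this thread, and it assumes the property is not in HiveServer2's restricted list (it wasn't here, since set worked):

# Run a query through Beeline with the relaxed umask applied only to this session;
# this does not change the cluster-wide default.
beeline -u "jdbc:hive2://your-hs2-host:10000/default" -n hive \
  --hiveconf fs.permissions.umask-mode=022 \
  -e "SELECT COUNT(*) FROM your_table;"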

avatar

@Maurice Knopp We recently saw that TEZ-3894 only fixes the issue partially. If your job ends up spinning up multiple mappers, then you are likely to hit a variant of TEZ-3894, although on the surface it appears to be the same.
For a permanent fix, you may want to get a patch for https://issues.apache.org/jira/browse/TEZ-4057

avatar

Thanks for letting me know!

Is there any estimate/timeline for when HDP 3.1 will allow upgrading the shipped version of Tez 0.9.1 to a newer release? I don't want to upgrade/patch one component myself because I am afraid I will lose the upgradeability of the entire HDP stack when future releases surface.

avatar

@Maurice Knopp We do not have any planned dates yet. However, if you are an Enterprise Support customer, you can ask for a hotfix and you will be provided with a patch JAR, which is very easy to replace on all machines that have Tez.