Hive - Tez, vertex failed error during reduce phase - HDP 3.1.0.0-78

Expert Contributor

Hi,

I am running HDP 3.1 (3.1.0.0-78) with 10 data nodes, and the Hive execution engine is Tez. When I run a query, I get this error:

ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex failed, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00, diagnostics=[Vertex vertex_1557754551780_1091_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1557754551780_1091_2_00 [Map 1] failed as task task_1557754551780_1091_2_00_000001 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
INFO  : Completed executing command(queryId=hive_20190516161715_09090e6d-e513-4fcc-9c96-0b48e9b43822); Time taken: 17.935 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00Vertex failed, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00, diagnostics=[Vertex vertex_1557754551780_1091_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1557754551780_1091_2_00 [Map 1] failed as task task_1557754551780_1091_2_00_000001 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)


When I traced the logs (the application ID is application_1557754551780_1091), I checked the path where the map output is written (/var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003). The files below are created there with these permissions:

-rw-------. 1 hive hadoop 28 May 16 16:17 file.out
-rw-r-----. 1 hive hadoop 32 May 16 16:17 file.out.index


Also, in the NodeManager logs I found this error:

2019-05-16 16:19:05,801 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,818 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,821 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,822 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,824 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,826 INFO  mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found


This suggests that file.out is not readable by the yarn user (the file is owned by hive and has no group read bit), which causes the whole task to fail.
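One way to confirm this kind of diagnosis on a node is to look at the group read bit of the map output file. The following is only a minimal sketch: `group_can_read` is a hypothetical helper name, not part of Hadoop or Tez, and it only inspects the mode string printed by `ls -ld`, so it ignores ACLs and the permissions of parent directories.

```shell
# Hypothetical helper (not a Hadoop/Tez tool): succeeds if the file's
# group read bit is set, judged from the mode string printed by `ls -ld`.
group_can_read() {
    # $1: path to inspect; returns 2 if the path does not exist
    [ -e "$1" ] || return 2
    # In a mode string such as "-rw-r-----", character 5 is the group read bit.
    [ "$(ls -ld -- "$1" | cut -c5)" = "r" ]
}
```

Called on a file with mode -rw------- (like file.out above) it should fail, and on a file with mode -rw-r----- (like file.out.index) it should succeed.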

I also checked the parent directory permissions and the umask for all users (0022), which means that files inside the output directory should be readable by other users in the same group:

drwx--x---. 3 hive hadoop 16 May 16 16:16 filecache
drwxr-s---. 3 hive hadoop 60 May 16 16:16 output


I reran the whole scenario on a different cluster (HDP version 3.0.1.0-187). There, file.out has the same permissions as file.out.index, and the queries run fine without any problems. I also switched to the yarn user and used vi to confirm that the yarn user is able to read the content of file.out, and it was.

-rw-r-----. 1 hive hadoop 28 May 16 16:17 file.out
-rw-r-----. 1 hive hadoop 32 May 16 16:17 file.out.index

When I shut down all the NodeManagers except one, all the queries run fine, even though file.out is still created with the same permissions. I guess this is because everything is then running on the same node.


N.B.: We upgraded from HDP 2.6.2 to HDP 3.1.0.0-78.

2 ACCEPTED SOLUTIONS

So far, it seems that our issues were solved by setting the HDFS setting "fs.permissions.umask-mode" to "022". In our HDP 2.7 installation, this was the value out of the box. HDP 3.1 seems to have a default value of 077, which doesn't work for us and yields the error mentioned above.


We've done some intensive testing and the value 022 seems to work and has solved our problems, as far as I can tell. It would be great if you could verify or falsify this on your installation as well.
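To illustrate why the umask value matters, here is a minimal sketch of the plain umask arithmetic (requested mode with the umask bits cleared). `effective_mode` is a hypothetical helper, and the requested mode of 640 is an assumption chosen to match the listings in the question; this is not Tez's actual code path.

```shell
# Hypothetical helper: print the mode a new file ends up with when the
# requested mode is masked by a umask (both given in octal).
effective_mode() {
    # $1: requested mode, e.g. 640; $2: umask, e.g. 077
    printf '%03o\n' $(( 0$1 & ~0$2 ))
}

effective_mode 640 077   # prints 600, i.e. rw------- (no group read)
effective_mode 640 022   # prints 640, i.e. rw-r----- (group can read)
```

The value a node actually resolves can be checked with `hdfs getconf -confKey fs.permissions.umask-mode`.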


Let me know if I can help you with anything!

Expert Contributor

For more background on how we found the solution: the Tez JIRA ticket https://issues.apache.org/jira/browse/TEZ-3894 mentions that Tez takes the permissions of its intermediate files from "fs.permissions.umask-mode". In our dev environment it was set to 022, but in prod it was 077, and it was the same for you, so that is how we figured this out. It was difficult to spot because file.out.index was created with the correct permissions but file.out was not, which left the map output unreadable by the yarn user.

10 REPLIES

New Contributor

Had the same issue on CDP 7.1.6, which comes with Tez 0.9.1.

Looks like this: https://issues.apache.org/jira/browse/TEZ-4057

One workaround (probably not 100% secure) is to add the yarn user to the hive group:

usermod -a -G hive yarn

This needs to be done on all nodes and requires a restart of the YARN services.

After that, the issue was gone: no more random errors for Hive on Tez.
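To check on each node that the group change actually took effect, something like the following could be used. It is only a sketch: `user_in_group` is a hypothetical helper name, and it relies on `id -nG`, which lists the names of all groups a user belongs to.

```shell
# Hypothetical helper: succeeds if user $1 is a member of group $2.
user_in_group() {
    # id -nG prints the user's group names separated by spaces;
    # put one per line and look for an exact, literal match.
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF -- "$2"
}
```

For the workaround above, `user_in_group yarn hive && echo ok` should print ok once the membership change is visible on that node.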