Created 05-16-2019 06:49 PM
Hi,
I am running HDP 3.1 (3.1.0.0-78) with 10 data nodes, and the Hive execution engine is Tez. When I run a query I get this error:
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex failed, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00, diagnostics=[Vertex vertex_1557754551780_1091_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1557754551780_1091_2_00 [Map 1] failed as task task_1557754551780_1091_2_00_000001 failed after vertex succeeded.]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
INFO : Completed executing command(queryId=hive_20190516161715_09090e6d-e513-4fcc-9c96-0b48e9b43822); Time taken: 17.935 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex failed, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00, diagnostics=[Vertex vertex_1557754551780_1091_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1557754551780_1091_2_00 [Map 1] failed as task task_1557754551780_1091_2_00_000001 failed after vertex succeeded.]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)
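For reference, the full aggregated logs for the failing application can be pulled with the YARN CLI (assuming log aggregation is enabled and you run it as a user allowed to read the application's logs):

yarn logs -applicationId application_1557754551780_1091 > application_1557754551780_1091.log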
When I traced the logs (the application ID in this example is application_1557754551780_1091), I checked the path where the map output is written, /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003, and found that the files below are created with these permissions:
-rw-------. 1 hive hadoop 28 May 16 16:17 file.out
-rw-r-----. 1 hive hadoop 32 May 16 16:17 file.out.index
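To locate the attempt's output files and check these permissions on an affected NodeManager, something like the following works (the local dir shown is the one from this cluster; yours may differ):

find /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091 -name 'file.out*' -exec ls -l {} +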
I also found this error in the NodeManager logs:
2019-05-16 16:19:05,801 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,818 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,821 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,822 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,824 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,826 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
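Each failed shuffle fetch produces one of these lines, so a quick count tells you how widespread it is (the log path is an assumption based on a typical HDP layout; adjust for your install — with a glob, grep prints a per-file count):

grep -cF 'file.out not found' /var/log/hadoop-yarn/yarn/*nodemanager*.log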
This means that file.out is not readable by the yarn user, which causes the whole task to fail.
I also checked the parent directory permissions and the umask for all users (0022), which means the files inside the output directory should be readable by other users in the same group:
drwx--x---. 3 hive hadoop 16 May 16 16:16 filecache
drwxr-s---. 3 hive hadoop 60 May 16 16:16 output
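As a sanity check, an OS umask of 0022 does indeed yield group-readable files; a minimal demonstration:

umask 0022
touch /tmp/umask022_test
ls -l /tmp/umask022_test   # -rw-r--r-- : group and others can read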
I reran the whole scenario on a different cluster (HDP version 3.0.1.0-187). There, file.out has the same permissions as file.out.index, and the queries run fine without any problems. I also switched to the yarn user and used vi to make sure it could read the contents of file.out, and it could:
-rw-r-----. 1 hive hadoop 28 May 16 16:17 file.out
-rw-r-----. 1 hive hadoop 32 May 16 16:17 file.out.index
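A non-interactive way to run the same readability check as the yarn user (the path is the one from this thread; substitute your own application/attempt IDs):

sudo -u yarn head -c 32 /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out > /dev/null \
  && echo "yarn can read file.out" \
  || echo "yarn cannot read file.out"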
When I shut down all the NodeManagers so that only one is up and running, all the queries run fine, even though file.out is still created with the same permissions. I guess that is because everything runs on the same node, so the map output is read locally by the hive-owned task itself instead of being served by the yarn-owned ShuffleHandler.
N.B.: we upgraded from HDP 2.6.2 to HDP 3.1.0.0-78.
Created 05-17-2019 03:23 PM
So far, it seems that our issues were solved by setting the HDFS setting "fs.permissions.umask-mode" to "022". In our HDP 2.7 installation this was the case out of the box; HDP 3.1 seems to have a default value of 077, which doesn't work for us and yields the error mentioned above.
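To check the effective value on a node before and after the change (in HDP this property lives in core-site.xml, typically managed via Ambari), the HDFS client can print it:

hdfs getconf -confKey fs.permissions.umask-mode   # expect 022 after the change and a service restart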
We've done some intensive testing, and the value 022 seems to work and has solved our problems, as far as I can tell. It would be great if you could verify or falsify this on your installation as well.
Let me know if I can help you with anything!
Created 05-17-2019 10:27 PM
For more documentation on how we found the solution: the Tez JIRA ticket https://issues.apache.org/jira/browse/TEZ-3894 mentions that Tez takes the permissions for its intermediate files from "fs.permissions.umask-mode". In our dev environment this was set to 022, but to 077 in prod, and it was the same for you, which is how we figured it out. It was also difficult to spot because file.out.index was created with the correct permissions but file.out was not, leaving the map output unreadable by the yarn user.
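The effect is easy to demonstrate: with a umask of 077, a newly created file gets exactly the owner-only permissions we saw on the broken file.out. (The shell umask here only illustrates the arithmetic; Tez applies fs.permissions.umask-mode, not the process umask.)

umask 077
touch /tmp/umask077_test
ls -l /tmp/umask077_test   # -rw------- : owner-only, same as the broken file.out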
Created 08-06-2021 01:25 AM
Had the same issue on CDP 7.1.6, which comes with Tez 0.9.1.
Looks like this: https://issues.apache.org/jira/browse/TEZ-4057
One workaround (probably not 100% secure) is to add the yarn user to the hive group:
usermod -a -G hive yarn
This needs to be done on all nodes and requires a restart of the YARN services.
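A sketch for applying it across the cluster, run as root (nodes.txt is a hypothetical helper file listing one NodeManager hostname per line):

for h in $(cat nodes.txt); do
  ssh "$h" 'usermod -a -G hive yarn && id yarn'   # id yarn verifies the new group membership
done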
After that the issue was gone: no more random errors for Hive on Tez.