Created 05-16-2019 06:49 PM
Hi,
I am running HDP 3.1 (3.1.0.0-78) with 10 data nodes, and the Hive execution engine is Tez. When I run a query I get this error:
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex failed, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00, diagnostics=[Vertex vertex_1557754551780_1091_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1557754551780_1091_2_00 [Map 1] failed as task task_1557754551780_1091_2_00_000001 failed after vertex succeeded.]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
INFO : Completed executing command(queryId=hive_20190516161715_09090e6d-e513-4fcc-9c96-0b48e9b43822); Time taken: 17.935 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask.
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex re-running, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00
Vertex failed, vertexName=Map 1, vertexId=vertex_1557754551780_1091_2_00, diagnostics=[Vertex vertex_1557754551780_1091_2_00 [Map 1] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1557754551780_1091_2_00 [Map 1] failed as task task_1557754551780_1091_2_00_000001 failed after vertex succeeded.]
DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)
When I traced the logs (the application ID in this example is application_1557754551780_1091),
I checked the path where the map output is written (/var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003). The files below are created with these permissions:
-rw-------. 1 hive hadoop 28 May 16 16:17 file.out
-rw-r-----. 1 hive hadoop 32 May 16 16:17 file.out.index
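Note the difference: file.out is mode 600 (owner-only) while file.out.index is 640 (group-readable). Just to illustrate the umask-to-mode mapping with plain POSIX tools (an analogy only, not the exact code path Hadoop uses):

```bash
# umask 077: new files are created 600 (rw-------), invisible to group members such as yarn
( umask 077 && touch /tmp/demo_077 && ls -l /tmp/demo_077 )
# umask 022: new files are created 644 (rw-r--r--), readable by the group
( umask 022 && touch /tmp/demo_022 && ls -l /tmp/demo_022 )
```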
I also found this error in the NodeManager logs:
2019-05-16 16:19:05,801 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,818 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,821 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,822 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,824 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
2019-05-16 16:19:05,826 INFO mapred.ShuffleHandler (ShuffleHandler.java:sendMapOutput(1268)) - /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out not found
This means that file.out is not readable by the yarn user, which causes the whole task to fail.
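You can confirm this directly on the NodeManager host by trying to read the file as the yarn user (the NodeManager, and therefore its ShuffleHandler, runs as yarn); the path is taken from the log above:

```bash
# As yarn, try to read the map output that the hive user produced (mode 600 -> should fail):
sudo -u yarn cat /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/output/attempt_1557754551780_1091_2_00_000000_0_10003/file.out
# Expected on the broken cluster: "Permission denied"
```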
I also checked the parent directory permissions and the umask for all users (0022), which means the files inside the output directory should be readable by other users in the same group:
drwx--x---. 3 hive hadoop 16 May 16 16:16 filecache
drwxr-s---. 3 hive hadoop 60 May 16 16:16 output
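For anyone who wants to repeat these checks, this is roughly what I ran on a NodeManager host while the application directories still existed:

```bash
# Login umask of the service users (both reported 0022 here):
sudo -u hive sh -c umask
sudo -u yarn sh -c umask
# Permissions of the per-application directories shown above:
ls -ld /var/lib/hadoop/yarn/local/usercache/hive/appcache/application_1557754551780_1091/{filecache,output}
```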
I reran the whole scenario on a different cluster (HDP version 3.0.1.0-187): there, file.out has the same permissions as file.out.index, and the queries run fine without any problems. I also switched to the yarn user and used vi to confirm that the yarn user is able to read the content of file.out, and it was.
-rw-r-----. 1 hive hadoop 28 May 16 16:17 file.out
-rw-r-----. 1 hive hadoop 32 May 16 16:17 file.out.index
When I shut down all the NodeManagers so that only one is up and running, all queries run fine. file.out is still being created with the same permissions, but I guess it works because everything is running on the same node.
N.B.: we upgraded from HDP 2.6.2 to HDP 3.1.0.0-78.
Created 05-17-2019 08:40 AM
Hi, we are facing more or less exactly the same issue on HDP 3.1.0.0-78, on a cluster with 11 nodes.
Maybe we can talk / chat and work out a solution. I contacted you on LinkedIn 🙂
Created 05-17-2019 09:49 AM
Yeah, sure, I will happily work with you to get this fixed.
Created 05-17-2019 03:23 PM
So far, it seems that our issues were solved by setting the HDFS setting "fs.permissions.umask-mode" to "022". In our HDP 2.7 installation, this was the case out of the box. HDP 3.1 seems to have a default value of 077, which doesn't work for us and yields the error mentioned above.
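For reference, you can check the effective value on any node that has the HDFS client configs deployed; the property itself lives in core-site.xml (in Ambari: HDFS > Configs):

```bash
# Effective value before the change (077 on the broken cluster):
hdfs getconf -confKey fs.permissions.umask-mode
# Change it to 022 in core-site.xml via Ambari and restart the affected services;
# re-running the command should then print 022.
```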
We've done some intensive testing right now, and the value 022 seems to work and has solved our problems, as far as I can tell. It would be great if you could verify or falsify the issue on your installation as well.
Let me know if I can help you with anything!
Created 05-17-2019 10:12 PM
Glad to work with you and your team to get this issue fixed.
Created 05-17-2019 10:27 PM
For more documentation on how we found the solution: the Tez JIRA ticket https://issues.apache.org/jira/browse/TEZ-3894 mentions that Tez takes the permissions of its intermediate files from "fs.permissions.umask-mode". In our dev environment it was set to 022 but to 077 in prod, and it was the same for you as well; that is how we figured this out. It was also tricky to track down because file.out.index was created with the correct permissions but file.out was not, which made the map output unreadable by the yarn user.
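A rough sketch of the comparison that exposed the difference (host names are placeholders):

```bash
# Compare the setting across environments; for us, dev returned 022 and prod 077
for h in dev-node1 prod-node1; do
  echo -n "$h: "
  ssh "$h" 'hdfs getconf -confKey fs.permissions.umask-mode'
done
```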
Created 05-22-2019 05:54 PM
I had the same issue, and we are using HDP 3.1.0.0-78.
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/release-notes/content/patch_tez.html
TEZ-3894 seems to already be applied in HDP 3.1. (I've also checked the source code a little, and yes, it looks like it is applied.)
But I still have this issue...
I can avoid this issue by changing fs.permissions.umask-mode from "077" to "022" in an HS2 session:
0: jdbc:hive2://XXXX> set fs.permissions.umask-mode=022;
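The same per-session workaround can also be scripted non-interactively; something like this should work (the JDBC URL and table name are placeholders):

```bash
# XXXX and some_table are placeholders; the set only affects this one session
beeline -u "jdbc:hive2://XXXX:10000/default" \
        -e "set fs.permissions.umask-mode=022; select count(*) from some_table;"
```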
So I guess this issue may not be completely fixed by TEZ-3894 (at least with HDP 3.1.0.0-78)...
Created 05-23-2019 03:41 AM
@Maurice Knopp We recently saw that TEZ-3894 only fixes the issue partially. If your job ends up spinning up multiple mappers, you are likely to hit a variant of TEZ-3894, although on the surface it appears to be the same.
For a permanent fix, you may want to get a patch for https://issues.apache.org/jira/browse/TEZ-4057
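If you do apply a TEZ-4057 patch yourself, these are roughly the bits it would have to replace on an HDP node; the paths follow the standard HDP layout, but treat this as a sketch:

```bash
# Local Tez client libraries installed on each node:
find /usr/hdp/current/tez-client -name 'tez-runtime-library-*.jar'
# Jobs normally pull Tez from HDFS as well (tez.lib.uris), so that tarball needs patching too:
hdfs dfs -ls /hdp/apps/*/tez/tez.tar.gz
```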
Created 05-23-2019 07:06 AM
Thanks for letting me know!
Is there any estimate/timeline for when HDP 3.1 will allow upgrading the shipped Tez 0.9.1 to a newer release? I don't want to upgrade/patch one component myself, because I am afraid I will lose the upgradeability of the entire HDP stack when future releases surface...
Created 05-23-2019 01:26 PM
@Maurice Knopp We do not have any planned dates yet. However, if you are an Enterprise Support customer, you can ask for a hotfix and you will be provided a patch jar, which is very easy to replace on all machines that have Tez.
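For anyone who receives such a hotfix jar, rolling it out is roughly this (host list, jar name, and exact target path are placeholders; don't forget the Tez tarball in HDFS that containers actually use):

```bash
# Copy the hotfix jar to every node that has the Tez client installed:
for h in $(cat tez_hosts.txt); do
  scp tez-runtime-library-0.9.1-hotfix.jar "$h":/usr/hdp/current/tez-client/
done
# Then rebuild and re-upload the tez.tar.gz under /hdp/apps/<version>/tez/ in HDFS
# so that new containers pick up the fix as well.
```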