Member since: 07-13-2020
Posts: 5
Kudos Received: 0
Solutions: 2
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1535 | 01-19-2021 07:10 PM |
| | 6959 | 01-19-2021 07:08 PM |
01-19-2021
07:10 PM
It's the same LDAP UID conflict issue as this one: https://community.cloudera.com/t5/Support-Questions/Hive-tez-vertex-failed-during-map-phase-HDP-3-1-4/td-p/309973
01-19-2021
07:08 PM
Finally, I managed to solve this issue. I'd like to add some details for others who might be caught in the same trap.

The root cause is that a user UID in our LDAP clashes with the UID of an HDP internal user (e.g. hive, spark, infra-solr, hbase, ...). To be specific, in the case above the user 'test' has UID 1002 in our LDAP configuration, while the HDP internal user 'infra-solr' has the same UID.

From the log we can see some INPUT_READ_ERROR events; the error is raised by a downstream task complaining that it cannot get data from the upstream task. The reason is that when the downstream task tries to fetch the upstream task's data from the shuffle service (which runs on the YARN NodeManager as an aux-service), it gets a lot of HTTP 500 errors. Why does the shuffle service return 500? Because it checks the owner of the data file produced by the upstream task, finds that the user name does not match the expected one, and raises an error. In the NodeManager logs we can see a bunch of errors like the following:

Caused by: java.io.IOException: Owner 'infra-solr' for path /xxx/hadoop/yarn/local/usercache/xxx/appcache/application_1610695610027_0576/output/attempt_1610695610027_0576_2_00_000017_0_10002/file.out.index did not match expected owner 'test'

The related source code:
- https://github.com/apache/hadoop/blob/master/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/ShuffleHandler.java
- https://github.com/apache/tez/blob/master/tez-plugins/tez-aux-services/src/main/java/org/apache/tez/auxservices/ShuffleHandler.java

The lesson here: when configuring UIDs/GIDs in LDAP, be careful about potential conflicts with the UIDs of the cluster's own service accounts.
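To illustrate the mechanism, here is a minimal sketch (not the actual Hadoop code) of the kind of owner check that SecureIOUtils performs before a shuffle or log file is served; the class name, path, and user names in main() are placeholders. Because the file system stores only a numeric UID, the resolved owner name comes back as 'infra-solr' whenever the clashing UID is looked up, even though the container ran as 'test':

```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Simplified sketch (not the actual Hadoop code) of the owner check made
 * before the shuffle service serves a task output file. The file system
 * stores only a numeric UID; the owner *name* is resolved through the OS
 * user database, so when an LDAP account and a local service account share
 * a UID, the resolved name can be 'infra-solr' even though the container
 * that wrote the file ran as 'test'.
 */
public class OwnerCheckSketch {

    static void checkOwner(Path file, String expectedOwner) throws IOException {
        // Resolve the numeric UID recorded on disk back to a user name.
        String actualOwner = Files.getOwner(file).getName();
        if (!actualOwner.equals(expectedOwner)) {
            // This mismatch is what surfaces as HTTP 500 from the shuffle
            // handler / NodeManager web services in the logs above.
            throw new IOException("Owner '" + actualOwner + "' for path " + file
                    + " did not match expected owner '" + expectedOwner + "'");
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path and user, just to exercise the check locally.
        Path file = Paths.get(args.length > 0 ? args[0] : "/tmp/file.out.index");
        String expectedOwner = args.length > 1 ? args[1] : "test";
        checkOwner(file, expectedOwner);
        System.out.println("Owner check passed for " + file);
    }
}
```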
01-19-2021
01:34 AM
When I try to look at the logs with `yarn logs -applicationId xxx`, some errors are raised:

Response from the NodeManager:dp006:45454 WebService is not successful, HTTP error code: 500, Server response: Exception reading log file. Application submitted by 'xxx' doesn't own requested log file : syslog
Can not get any logs for the log file: stderr
Response from the NodeManager:dp006:45454 WebService is not successful, HTTP error code: 500, Server response: Exception reading log file. Application submitted by 'xxx' doesn't own requested log file : stderr
Can not get any logs for the log file: prelaunch.out
Response from the NodeManager:dp006:45454 WebService is not successful, HTTP error code: 500, Server response: Exception reading log file. Application submitted by 'xxx' doesn't own requested log file : prelaunch.out
Can not get any logs for the log file: launch_container.sh

The question is similar to https://community.cloudera.com/t5/Support-Questions/Cannot-read-log-files-on-applications-started-by-Active/td-p/214539, but the exception is a different one. From the NodeManager log, the error is:

java.io.IOException: Owner 'infra-solr' for path /data02/hadoop/yarn/log/application_1610695610027_0516/container_e11_1610695610027_0516_01_000002/prelaunch.out did not match expected owner 'xxx'
at org.apache.hadoop.io.SecureIOUtils.checkStat(SecureIOUtils.java:281)
at org.apache.hadoop.io.SecureIOUtils.forceSecureOpenForRead(SecureIOUtils.java:215)
at org.apache.hadoop.io.SecureIOUtils.openForRead(SecureIOUtils.java:200)
at org.apache.hadoop.yarn.server.nodemanager.webapp.ContainerLogsUtils.openLogFileForRead(ContainerLogsUtils.java:169)
at org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getLogs(NMWebServices.java:467)
at org.apache.hadoop.yarn.server.nodemanager.webapp.NMWebServices.getContainerLogFile(NMWebServices.java:380)
at sun.reflect.GeneratedMethodAccessor279.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$Response

My question is: why is the log file owner 'infra-solr'? Was it copied by 'infra-solr' from somewhere else? Hope someone could help. Thanks!
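One way to investigate on the NodeManager host is to compare the raw numeric UID stored on the log file with the name that UID resolves to. The sketch below is only a hypothetical local diagnostic (the class name and default path are placeholders), not part of any Hadoop tooling:

```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/**
 * Minimal diagnostic sketch, assuming shell access on the NodeManager host.
 * It prints the raw numeric UID stored on a container log file together with
 * the user name that UID currently resolves to. Pass the real prelaunch.out
 * path from the error message as the first argument.
 */
public class LogOwnerProbe {
    public static void main(String[] args) throws IOException {
        Path logFile = Paths.get(args.length > 0 ? args[0]
                : "/data02/hadoop/yarn/log/application_xxx/container_xxx/prelaunch.out");
        // Numeric UID recorded by the file system (needs the "unix" attribute view, i.e. Linux).
        Object uid = Files.getAttribute(logFile, "unix:uid");
        // Name the OS user database resolves that UID to right now.
        String ownerName = Files.getOwner(logFile).getName();
        System.out.println("uid=" + uid + " currently resolves to owner '" + ownerName + "'");
    }
}
```

If the UID printed does not resolve to the user who submitted the application, the ownership recorded on disk, rather than a later copy, is what the NodeManager is complaining about.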
Labels:
- Apache Hadoop
- Kerberos
01-19-2021
12:46 AM
I'd like to add more information after digging into the logs. I found some debug log with INPUT_READ_ERROR_EVENT triggered (from log file syslog_dag_1610695610027_0484_1). It comes from a similar SQL statement with some aggregation, run from the hive shell, which raised the same error. It looks like when the Reducer tried to read the result of the Mapper, an error occurred and the data could not be passed from the Mapper to the Reducer. The question still remains: why does the SQL (on Tez) succeed for the hive user but fail for other users?
01-19-2021
12:04 AM
Hi there, I'm running Hive on HDP 3.1.4 with Kerberos enabled. Currently there is a strange issue: some SQL succeeds only if we run it as the system user `hive`; otherwise it fails with a map error. A simple case is as follows:

1. Create a table named `test` with columns (id int, val string) and data [(1, '1'), (2, '2')].
2. Run the SQL as user hive: `select * from test t join test t1 on t.id=t1.id;`. It succeeds.
3. Run the same SQL as some other user (e.g. test): `select * from test t join test t1 on t.id=t1.id;`. It fails with a vertex issue:

```
INFO : Compiling command(queryId=hive_20210119153558_48453ef9-4b06-48ed-b9f6-f64f82992d81): select * from test t join test t1 on t.id=t1.id
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Semantic Analysis Completed (retrial = false)
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.id, type:int, comment:null), FieldSchema(name:t.val, type:string, comment:null), FieldSchema(name:t1.id, type:int, comment:null), FieldSchema(name:t1.val, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20210119153558_48453ef9-4b06-48ed-b9f6-f64f82992d81); Time taken: 0.211 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20210119153558_48453ef9-4b06-48ed-b9f6-f64f82992d81): select * from test t join test t1 on t.id=t1.id
INFO : Query ID = hive_20210119153558_48453ef9-4b06-48ed-b9f6-f64f82992d81
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Subscribed to counters: [] for queryId: hive_20210119153558_48453ef9-4b06-48ed-b9f6-f64f82992d81
INFO : Session is already open
INFO : Dag name: select * from test t join test ...t.id=t1.id (Stage-1)
INFO : Setting tez.task.scale.memory.reserve-fraction to 0.30000001192092896
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO : Session re-established.
INFO : Status: Running (Executing on YARN cluster with App id application_1610695610027_0489)
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 2 ..........      container     SUCCEEDED      1          1        0        0       4       0
Map 1                 container       RUNNING      1          0        1        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 01/02  [=============>>-------------] 50%   ELAPSED TIME: 5.98 s
----------------------------------------------------------------------------------------------
ERROR : Status: Failed
ERROR : Vertex re-running, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00
ERROR : Vertex re-running, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00
ERROR : Vertex re-running, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00
ERROR : Vertex failed, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00, diagnostics=[Vertex vertex_1610695610027_0489_1_00 [Map 2] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1610695610027_0489_1_00 [Map 2] failed as task task_1610695610027_0489_1_00_000000 failed after vertex succeeded.]
ERROR : DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
INFO : org.apache.tez.common.counters.DAGCounter:
INFO :    NUM_FAILED_TASKS: 4
INFO :    NUM_KILLED_TASKS: 1
INFO :    NUM_SUCCEEDED_TASKS: 4
INFO :    TOTAL_LAUNCHED_TASKS: 5
INFO :    AM_CPU_MILLISECONDS: 3300
INFO :    AM_GC_TIME_MILLIS: 0
INFO : org.apache.hadoop.hive.ql.exec.tez.HiveInputCounters:
INFO :    GROUPED_INPUT_SPLITS_Map_1: 1
INFO :    GROUPED_INPUT_SPLITS_Map_2: 1
INFO :    INPUT_DIRECTORIES_Map_1: 1
INFO :    INPUT_DIRECTORIES_Map_2: 1
INFO :    INPUT_FILES_Map_1: 1
INFO :    INPUT_FILES_Map_2: 1
INFO :    RAW_INPUT_SPLITS_Map_1: 1
INFO :    RAW_INPUT_SPLITS_Map_2: 1
ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00Vertex re-running, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00Vertex re-running, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00Vertex failed, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00, diagnostics=[Vertex vertex_1610695610027_0489_1_00 [Map 2] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1610695610027_0489_1_00 [Map 2] failed as task task_1610695610027_0489_1_00_000000 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0
INFO : Completed executing command(queryId=hive_20210119153558_48453ef9-4b06-48ed-b9f6-f64f82992d81); Time taken: 12.207 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00Vertex re-running, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00Vertex re-running, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00Vertex failed, vertexName=Map 2, vertexId=vertex_1610695610027_0489_1_00, diagnostics=[Vertex vertex_1610695610027_0489_1_00 [Map 2] killed/failed due to:OWN_TASK_FAILURE, Vertex vertex_1610695610027_0489_1_00 [Map 2] failed as task task_1610695610027_0489_1_00_000000 failed after vertex succeeded.]DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0 (state=08S01,code=2)
```

From the log above, we can see one Map task failed 4 times, which failed the query. Since I cannot upload files, I did not attach the full application log. I would really appreciate it if anyone could have a look. Thanks!
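For reference, here is a minimal JDBC sketch of the same repro, assuming a kinit'ed Kerberos ticket for the connecting user and the hive-jdbc driver on the classpath; the class name, host, port, and HiveServer2 service principal below are placeholders, not this cluster's real values. Run it once with a ticket for 'hive' and once with a ticket for another user (e.g. 'test') to compare the outcomes:

```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Minimal JDBC repro sketch: create the small test table and run the
 * self-join under whichever user currently holds the Kerberos ticket.
 */
public class SelfJoinRepro {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 endpoint and service principal.
        String url = "jdbc:hive2://hs2-host:10000/default;principal=hive/_HOST@EXAMPLE.COM";
        try (Connection conn = DriverManager.getConnection(url);
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS test (id INT, val STRING)");
            st.execute("INSERT INTO test VALUES (1, '1'), (2, '2')");
            // The same statement that succeeds as 'hive' but fails as 'test'.
            try (ResultSet rs = st.executeQuery(
                    "SELECT * FROM test t JOIN test t1 ON t.id = t1.id")) {
                while (rs.next()) {
                    System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
                }
            }
        }
    }
}
```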
Labels:
- Apache Hive
- Apache Tez