Getting "User not found" error when starting Spark job

Explorer

We are facing a problem where we randomly get a "User not found" error when trying to start a container. Checking the NodeManager logs, I can see the following:

2019-09-15 23:55:58,529 INFO containermanager.ContainerManagerImpl (ContainerManagerImpl.java:startContainerInternal(810)) - Start request for container_e65_1565485836435_8640_02_000003 by user pyqy0srv0z50
2019-09-15 23:55:58,529 INFO application.ApplicationImpl (ApplicationImpl.java:transition(304)) - Adding container_e65_1565485836435_8640_02_000003 to application application_1565485836435_8640
2019-09-15 23:55:58,533 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e65_1565485836435_8640_02_000003 transitioned from NEW to LOCALIZING
2019-09-15 23:55:58,533 INFO containermanager.AuxServices (AuxServices.java:handle(215)) - Got event CONTAINER_INIT for appId application_1565485836435_8640
2019-09-15 23:55:58,533 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(192)) - Initializing container container_e65_1565485836435_8640_02_000003
2019-09-15 23:55:58,533 INFO yarn.YarnShuffleService (YarnShuffleService.java:initializeContainer(284)) - Initializing container container_e65_1565485836435_8640_02_000003
2019-09-15 23:55:58,533 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource hdfs://prod/user/pyqy0srv0z50/.sparkStaging/application_1565485836435_8640/__spark_conf__.zip transitioned from INIT to DOWNLOADING
2019-09-15 23:55:58,533 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:handle(712)) - Created localizer for container_e65_1565485836435_8640_02_000003
2019-09-15 23:55:58,535 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:writeCredentials(1194)) - Writing credentials to the nmPrivate file /app/data/hadoop/disk14/yarn/local/nmPrivate/container_e65_1565485836435_8640_02_000003.tokens. Credentials list:
2019-09-15 23:55:58,693 WARN privileged.PrivilegedOperationExecutor (PrivilegedOperationExecutor.java:executePrivilegedOperation(171)) - Shell execution returned exit code: 255. Privileged Execution Operation Output:
main : command provided 0
main : run as user is pyqy0srv0z50
main : requested yarn user is pyqy0srv0z50
User pyqy0srv0z50 not found
Full command array for failed execution:


2019-09-15 23:55:58,694 WARN nodemanager.LinuxContainerExecutor (LinuxContainerExecutor.java:startLocalizer(269)) - Exit code from container container_e65_1565485836435_8640_02_000003 startLocalizer is : 255
org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=255:
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:177)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
Caused by: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:944)
at org.apache.hadoop.util.Shell.run(Shell.java:848)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
... 2 more
2019-09-15 23:55:58,694 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:run(1134)) - Localizer failed
java.io.IOException: Application application_1565485836435_8640 initialization failed (exitCode=255) with output: main : command provided 0
main : run as user is pyqy0srv0z50
main : requested yarn user is pyqy0srv0z50
User pyqy0srv0z50 not found

at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:273)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1114)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationException: ExitCodeException exitCode=255:
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:177)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.startLocalizer(LinuxContainerExecutor.java:264)
... 1 more
Caused by: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:944)
at org.apache.hadoop.util.Shell.run(Shell.java:848)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:1142)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor.executePrivilegedOperation(PrivilegedOperationExecutor.java:151)
... 2 more


2019-09-15 23:55:58,694 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e65_1565485836435_8640_02_000003 transitioned from LOCALIZING to LOCALIZATION_FAILED
2019-09-15 23:55:58,697 INFO container.ContainerImpl (ContainerImpl.java:handle(1163)) - Container container_e65_1565485836435_8640_02_000003 transitioned from LOCALIZATION_FAILED to DONE

5 REPLIES

Rising Star

Hi @paleerbccm 

The issue here is that this user ID doesn't exist on one of your YARN NodeManager machines. That probably also explains the randomness of the issue: as long as no container ends up allocated on that machine, you won't run into any problems.

You need to find the host where containers fail to launch with that exception, then SSH into it and confirm the problem by running:


id pyqy0srv0z50

If the user is missing, create it on that machine and make sure its group memberships match what you have on all your other hosts. A quick way to check every node at once is sketched below.
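
Here is a minimal sketch for checking the user on every NodeManager in one pass (it assumes passwordless SSH and a file nodes.txt listing your NodeManager hostnames; both are assumptions, so adjust for your environment):

#!/usr/bin/env bash
# Report the hosts on which a user fails to resolve.
USER_TO_CHECK=pyqy0srv0z50
while read -r host; do
  # -n keeps ssh from consuming the loop's stdin;
  # 'id' exits non-zero when the user cannot be resolved
  if ssh -n "$host" id "$USER_TO_CHECK" >/dev/null 2>&1; then
    echo "OK      $host"
  else
    echo "MISSING $host"
  fi
done < nodes.txt

Any host reported as MISSING is a candidate for the failures above. When you create the user there, copy the uid, gid, and group list from a healthy host so they stay consistent across the cluster.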

Explorer
This is still happening for us. See the error from our production YARN logs below.

The issue happened again today.

AM Container for appattempt_1573918482316_0389_000001 exited with exitCode: -1000
For more detailed output, check the application tracking page: http://guerlpahdp001.fg.rbc.com:8088/cluster/app/application_1573918482316_0389 Then click on links to logs of each attempt.
Diagnostics: Application application_1573918482316_0389 initialization failed (exitCode=255) with output: main : command provided 0
main : run as user is ptzs0srv0z50
main : requested yarn user is ptzs0srv0z50
User ptzs0srv0z50 not found
Failing this attempt

Rising Star

@paleerbccm it's still the same issue, but the log you're sharing doesn't show the details we would need. Your problem is that when the container for the ApplicationMaster attempts to launch on a particular host machine as the user ptzs0srv0z50, the launch fails because that user ID doesn't exist on that machine.

What you need to do is identify where the ApplicationMaster attempted to run. You can do this from the ResourceManager's WebUI. Please refer to the following screenshot for an example and note the highlighted red box:

[Screenshot: Screen Shot 2019-11-20 at 2.36.26 PM.png — ResourceManager WebUI with the attempt's node highlighted]

You'll see in this example that the first ApplicationMaster attempt was on the host machine worker.example.com. You would then SSH into that host and run the following command to see whether this user actually exists:

id ptzs0srv0z50
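
If you prefer the command line and YARN log aggregation is enabled, you can also recover the hosts from the aggregated logs, since each container's log section is headed by the container ID and the host it ran on (a sketch; the exact header format varies between YARN versions, and a container that failed before launching may not have produced logs, in which case the WebUI route above is the reliable one):

yarn logs -applicationId application_1573918482316_0389 | grep "Container:"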


Explorer

[Screenshot: Screen Shot 2019-11-20 at 3.44.44 PM.png]


I have run the id command for the user on both nodes. Both show the user is valid.

[Screenshot: Screen Shot 2019-11-20 at 3.46.47 PM.png — id output for the user on both nodes]

Rising Star

This issue would really require further debugging. For whatever reason, at that particular time something went wrong with user ID resolution. We've seen customers with similar issues before when tools like SSSD are being used:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/deployment_guide/sssd-...

One idea here is to create a shell script that runs 'id ptzs0srv0z50' and 'id -Gn ptzs0srv0z50' in a loop at some interval, say 10, 20, or 30 seconds (a sketch follows below). When the problem occurs, go over the output of that script and see if you notice anything different at the time of the issue.
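
A minimal sketch of such a script (the interval and output path are assumptions; adjust as needed):

#!/usr/bin/env bash
# Record timestamped user/group resolution results so they can be
# correlated with YARN container failures later.
USER_TO_CHECK=ptzs0srv0z50
INTERVAL=30               # seconds between checks
OUT=/tmp/id-check.log

while true; do
  {
    date '+%Y-%m-%d %H:%M:%S'
    id "$USER_TO_CHECK" 2>&1       # uid, gid, and full group list
    id -Gn "$USER_TO_CHECK" 2>&1   # group names only
    echo
  } >> "$OUT"
  sleep "$INTERVAL"
done

Run it on each NodeManager (for example under nohup) and, after the next failure, compare the entries around the failure timestamp; an error from id at that moment would point at SSSD or whichever name service the hosts use.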