Created on 09-19-2022 06:21 AM - edited 09-19-2022 06:24 AM
Hello everyone,
I am trying to run a Python script in a dockerized environment using Spark on YARN. The cluster is kerberized, and I provided the required realm and keytab to the spark-submit command.
I still face an issue where the Java side only sees a null user for some reason.
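If I read the stack trace below correctly, the UnixLoginModule fails because the UID inside the container does not resolve to a user name, which is why I mounted /etc/passwd read-only. One way to sanity-check that mapping outside of YARN (assuming MyImage is the same image referenced in the command below) would be something like:

docker run --rm -u "$(id -u):$(id -g)" -v /etc/passwd:/etc/passwd:ro MyImage whoami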
What I tried:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=MyImage \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro,/opt/cloudera/parcels/:/opt/cloudera/parcels/:ro,/etc/krb5.conf:/etc/krb5.conf:ro" \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=MyImage \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro,/opt/cloudera/parcels/:/opt/cloudera/parcels/:ro,/etc/krb5.conf:/etc/krb5.conf:ro" \
--principal MyPrincipal \
--keytab MyKeytab \
Script.py
Running this produced the following error:
[2022-09-19 16:09:06.612]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
e Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:2094)
at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:2005)
at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:743)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:693)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:604)
at org.apache.spark.deploy.SparkHadoopUtil.createSparkUser(SparkHadoopUtil.scala:74)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:810)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:2015)
at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:743)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:693)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:604)
at org.apache.spark.deploy.SparkHadoopUtil.createSparkUser(SparkHadoopUtil.scala:74)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:810)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:2094)
at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:2005)
at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:743)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:693)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:604)
at org.apache.spark.deploy.SparkHadoopUtil.createSparkUser(SparkHadoopUtil.scala:74)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:810)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:856)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:2094)
at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:2005)
... 6 more
I have tried everything I can think of, double-checked the configuration, and followed the Hadoop documentation to make sure everything is set up properly.
The only thing I did not figure out is how to set a login user via a conf option before submitting, something similar to:
UserGroupInformation.setLoginUser(UserGroupInformation.createRemoteUser("hduser"))
That could solve my issue, but I can't add this to the script, and I don't know the conf option for it either.
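For what it's worth, here is a rough sketch of what I mean from the Python side, going through the JVM gateway. This is only an assumption on my part and I have not verified it; in cluster mode the ApplicationMaster fails before Script.py even runs, so it is probably too late anyway:

from pyspark.sql import SparkSession

# Hypothetical sketch: set the Hadoop login user from Python via py4j.
# "hduser" is just a placeholder name; this only runs after the driver JVM
# is up, so it cannot fix a login failure inside the ApplicationMaster itself.
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm
ugi = jvm.org.apache.hadoop.security.UserGroupInformation
ugi.setLoginUser(ugi.createRemoteUser("hduser"))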
Created 09-19-2022 10:21 AM
Hey There @fares_,
Thank you for writing this in our community.
There was a similar situation with another user; please see if this thread is related:
Additionally, I can see the error code is 1 in the log snippet you shared. I was able to trace back the Spark exit code definitions[0] for you to correlate and triangulate the root cause:
And finally, did you get a chance to go through our blog covering a similar test case?
Keep us posted on how it goes.
Created 09-20-2022 03:15 AM
Thank you @vaishaakb for your answer, but sadly none of these sources helped.
I was going through Apache's documentation on Docker on YARN, and it specifies that
yarn.nodemanager.linux-container-executor.group
should be the same in both yarn-site.xml and container-executor.cfg.
I found that the value in yarn-site.xml was "hadoop" while the one in container-executor.cfg was "yarn", so I changed the one in container-executor.cfg to "hadoop" as well.
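For reference, this is roughly what the two settings look like in my case (the value "hadoop" is the one taken from my yarn-site.xml; yours may differ):

In yarn-site.xml:
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>

In container-executor.cfg:
yarn.nodemanager.linux-container-executor.group=hadoop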
That change resulted in another error when starting the job: the container executor no longer recognizes the YARN directory mounts.
[2022-09-20 13:04:35.114]Container exited with a non-zero exit code 29.
[2022-09-20 13:04:35.114]Container exited with a non-zero exit code 29.
For more detailed output, check the application tracking page: https://SERVER/cluster/app/application_1663590757906_0056 Then click on links to logs of each attempt.
. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1663668271871
final status: FAILED
tracking URL: https://SERVER/cluster/app/application_1663590757906_0056
user: f.alenezi
22/09/20 13:04:36 INFO yarn.Client: Deleted staging directory hdfs://SERVER/user/f.alenezi/.sparkStaging/application_1663590757906_0056
22/09/20 13:04:36 ERROR yarn.Client: Application diagnostics message: Application application_1663590757906_0056 failed 2 times due to AM Container for appattempt_1663590757906_0056_000002 exited with exitCode: 29
Failing this attempt.Diagnostics: [2022-09-20 13:04:35.113]Exception from container-launch.
Container id: container_e55_1663590757906_0056_02_000001
Exit code: 29
Exception message: Launch container failed
Shell error output: Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Invalid docker mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:rw', realpath=/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056
Error constructing docker command, docker error code=13, error message='Invalid docker mount'
Shell output: main : command provided 4
main : run as user is f.alenezi
main : requested yarn user is f.alenezi
Creating script paths...
Creating local dirs...
If we can fix this, or at least pinpoint the root cause, it might also help with the earlier issue.
Any feedback is appreciated.
Created 10-11-2022 02:55 AM
Hi @fares_
In the application log above, we can clearly see that the Docker mount path is not found. Could you please fix the mount issue, and also verify the spark-submit parameters once more?
Shell error output: Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Invalid docker mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:rw', realpath=/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056
Error constructing docker command, docker error code=13, error message='Invalid docker mount'
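As a quick check (the path below is copied from your error output), you could verify on the affected NodeManager host that the directory actually exists, resolves, and has the expected ownership:

ls -ld /data01/yarn/nm/usercache/f.alenezi/appcache
realpath /data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056

Note that the application-specific directory may already have been cleaned up after the failure, so the parent directories are the more useful thing to inspect.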
Reference:
Created 10-16-2022 12:29 AM
I solved the mount issue, but that took me back to the same main issue mentioned in the original post.
I am still trying to resolve it, so any help would be appreciated.
Created 10-14-2022 06:20 AM
@fares_ , Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
Regards,
Vidya Sargur
Created 10-17-2022 02:44 AM
Hey @fares_
Sorry about the delayed update. I was away.
Q. How was the Docker mount issue resolved? Please share the steps you took.
>>> I solved the mount issue, but that took me back to the same main issue mentioned in the original post.
Are you still observing the invalid Docker mount error?
Did you get a chance to try the steps mentioned in our blog?[0]
In that blog, check out the Demo I section, "Running PySpark on the gateway machine with Dockerized Executors in a Kerberized cluster."
Keep us posted.
V
Created 10-17-2022 03:17 AM
Okay, so the mounting issue happened when I changed
yarn.nodemanager.linux-container-executor.group
in both yarn-site.xml and container-executor.cfg to "hadoop".
I later found out that this is unnecessary and that this setting only needs to match on older versions.
So I reverted the configs to the defaults, but the mounting issue still persisted.
The actual fix was to make yarn/nm/usercache owned by yarn and then delete the specific user folder; in my case I had to delete the f.alenezi folder.
Since YARN auto-generates folders and files for each job, we have to make sure the ownership is set properly.
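For anyone who runs into the same thing, this is roughly what the fix came down to on each NodeManager host (the path comes from my cluster's yarn.nodemanager.local-dirs and the group from my configuration, so adjust both to your environment):

# make sure the NodeManager usercache is owned by the yarn user
sudo chown -R yarn:hadoop /data01/yarn/nm/usercache
# remove the stale per-user cache so YARN recreates it with the correct ownership
sudo rm -rf /data01/yarn/nm/usercache/f.alenezi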
Created 10-18-2022 07:00 AM
Thanks for sharing how that was resolved.
Did we achieve the end-goal?
Also,
Q. Did you get a chance to try the steps mentioned in our blog and compare them against your spark-submit?[0]
In that blog, check out the Demo I section, "Running PySpark on the gateway machine with Dockerized Executors in a Kerberized cluster."
V
Created 10-18-2022 11:37 PM
Hello @vaishaakb ,
Sadly, we have not reached a solution for the main issue yet.
Yes, I checked that blog, and I have also gone through every piece of documentation provided by Cloudera and others to try to resolve this issue, but no luck.
I also want to point out that the blog's first demo does not work properly: the Cloudera team themselves posted output showing the error
ImportError: No module named numpy
which suggests the Docker image did not work properly with PySpark.