Created on 09-19-2022 06:21 AM - edited 09-19-2022 06:24 AM
Hello everyone,
I am trying to run a Python script in a dockerized environment using Spark on YARN. The cluster is kerberized, and I provided the required realm and keytab to the spark-submit command.
I still face an issue where the Java side only sees a null user for some reason.
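If I read the stack trace below correctly, the UnixLoginModule fails because the UID inside the container does not resolve to a user name, which is why I mounted /etc/passwd read-only. One way to sanity-check that mapping outside of YARN (assuming MyImage is the same image referenced in the command below) would be something like:

docker run --rm -u "$(id -u):$(id -g)" -v /etc/passwd:/etc/passwd:ro MyImage whoami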
What I tried:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=MyImage \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro,/opt/cloudera/parcels/:/opt/cloudera/parcels/:ro,/etc/krb5.conf:/etc/krb5.conf:ro" \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=MyImage \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro,/opt/cloudera/parcels/:/opt/cloudera/parcels/:ro,/etc/krb5.conf:/etc/krb5.conf:ro" \
--principal MyPrincipal \
--keytab MyKeytab \
Script.py
Running this produced the following error:
[2022-09-19 16:09:06.612]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
e Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:2094)
at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:2005)
at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:743)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:693)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:604)
at org.apache.spark.deploy.SparkHadoopUtil.createSparkUser(SparkHadoopUtil.scala:74)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:810)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:2015)
at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:743)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:693)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:604)
at org.apache.spark.deploy.SparkHadoopUtil.createSparkUser(SparkHadoopUtil.scala:74)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:810)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:2094)
at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:2005)
at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:743)
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:693)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:604)
at org.apache.spark.deploy.SparkHadoopUtil.createSparkUser(SparkHadoopUtil.scala:74)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:810)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:856)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:2094)
at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:2005)
... 6 more
I have tried everything I can think of, double-checked the configuration, and followed the Hadoop documentation to make sure everything is set up properly.
The only thing I did not figure out is how to set a login user via a conf option before submitting, something similar to:
UserGroupInformation.setLoginUser(UserGroupInformation.createRemoteUser("hduser"))
That could solve my issue, but I can't add this to the script, and I don't know the conf option for it either.
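For what it's worth, here is a rough sketch of what I mean from the Python side, going through the JVM gateway. This is only an assumption on my part and I have not verified it; in cluster mode the ApplicationMaster fails before Script.py even runs, so it is probably too late anyway:

from pyspark.sql import SparkSession

# Hypothetical sketch: set the Hadoop login user from Python via py4j.
# "hduser" is just a placeholder name; this only runs after the driver JVM
# is up, so it cannot fix a login failure inside the ApplicationMaster itself.
spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm
ugi = jvm.org.apache.hadoop.security.UserGroupInformation
ugi.setLoginUser(ugi.createRemoteUser("hduser"))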
Created 09-19-2022 10:21 AM
Hey There @fares_,
Thank you for writing this in our community.
There was a similar situation with another user; please see if this thread is related:
Additionally, I can see the error code is 1 in the log snippet you shared. I was able to trace back the Spark exit code definitions[0] for you to correlate and triangulate the root cause:
And finally, did you get a chance to go through our blog covering a similar test case?
Keep us posted on how it goes.
Created 09-20-2022 03:15 AM
Thank you @vaishaakb for your answer, but sadly none of these sources helped.
I was going through Apache's documentation on Docker on YARN, and it specifies that
yarn.nodemanager.linux-container-executor.group
should be the same in both yarn-site.xml and container-executor.cfg.
I found that the value in yarn-site.xml was "hadoop" while the one in container-executor.cfg was "yarn", so I changed the one in container-executor.cfg to "hadoop" as well.
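For reference, this is roughly what the two settings look like in my case (the value "hadoop" is the one taken from my yarn-site.xml; yours may differ):

In yarn-site.xml:
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>

In container-executor.cfg:
yarn.nodemanager.linux-container-executor.group=hadoop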
That change resulted in another error when starting the job: the container executor no longer recognizes the YARN directory mounts.
[2022-09-20 13:04:35.114]Container exited with a non-zero exit code 29.
[2022-09-20 13:04:35.114]Container exited with a non-zero exit code 29.
For more detailed output, check the application tracking page: https://SERVER/cluster/app/application_1663590757906_0056 Then click on links to logs of each attempt.
. Failing the application.
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1663668271871
final status: FAILED
tracking URL: https://SERVER/cluster/app/application_1663590757906_0056
user: f.alenezi
22/09/20 13:04:36 INFO yarn.Client: Deleted staging directory hdfs://SERVER/user/f.alenezi/.sparkStaging/application_1663590757906_0056
22/09/20 13:04:36 ERROR yarn.Client: Application diagnostics message: Application application_1663590757906_0056 failed 2 times due to AM Container for appattempt_1663590757906_0056_000002 exited with exitCode: 29
Failing this attempt.Diagnostics: [2022-09-20 13:04:35.113]Exception from container-launch.
Container id: container_e55_1663590757906_0056_02_000001
Exit code: 29
Exception message: Launch container failed
Shell error output: Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Invalid docker mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:rw', realpath=/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056
Error constructing docker command, docker error code=13, error message='Invalid docker mount'
Shell output: main : command provided 4
main : run as user is f.alenezi
main : requested yarn user is f.alenezi
Creating script paths...
Creating local dirs...
If we can fix this, or at least pinpoint the root cause, it might also help with the earlier issue.
Any feedback is appreciated.
Created 10-11-2022 02:55 AM
Hi @fares_
In the application log above, we can clearly see that the Docker mount path is not found. Could you please fix the mount issue, and also verify the spark-submit parameters once more?
Shell error output: Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Could not determine real path of mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056'
Invalid docker mount '/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056:rw', realpath=/data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056
Error constructing docker command, docker error code=13, error message='Invalid docker mount'
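As a quick check (the path below is copied from your error output), you could verify on the affected NodeManager host that the directory actually exists, resolves, and has the expected ownership:

ls -ld /data01/yarn/nm/usercache/f.alenezi/appcache
realpath /data01/yarn/nm/usercache/f.alenezi/appcache/application_1663590757906_0056

Note that the application-specific directory may already have been cleaned up after the failure, so the parent directories are the more useful thing to inspect.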
Reference:
Created 10-16-2022 12:29 AM
I solved the mount issue, but that took me back to the same main issue mentioned in the original post.
I am still trying to resolve it, so any help would be appreciated.
Created 10-14-2022 06:20 AM
@fares_ , Has the reply helped resolve your issue? If so, please mark the appropriate reply as the solution, as it will make it easier for others to find the answer in the future.
Regards,
Vidya Sargur
Created 10-17-2022 02:44 AM
Hey @fares_
Sorry about the delayed update. I was away.
Q. How was the Docker mount issue resolved? Please share the steps you took.
>>> I solved the mount issue, but that took me back to the same main issue mentioned in the original post.
Are you still observing the invalid Docker mount error?
Did you get a chance to try the steps mentioned in our blog?[0]
In that blog, check out the Demo I section, "Running PySpark on the gateway machine with Dockerized Executors in a Kerberized cluster."
Keep us posted.
V
Created 10-17-2022 03:17 AM
Okay, so the mounting issue happened when I changed
yarn.nodemanager.linux-container-executor.group
in both yarn-site.xml and container-executor.cfg to "hadoop".
I later found out that this is unnecessary and that this setting only needs to match on older versions.
So I reverted the configs to the defaults, but the mounting issue still persisted.
The actual fix was to make yarn/nm/usercache owned by yarn and then delete the specific user folder; in my case I had to delete the f.alenezi folder.
Since YARN auto-generates folders and files for each job, we have to make sure the ownership is set properly.
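For anyone who runs into the same thing, this is roughly what the fix came down to on each NodeManager host (the path comes from my cluster's yarn.nodemanager.local-dirs and the group from my configuration, so adjust both to your environment):

# make sure the NodeManager usercache is owned by the yarn user
sudo chown -R yarn:hadoop /data01/yarn/nm/usercache
# remove the stale per-user cache so YARN recreates it with the correct ownership
sudo rm -rf /data01/yarn/nm/usercache/f.alenezi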
Created 10-18-2022 07:00 AM
Thanks for sharing how that was resolved.
Did we achieve the end-goal?
Also,
Q. Did you get a chance to try the steps mentioned in our blog and compare them against your spark-submit?[0]
In that blog, check out the Demo I section, "Running PySpark on the gateway machine with Dockerized Executors in a Kerberized cluster."
V
Created 10-18-2022 11:37 PM
Hello @vaishaakb ,
Sadly, we have not reached a solution for the main issue yet.
Yes, I checked that blog, and I have also gone through every piece of documentation provided by Cloudera and others to try to resolve this issue, but no luck.
I also want to point out that the blog's first demo does not work properly: the Cloudera team themselves posted output showing the error
ImportError: No module named numpy
which suggests the Docker image did not work properly with PySpark.