Created 08-10-2022 08:46 AM
Hello everyone.
I am trying to train an ML model on a cluster using a Docker image and spark-submit with YARN.
I followed the same process before on a training cluster I set up, and it worked.
This time, however, YARN reports that one of the mounts is invalid.
I tried both with and without Kerberos. Neither worked, and neither run showed any connection-related problems, so that side seems fine.
This is what I tried:
spark-submit \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=image:v1 \
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro,/opt/cloudera/parcels/:/opt/cloudera/parcels/:ro,/data01/yarn/nm/:/data01/yarn/nm/:ro,/data02/yarn/nm/:/data02/yarn/nm/:ro,/data03/yarn/nm/:/data03/yarn/nm/:ro,/etc/krb5.conf:/etc/krb5.conf:ro" \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=image:v1 \
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS="/etc/passwd:/etc/passwd:ro,/etc/hadoop:/etc/hadoop:ro,/opt/cloudera/parcels/:/opt/cloudera/parcels/:ro,/data01/yarn/nm/:/data01/yarn/nm/:ro,/data02/yarn/nm/:/data02/yarn/nm/:ro,/data03/yarn/nm/:/data03/yarn/nm/:ro,/etc/krb5.conf:/etc/krb5.conf:ro" \
modeling.py
And this is the result I got:
2022-08-10 17:43:17,694 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1658823376901_2680 failed 2 times due to AM Container for appattempt_1658823376901_2680_000002 exited with exitCode: -1
Failing this attempt.Diagnostics: [2022-08-10 17:43:17.686]Exception from container-launch.
Container id: container_e43_1658823376901_2680_02_000001
Exit code: -1
Exception message: Invalid mount : /data03/yarn/nm
Shell error output: <unknown>
Shell output: <unknown>
[2022-08-10 17:43:17.687]Container exited with a non-zero exit code -1.
[2022-08-10 17:43:17.687]Container exited with a non-zero exit code -1.
P.S.: I followed all the instructions and documentation needed to run this, and made all the necessary configuration changes. As I said, I already ran this successfully on another cluster.
Any help would be greatly appreciated.
Created 09-15-2022 05:26 AM
Apparently, having multiple directories for YARN and YARN logs causes a misconfiguration when the yarn-site.xml file is written.
The solution is to go to Cloudera Manager -> YARN -> Configuration,
then search for yarn_service_config_safety_valve.
Add a new entry by pressing the plus sign on the right:
Name: yarn.nodemanager.runtime.linux.docker.default-rw-mounts
Value: /data01/yarn/nm:/data01/yarn/nm,/data02/yarn/nm:/data02/yarn/nm,/data03/yarn/nm:/data03/yarn/nm,/data04/yarn/nm:/data04/yarn/nm,/data01/yarn/container-logs:/data01/yarn/container-logs,/data02/yarn/container-logs:/data02/yarn/container-logs,/data03/yarn/container-logs:/data03/yarn/container-logs,/data04/yarn/container-logs:/data04/yarn/container-logs
Of course, you have to adjust the directories to match your own configuration.
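In case it helps, here is a minimal sketch of how the Value string above could be assembled from a list of directories instead of typing it by hand. Note that build_mounts is a hypothetical helper I made up for illustration (it is not part of YARN or Cloudera Manager), and the directories passed to it are only examples. It assumes each host directory should be mounted at the same path inside the container, as in the value above.

```shell
#!/bin/sh
# Hypothetical helper: turn a list of host directories into a
# comma-separated list of source:destination pairs, mounting each
# directory at the same path inside the container.
build_mounts() {
  mounts=""
  for dir in "$@"; do
    # Drop a trailing slash so entries look like /data01/yarn/nm:/data01/yarn/nm
    dir="${dir%/}"
    # Prepend a comma only when the list already has an entry
    mounts="${mounts:+$mounts,}${dir}:${dir}"
  done
  printf '%s\n' "$mounts"
}

# Example directories only; replace them with the NodeManager local
# and log directories configured on your own cluster.
build_mounts /data01/yarn/nm /data02/yarn/nm /data01/yarn/container-logs
```

The printed line can then be pasted into the Value field. Every entry ends up in source:destination form, which is the same shape as the value shown above.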