Created on 03-31-2015 04:26 AM - edited 09-16-2022 02:25 AM
I am having some problems with YARN, and this is not the first cluster where it has happened, so I don't understand what I am doing wrong. Every night I shut down the clusters (installed on AWS and SoftLayer) to avoid paying for them while they are idle. Also, sooner or later I need bigger machines, so I change the AWS instance type (or the SoftLayer equivalent). At some point that I cannot pin down, after a particular restart, YARN starts having problems with the NodeManager user cache directory (e.g. /bigdata1/yarn/nm/usercache/m.giusto), as in this case (https://community.cloudera.com/t5/Data-Ingestion-Integration/Sqoop-Error-Sqoop-2-not-working-through...), and I am forced to remove everything from all the user cache directories (which is acceptable); otherwise jobs are unable to start.
However, the bigger problem is that YARN also starts applying an unwanted rule under which every user who submits a job is treated as not allowed, and YARN starts the job as "nobody" (the default value of yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user). This happens for a regular user like "m.giusto" (UID over 1000) and also for "hdfs" (UID less than 500). I have tried moving "hdfs" from "banned.users" to "allowed.system.users" and setting "min.user.id" to 0, with no change. Moreover, the "nobody" user is not able to write to the real user's cache folder (permission denied), so the job fails:
main : user is nobody
main : requested yarn user is m.giusto
Can't create directory /bigdata1/yarn/nm/usercache/m.giusto/appcache/application_1427799738120_0001 - Permission denied
Can't create directory /bigdata2/yarn/nm/usercache/m.giusto/appcache/application_1427799738120_0001 - Permission denied
Did not create any app directories
.Failing this attempt.. Failing the application.
What I do not understand is why the system starts applying these rules and how to fix it. At the moment the only solution is to reinstall the cluster.
Some other info: the OS is CentOS 6.6, and the CDH versions tested are 5.2.1, 5.3.1 and 5.3.2.
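For readers trying to reason about the "banned.users", "allowed.system.users" and "min.user.id" settings mentioned above, here is a rough Python sketch of how those three keys are commonly understood to interact. It is a simplified model written for illustration, not the container-executor's actual code: the function names (parse_cfg, check_user), the example config values and the UIDs are assumptions, and any built-in defaults the real binary applies when a key is missing are deliberately not modelled.

```python
# Simplified model of the user checks configured in container-executor.cfg.
# Assumption: only the three keys discussed in this thread are considered.

DEFAULT_MIN_USER_ID = 1000  # assumed fallback when min.user.id is absent


def parse_cfg(text):
    """Parse key=value lines, ignoring blank lines and # comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        cfg[key.strip()] = value.strip()
    return cfg


def _csv(value):
    """Split a comma-separated value into a set of non-empty names."""
    return {item.strip() for item in value.split(",") if item.strip()}


def check_user(user, uid, cfg):
    """Return (allowed, reason) for a user under the simplified rules."""
    banned = _csv(cfg.get("banned.users", ""))
    allowed_system = _csv(cfg.get("allowed.system.users", ""))
    min_uid = int(cfg.get("min.user.id", DEFAULT_MIN_USER_ID))
    if user in banned:
        return False, "user is listed in banned.users"
    if uid < min_uid and user not in allowed_system:
        return False, "uid is below min.user.id and user is not in allowed.system.users"
    return True, "ok"


if __name__ == "__main__":
    # Example values only; check the file on your own NodeManager hosts.
    example = parse_cfg("""
    banned.users=hdfs,yarn,mapred,bin
    min.user.id=1000
    allowed.system.users=nobody,impala,hive,llama
    """)
    for name, uid in [("m.giusto", 1001), ("hdfs", 496), ("nobody", 99)]:
        print(name, check_user(name, uid, example))
```

Note that these checks only decide whether an account is accepted. They do not choose which account a container runs as, which would fit with editing them having no effect on the "nobody" behaviour described above.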
It sounds like YARN is running under the Linux Container Executor (LCE), which is used to provide secure containers or resource isolation (using cgroups). If you don't need these features, or if this was enabled accidentally, then you can probably fix the problem by unchecking "Always Use Linux Container Executor" in the YARN configuration in Cloudera Manager.
Or, if you do need LCE, then one thing to check is that your local users (e.g. m.giusto) exist on all nodes.
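A quick way to do that check is a small script run on every node (for example over ssh). The sketch below uses Python's standard pwd module, which goes through the system's normal account lookup, so LDAP-backed users should resolve as well if the node is configured for them. The function name lookup_uid is just an illustrative choice.

```python
import pwd
import socket
import sys


def lookup_uid(username):
    """Return the numeric UID of an account on this host, or None if it is unknown."""
    try:
        return pwd.getpwnam(username).pw_uid
    except KeyError:
        return None


if __name__ == "__main__":
    # Usage: python check_users.py m.giusto hdfs yarn
    host = socket.gethostname()
    for name in sys.argv[1:]:
        uid = lookup_uid(name)
        status = f"uid={uid}" if uid is not None else "MISSING"
        print(f"{host}: {name}: {status}")
```

Comparing the output across nodes also shows whether the same user has the same UID everywhere.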
Thanks for the quick answer. You are right, the problem is that the "Always Use Linux Container Executor" flag is enabled. I have unchecked it and now things seem to be working.
However, the description of the "Always Use Linux Container Executor" flag in Cloudera Manager says "Cgroups enforcement only works when the Linux Container Executor is used", so if I want to use a "Static Resource Pool" in which YARN gets only X% of the resources, I have to keep the flag enabled (now I also understand when the flag got checked: after I first configured the resource pool...). So I installed what is needed for cgroups (libcgroup) and re-enabled the flag.
Now if I run a YARN application (such as a Hive query), everything works. If instead I run an Oozie job with a shell action inside, the shell action is executed by the "nobody" user (the real Oozie user is "m.giusto"). Normally shell actions are executed as "yarn", so I added "yarn" to "allowed.system.users" and removed it from "banned.users". "nobody" still remains the MR user. Any ideas?
Thanks for the reply. I didn't know about the "yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users" property. I will explore the two options in order to choose the better one for the current experiments.
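To make the effect of that property concrete, here is a small Python model of how the run-as user appears to be chosen under the Linux Container Executor, as far as I understand the Hadoop documentation: on a non-secure cluster with limit-users left at its default of true, every container runs as the configured local-user ("nobody" by default); with limit-users set to false, or on a secure (Kerberos) cluster, the container runs as the requesting user. The function name and parameters are illustrative, and this is a model of the behaviour, not the NodeManager's code.

```python
def run_as_user(requested_user, security_enabled=False,
                limit_users=True, local_user="nobody"):
    """Model which OS user a container runs as under the Linux Container Executor.

    limit_users models ...nonsecure-mode.limit-users (assumed default: true).
    local_user  models ...nonsecure-mode.local-user  (assumed default: nobody).
    """
    if security_enabled or not limit_users:
        return requested_user
    return local_user


if __name__ == "__main__":
    print(run_as_user("m.giusto"))                         # non-secure, defaults
    print(run_as_user("m.giusto", limit_users=False))      # limit-users disabled
    print(run_as_user("m.giusto", security_enabled=True))  # secure cluster
```

Under this model, the "main : user is nobody / requested yarn user is m.giusto" lines in the log above are what the default non-secure settings would produce.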
I have followed the instructions to let the container run as the Linux user who launches the application, but the actual user is still nobody, even though I can see this:
main : run as user is nobody
main : requested yarn user is panahi
I have OpenLDAP syncing users and groups on all the nodes across the cluster. My only problem is that the YARN containers are launched either by yarn, in a non-secure cluster with default values, or by nobody, when you change "Limit Nonsecure Container Executor Users" to false and "yarn.nodemanager.container-executor.class" to true.
Although Spark itself runs as the user who submits it, the result of this snippet, in which Spark calls another application, is always either yarn or nobody:
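// Run in spark-shell, where sc is the SparkContext
val test = sc.parallelize(Seq("test user")).repartition(1)
val piped = test.pipe(Seq("whoami")) // pipe the partition through the external whoami command
val c = piped.collect()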
Hi @Harsh J
Thanks for your response. Yes, these are the two configs you mentioned, and I also checked all the "safety valves"; there is nothing in them related to Linux or cgroups:
I have even removed the "nobody" user from the allowed users and left "nonsecure-mode.local-user" empty, but it still says "nobody". If I revert all the changes, it says "yarn". So these configs do affect something somewhere.
Cloudera Express: 5.15.1
Java Version: 1.8.0_181
UPDATE: one more thing that might be useful: when I download the client configuration from CM, I can't find these two configs in any of the files. Not sure if that is normal.
Hi @Harsh J
Your mention of the safety valve gave me an idea! I thought maybe the UI in CM was not setting one or both of those key/values. So I set them manually and it worked! Now every container requested by Spark's pipe() has the same owner as the Spark application itself (no more nobody or yarn! There must be something in the UI that doesn't map one of those two configs back to yarn-site.xml):
I had to fix this by changing the 'yarn' user to have a umask of '0'.
I would suggest adding this fix to Cloudera Manager.
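For anyone wondering why a umask could matter here: the post does not say which directories were affected, but one plausible reading is that directories created by a process running as 'yarn' get their permission bits masked by that user's umask, so a restrictive umask can leave directories that a different run-as user cannot write to. The arithmetic is easy to sketch in Python (effective_mode is an illustrative helper, not part of any Hadoop tool):

```python
def effective_mode(requested_mode, umask):
    """Permission bits a new file or directory ends up with under a given umask."""
    return requested_mode & ~umask & 0o7777


if __name__ == "__main__":
    for umask in (0o000, 0o022, 0o077):
        mode = effective_mode(0o777, umask)
        print(f"umask {umask:03o}: mkdir with mode 0777 gives {mode:04o}")
```

A umask of 0 is very permissive, so it is worth weighing against a narrower fix such as correcting ownership or group membership on the affected directories.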