When a user submits a job to YARN via Spark or Samza, the job gets executed as the "yarn" user. How can we make sure the job runs as the same user who submitted it?
I believe we can do something like this:
For example, if you are running spark-shell you can add the configuration below to core-site.xml and launch the shell with --proxy-user <username>. Note that the account named in the property keys is the superuser that performs the impersonation (i.e. the account actually running spark-shell), not the user being impersonated:

<property>
  <name>hadoop.proxyuser.<superuser>.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.<superuser>.groups</name>
  <value>*</value>
</property>

Command to run spark-shell on YARN with a proxy user:
spark-shell --master yarn --deploy-mode client --proxy-user <username>

(On Spark versions before 2.0 the equivalent is --master yarn-client.)
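Wildcards in the hadoop.proxyuser.* properties above allow impersonation of any user from any host, which is rarely what you want in production. A sketch of a scoped configuration, assuming a hypothetical gateway host edge01.example.com and a hypothetical group analysts:

<property>
  <!-- Only requests originating from this host may impersonate -->
  <name>hadoop.proxyuser.<superuser>.hosts</name>
  <value>edge01.example.com</value>
</property>
<property>
  <!-- Only members of this group may be impersonated -->
  <name>hadoop.proxyuser.<superuser>.groups</name>
  <value>analysts</value>
</property>

After changing these properties, the NameNode and ResourceManager need to pick up the new core-site.xml (via a restart or, where supported, hdfs dfsadmin -refreshSuperUserGroupsConfiguration).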
It didn't work for me. I am getting the exception below:
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): <proxyuser> tries to renew a token with renewer <loggeduser>
Is it possible to follow the above approach in a Kerberos environment? I tried the step above to run the job as a proxy user, but it failed with a GSS initialization exception. Any pointers?
Note that even when the container runs as OS user "yarn", an environment variable, HADOOP_USER_NAME, passes the name of the account that submitted the work into that process, where it is picked up by the HDFS client: the code can then work with HDFS directories as the submitter, with the submitter's permissions. That is, as you may have guessed, completely insecure and open to abuse; to close that hole you need to make the leap to Kerberos, I'm afraid.
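The resolution order described above (in a simple-auth, non-Kerberos cluster) can be illustrated with a small sketch; the function name effective_hadoop_user is hypothetical and only mirrors what the HDFS client does, it is not a real Hadoop command:

```shell
# Sketch: how the effective user is chosen under simple authentication.
# HADOOP_USER_NAME, if set and non-empty, wins; otherwise the client
# falls back to the OS account the process runs as.
effective_hadoop_user() {
  if [ -n "${HADOOP_USER_NAME:-}" ]; then
    printf '%s\n' "$HADOOP_USER_NAME"
  else
    id -un
  fi
}

# A container launched as OS user "yarn" but submitted by "alice"
# would see HADOOP_USER_NAME=alice in its environment:
HADOOP_USER_NAME=alice effective_hadoop_user   # prints "alice"
```

Because anyone who can set an environment variable can claim any identity this way, the mechanism is convenience only, not security.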