I am submitting a Spark job from a Linux VM (with a single-node HDP installed on it) to a separate, remote HDP cluster.
The Spark documentation says:
"Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application’s configuration (driver, executors, and the AM when running in client mode)."
My question is: when I submit a job from a machine outside the cluster, I obviously need to provide the address of the machine YARN is running on, and so on. But what does it mean that the directory must "contain the (client side) configuration files"?
Should I set HADOOP_CONF_DIR to a local directory (on the VM) containing a local copy of yarn-site.xml (the client-side files)? That doesn't seem to make sense, since I want to submit the job to the remote cluster.
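For context, this is roughly how I am trying to submit the job; the paths and the example class are just placeholders from my setup, not anything the docs prescribe:

```shell
# Guess: point HADOOP_CONF_DIR at a local directory that holds copies of the
# remote cluster's client configs (core-site.xml, yarn-site.xml, ...)
export HADOOP_CONF_DIR=/etc/hadoop/conf

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /usr/hdp/current/spark-client/lib/spark-examples.jar 10
```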
I cannot find any more information about which files, with what content, should be in the directory that HADOOP_CONF_DIR or YARN_CONF_DIR points to.
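My current guess is that yarn-site.xml in that directory should at least point the client at the remote ResourceManager (and presumably core-site.xml would set fs.defaultFS to the remote HDFS), something like the sketch below. The hostname is made up; the property names are the standard YARN ones:

```xml
<!-- yarn-site.xml on the client VM; rm.remote-cluster.example.com is hypothetical -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>rm.remote-cluster.example.com</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>rm.remote-cluster.example.com:8032</value>
  </property>
</configuration>
```

Is this the idea, or are these files supposed to be copied wholesale from the remote cluster's own configuration directory?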