Posted 11-21-2019 05:24 AM
When submitting a Spark job to a cluster configured with Kerberos authentication, we pass the `--principal` and `--keytab` parameters to `spark-submit`, which uses them to authenticate and obtain a ticket for secure access to the Hadoop cluster.

To debug authentication issues, one would normally pass the JVM option `-Dsun.security.krb5.debug=true` to the `java` process to get verbose logging from the Kerberos libraries. How can such an option be passed to `spark-submit` itself, so that we can debug problems with authentication?

N.B. It is not sufficient to set the flag in `spark.driver.extraJavaOptions` or `spark.executor.extraJavaOptions`: with `--deploy-mode cluster`, the only code that runs on the client machine is `spark-submit` itself, and if authentication against the secure cluster fails, the driver and executors on the cluster are never even started.
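For illustration, a minimal sketch of one candidate approach (an assumption, not a confirmed answer): the Spark launcher scripts appear to honor a `SPARK_SUBMIT_OPTS` environment variable when building the JVM command line for `spark-submit` itself, which would apply the flag to the client-side authentication code rather than to the driver or executors. The principal, keytab path, class, and jar below are placeholders:

```bash
# Assumption: SPARK_SUBMIT_OPTS is added to the JVM options of the
# spark-submit process itself, so the Kerberos debug output covers the
# client-side authentication, independent of --deploy-mode.
export SPARK_SUBMIT_OPTS="-Dsun.security.krb5.debug=true"

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal user@EXAMPLE.COM \
  --keytab /path/to/user.keytab \
  --class com.example.MyApp \
  my-app.jar
```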
Posted 08-20-2019 12:05 PM
I would like to programmatically/dynamically configure the parameters (from `*-site.xml`) used to submit Spark jobs with `SparkLauncher`, so that the connection can be configured from an `org.apache.hadoop.conf.Configuration` object rather than from a directory on disk. (I have a library/API that creates `Configuration` objects which are not simply built from the local filesystem, but this could also be useful for testing.) I'd like something that reads like:

```scala
val hadoopConfiguration: org.apache.hadoop.conf.Configuration = getHadoopConfiguration()
val spark = new SparkLauncher()
spark.setConfiguration(hadoopConfiguration) // desired API; does not currently exist
```

In particular, it should set the hostnames etc. for connecting to the YARN ResourceManager, which usually come from properties like:

```xml
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.client.failover-proxy-provider</name>
  <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rmA,rmB</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>XX1:2181,XX2:2181,XX3:2181,XX4:2181,XX5:2181</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rmA</name>
  <value>XX2</value>
</property>
<property>
  <name>yarn.resourcemanager.address.rmA</name>
  <value>XX2:8032</value>
</property>
```

This works fine for HBase connections, with code like:

```scala
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(hadoopConfiguration)
```

but it appears this may not be possible for `SparkLauncher`, due to https://github.com/apache/spark/blob/05168e725d2a17c4164ee5f9aa068801ec2454f4/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L294-L300, which fails with "When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment."

Is there a workaround for this, or are there plans to support it? The closest question I have found is https://community.cloudera.com/t5/Support-Questions/How-to-add-the-hadoop-and-yarn-configuration-file-to-the/m-p/126805.
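One conceivable workaround, as a sketch under stated assumptions rather than a confirmed solution: `SparkLauncher` has a constructor that takes a `Map` of environment-variable overrides for the child `spark-submit` process, and `Configuration` has a public `writeXml` method. So the in-memory `Configuration` could be serialized to `*-site.xml` files in a temporary directory and `HADOOP_CONF_DIR`/`YARN_CONF_DIR` pointed at it. Here `getHadoopConfiguration()` is a stand-in for my own library call; everything else uses only public Hadoop/Spark APIs:

```scala
import java.io.{File, FileOutputStream}
import java.nio.file.Files

import org.apache.hadoop.conf.Configuration
import org.apache.spark.launcher.SparkLauncher

// Stand-in for the library/API mentioned above that builds a Configuration
// without reading it from the local filesystem.
def getHadoopConfiguration(): Configuration = ???

// Serialize a Configuration to a site XML file via Configuration.writeXml.
def writeSiteXml(conf: Configuration, file: File): Unit = {
  val out = new FileOutputStream(file)
  try conf.writeXml(out) finally out.close()
}

val hadoopConfiguration = getHadoopConfiguration()

// Materialize the in-memory configuration into a temp directory. Writing the
// same content to both files guards against values being shadowed when
// YarnConfiguration loads yarn-default.xml after core-site.xml.
val confDir = Files.createTempDirectory("hadoop-conf").toFile
writeSiteXml(hadoopConfiguration, new File(confDir, "core-site.xml"))
writeSiteXml(hadoopConfiguration, new File(confDir, "yarn-site.xml"))

// SparkLauncher(env) overrides environment variables for the child
// spark-submit process, which also satisfies the HADOOP_CONF_DIR /
// YARN_CONF_DIR check in SparkSubmitArguments.
val env = new java.util.HashMap[String, String]()
env.put("HADOOP_CONF_DIR", confDir.getAbsolutePath)
env.put("YARN_CONF_DIR", confDir.getAbsolutePath)

val spark = new SparkLauncher(env)
  .setMaster("yarn")
  .setDeployMode("cluster")
```

This obviously round-trips through the filesystem, which defeats part of the purpose, but it would at least let the `Configuration` be constructed programmatically rather than read from a fixed directory.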