
Run Pyspark application inside Spark application


Due to a specific need, I am solving the problem of running a Python application inside a Scala application.

Here is sample code for both applications.


Parent application (Scala):


import java.io.File
import java.nio.file.{Path => JavaPath}
import scala.sys.process._

import org.apache.commons.io.FileUtils
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

class SparkApplication(spark: SparkSession) {

  def run(): Unit = {
    val command =
      "spark-submit"          ::
      s"--keytab $keytab"     ::
      "--principal my_username@DOMAIN.RU" ::
      "--master yarn"         ::
      "--deploy-mode cluster" ::
      s"$pyFile" :: Nil mkString " "

    // Launch the child PySpark application as a separate OS process.
    command.!
  }

  // Working directory of the YARN container the driver runs in.
  val containerPath: JavaPath = new File(SparkFiles.getRootDirectory()).toPath.toAbsolutePath

  val pyFile: File = moveToContainer("") // Python script name omitted here
  val keytab: File = moveToContainer("my_username.keytab")

  // Copy a resource bundled with the application into the container's
  // working directory so spark-submit can reach it by a local path.
  def moveToContainer(fileName: String): File = {
    val fileSourceStream = getClass.getResourceAsStream(fileName)
    val file = containerPath.resolve(fileName).toFile
    FileUtils.copyInputStreamToFile(fileSourceStream, file)
    file
  }
}


Child application (Python):


from pyspark.sql import SparkSession

if __name__ == "__main__":

    spark = SparkSession.builder.appName('PysparkSubProcess').enableHiveSupport().getOrCreate()

    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])
    print("RDD count:", rdd.count())

    df = spark.sql("select 1 as col1")
    # Saving df to a Hive table is the step that fails (see below):
    # the child application has no Kerberos credentials of its own.



So, as you can see above, I am successfully running a Scala application in cluster mode. This Scala application launches a Python application from inside itself (the spark-submit call in run() above). Kerberos is also configured on our cluster, which is why I use my keytab file for authentication.


But that's not enough. While the Scala application has access to Hive, Hive remains unavailable to the Python application. That is, I can't save my test table to Hive from the Python child application.
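For illustration, here is a minimal sketch of the kind of write that succeeds from the parent Scala application but not from the child (the table name is made up for the example):

// Issued from the parent Scala application, where valid Kerberos
// credentials are available, this Hive write works.
// "test_schema.test_table" is a hypothetical name.
val df = spark.sql("select 1 as col1")
df.write.mode("overwrite").saveAsTable("test_schema.test_table")

The equivalent write from the child Python application fails, because the child has no credentials of its own.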


So the question, essentially, is: is there any way to reuse the Scala application's authorization for the Python application, so that I don't have to worry about the keytab file and the Hive configuration for the Python application?


I have heard that there are delegation tokens that are created and stored on the driver. Is it possible to reuse such tokens (for example, via --conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION)? Or maybe there are workarounds?
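In case it helps frame the question, here is a rough sketch of what I imagine, assuming the driver's delegation tokens sit at the path given by the HADOOP_TOKEN_FILE_LOCATION environment variable of its YARN container (which Hadoop's UserGroupInformation reads). Whether spark-submit would actually accept the tokens this way is exactly what I am unsure about:

import java.io.File
import scala.sys.process._

object TokenForwarding {

  // Sketch only: launch the child spark-submit with the parent's delegation
  // token file in its environment, instead of passing --keytab/--principal.
  def submitWithParentTokens(pyFile: File): Int = {
    // Path to the delegation token file of the parent application's container.
    val tokenFile = sys.env.getOrElse(
      "HADOOP_TOKEN_FILE_LOCATION",
      sys.error("no delegation token file found in this container"))

    val command = Seq(
      "spark-submit",
      "--master", "yarn",
      "--deploy-mode", "cluster",
      pyFile.getAbsolutePath)

    // Forward the token file through the child process's environment.
    Process(command, None, "HADOOP_TOKEN_FILE_LOCATION" -> tokenFile).!
  }
}

Even if this worked, I understand delegation tokens expire after their renewal lifetime, so it would not fully replace the keytab for long-running jobs.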


I still haven't been able to solve this problem.
