
Run Pyspark application inside Spark application


Due to a specific requirement, I am trying to run a Python application from inside a Scala application.

Here is sample code for both applications.

Parent application (Scala):

import scala.sys.process._
import java.io.File
import java.nio.file.{Path => JavaPath}

import org.apache.commons.io.FileUtils
import org.apache.spark.SparkFiles
import org.apache.spark.sql.SparkSession

class SparkApplication(spark: SparkSession) {

  // Scratch directory of the YARN container this driver runs in.
  val containerPath: JavaPath = new File(SparkFiles.getRootDirectory()).toPath.toAbsolutePath

  // spark.py and the keytab are bundled as classpath resources and
  // copied out into the container before the child is launched.
  val pyFile: File = moveToContainer("spark.py")
  val keytab: File = moveToContainer("my_username.keytab")

  def run(): Unit = {
    val command = List(
      "spark-submit",
      s"--keytab $keytab",
      "--principal my_username@DOMAIN.RU",
      "--master yarn",
      "--deploy-mode cluster",
      s"$pyFile"
    ).mkString(" ")

    // Run the child spark-submit and wait for it to finish;
    // !! throws an exception on a non-zero exit code.
    command.!!
  }

  def moveToContainer(fileName: String): File = {
    // Note: a resource name without a leading slash is resolved
    // relative to this class's package.
    val fileSourceStream = getClass.getResourceAsStream(fileName)
    val file = containerPath.resolve(fileName).toFile
    FileUtils.copyInputStreamToFile(fileSourceStream, file)
    file
  }
}
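
For completeness, the parent class is launched from a main entry point along these lines (a simplified sketch; the object name and app name here are illustrative):

object ParentApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ScalaParentApp")
      .enableHiveSupport()
      .getOrCreate()

    new SparkApplication(spark).run()
    spark.stop()
  }
}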

Child application (spark.py):

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = (SparkSession.builder
             .appName("PysparkSubProcess")
             .enableHiveSupport()
             .getOrCreate())

    # Sanity check: a simple action that needs no Hive access.
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5, 6])
    print("RDD count")
    print(rdd.count())

    # This is the step that fails when Hive is not reachable.
    df = spark.sql("select 1 as col1")
    df.write.format("parquet").saveAsTable("default.temp_pyspark_table")

    spark.stop()


So, as you can see above, I can successfully run a Scala application in cluster mode, and that Scala application launches a Python application (spark.py) from inside itself. Kerberos is configured on our cluster, which is why I use my keytab file for authentication.

But that's not enough: while the Scala application has access to Hive, Hive remains unavailable to the Python application. That is, I can't save my test table from spark.py.

 

And the question, I suppose, is: is there any way to reuse the Scala application's authentication for the Python application, so that I don't have to worry about the keytab file and the Hive configuration for the Python application?

I have heard that there are delegation tokens that are created and stored on the driver. Is it possible to reuse such tokens (for example via --conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION)? Or maybe there are workarounds?
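
To make the idea concrete, here is roughly what I imagine for run() (an untested sketch; it assumes YARN exposes the driver container's token file through the HADOOP_TOKEN_FILE_LOCATION environment variable and that the child process inherits this environment):

  def runWithTokens(): Unit = {
    // Assumption: YARN placed this driver container's delegation token
    // file at the path named by HADOOP_TOKEN_FILE_LOCATION.
    val tokenFile = sys.env.getOrElse("HADOOP_TOKEN_FILE_LOCATION",
      sys.error("HADOOP_TOKEN_FILE_LOCATION is not set in this container"))

    val command = List(
      "spark-submit",
      "--master yarn",
      "--deploy-mode cluster",
      // Instead of --keytab/--principal, point the child's executors at
      // the same token file (the part I am unsure about).
      s"--conf spark.executorEnv.HADOOP_TOKEN_FILE_LOCATION=$tokenFile",
      s"$pyFile"
    ).mkString(" ")

    // The child process inherits this JVM's environment, so spark-submit
    // itself also sees HADOOP_TOKEN_FILE_LOCATION.
    command.!!
  }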


I still haven't been able to solve this problem.
