I'm using a Machine Learning Workspace in Cloudera Data Platform (CDP). I created a session with 4 vCPU / 16 GiB memory and enabled Spark 3.2.0.
I'm using Spark to load one month of data (around 12 GB in total), apply some transformations, and then write the result as Parquet files to AWS S3.
My Spark session configuration looks like this:
SparkSession
.builder
.appName(appName)
.config("spark.driver.memory", "8G")
.config("spark.dynamicAllocation.enabled", "true")
.config("spark.dynamicAllocation.minExecutors", "4")
.config("spark.dynamicAllocation.maxExecutors", "20")
.config("spark.executor.cores", "4")
.config("spark.executor.memory", "8G")
.config("spark.sql.shuffle.partitions", 500)
......
Before the data is written to Parquet files, it is repartitioned with a random salt column:
df.withColumn("salt", floor(rand() * 100)) // floor and rand from org.apache.spark.sql.functions
.repartition("date_year", "date_month", "date_day", "salt")
.drop("salt").write.partitionBy("date_year", "date_month")
.mode("overwrite").parquet(SOME__PATH)
The Spark transformations run successfully, but the job always fails in the last step, when writing the data to Parquet files.
Below is an example of the error message:
23/01/15 21:10:59 678 ERROR TaskSchedulerImpl: Lost executor 2 on 100.100.18.155:
The executor with id 2 exited with exit code -1(unexpected).
The API gave the following brief reason: Evicted
The API gave the following message: Pod ephemeral local storage usage exceeds the total limit of containers 10Gi.
I don't think the problem is in my Spark configuration. The problem seems to be the Kubernetes ephemeral local storage limit, which I do not have permission to change.
Can someone explain why this happens and what the possible solutions are?
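For context on where the local storage is likely going: the repartition before the write triggers a shuffle, and Spark writes shuffle and spill files to its local directories, which on Kubernetes default to an emptyDir volume that counts against the pod's ephemeral storage. One direction I am aware of from the Spark on Kubernetes documentation, but have not tested in CDP, is to mount a separate volume whose name starts with spark-local-dir- so that Spark uses it for local storage instead. The snippet below is only a sketch under that assumption: the app name, storage class, size limit and mount path are hypothetical placeholders, and I do not know whether the workspace allows these settings.

```scala
import org.apache.spark.sql.SparkSession

// Prefix for an executor volume; Spark on Kubernetes treats volumes whose
// name starts with "spark-local-dir-" as local (shuffle/spill) storage.
val localDirVolume =
  "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1"

val spark = SparkSession.builder
  .appName("local-dir-sketch") // hypothetical app name
  // "OnDemand" asks Spark to create one PVC per executor (Spark 3.1+).
  .config(s"$localDirVolume.options.claimName", "OnDemand")
  // Hypothetical values: adjust to whatever the cluster actually offers.
  .config(s"$localDirVolume.options.storageClass", "gp2")
  .config(s"$localDirVolume.options.sizeLimit", "50Gi")
  .config(s"$localDirVolume.mount.path", "/data/spark-local")
  .config(s"$localDirVolume.mount.readOnly", "false")
  .getOrCreate()
```

If this is not possible in a managed workspace, I would also be interested in ways to reduce the amount of shuffle data each executor holds at once.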