Support Questions

Find answers, ask questions, and share your expertise

Issue of container OOM when writing Dataframe to parquet files in Spark Job

New Contributor

I'm using Machine Learning Workspace in Cloudera Data Platform (CDP). I created a session with 4vCPU/16 GiB Memory and enabled Spark 3.2.0.

I'm using Spark to load one month of data (around 12 GB in total), apply some transformations, and then write the result as Parquet files to AWS S3.

My Spark session configuration looks like this:

SparkSession
.builder
.appName(appName)
.config("spark.driver.memory", "8G")
.config("spark.dynamicAllocation.enabled", "true")
.config("spark.dynamicAllocation.minExecutors", "4")
.config("spark.dynamicAllocation.maxExecutors", "20")
.config("spark.executor.cores", "4")
.config("spark.executor.memory", "8G")
.config("spark.sql.shuffle.partitions", 500)
......

Before the data are written to parquet files, they are repartitioned:

from pyspark.sql.functions import floor, rand

df.withColumn("salt", floor(rand() * 100))
.repartition("date_year", "date_month", "date_day", "salt")
.drop("salt").write.partitionBy("date_year", "date_month")
.mode("overwrite").parquet(SOME__PATH)
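The salting step spreads rows that share the same date across many shuffle partitions, so no single task (and no single executor's local disk) has to hold a skewed partition. The idea can be sketched without Spark; this is a pure-Python illustration, and the bucket and partition counts are just placeholders matching the question:

```python
import random
from collections import Counter

def partition_of(key, salt_buckets=100, num_partitions=500):
    # Append a random salt so rows with an identical key
    # fan out across many shuffle partitions instead of one.
    salt = random.randrange(salt_buckets)
    return hash((key, salt)) % num_partitions

random.seed(0)
# 10,000 rows that all share one heavily skewed key
counts = Counter(partition_of("2023-01-15") for _ in range(10_000))
# Without the salt, every row would land in the same partition;
# with it, the rows spread over many partitions.
print(len(counts))
```

Dropping the salt column before `partitionBy` keeps the output layout unchanged while still benefiting from the more even shuffle.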

The data transformation with Spark runs successfully, but the job always fails in the last step, when writing the data to Parquet files.

Below is the example of the error message:

23/01/15 21:10:59 678 ERROR TaskSchedulerImpl: Lost executor 2 on 100.100.18.155: 
The executor with id 2 exited with exit code -1(unexpected).
The API gave the following brief reason: Evicted
The API gave the following message: Pod ephemeral local storage usage exceeds the total limit of containers 10Gi. 

I think there is no problem with my Spark configuration. The problem is the Kubernetes ephemeral local storage size limit, which I do not have permission to change.
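The "Evicted" pod message suggests the executors run on Kubernetes, where shuffle and spill files are written to the pod's ephemeral storage by default and count toward the 10Gi limit. If that limit can't be raised, Spark on Kubernetes can mount a persistent volume as executor local storage instead: volumes whose name starts with spark-local-dir- are used automatically for local/shuffle data. A sketch of the extra session config; the claim name spark-shuffle-pvc and mount path are hypothetical:

```
.config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName", "spark-shuffle-pvc")
.config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path", "/data/spark-local")
.config("spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly", "false")
```

Whether this is permitted in a CDP Machine Learning Workspace depends on the cluster policies, so it may still require help from whoever administers the environment.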

Can someone explain why this happened and what a possible solution might be?

1 REPLY 1

Super Collaborator

Hello @Ryan_2002 


Thanks for engaging the Cloudera Community, and thank you for the detailed description of the problem. Your ask is valid, but reviewing it through a community post isn't the most suitable approach. Would it be feasible for you to engage Cloudera Support? That would allow our team to work with you through screen-sharing sessions and log exchanges, neither of which is possible in the Community, and would greatly expedite the review of your ask.


Regards, Smarak