
Python Spark job gives error -running beyond memory limits


Hi,

 

I have a Spark job in Python that writes data to HDFS. Currently the job runs serially: if it has to write 50 files, it writes them one by one. I am trying to parallelize it so that all the files are written at the same time, but I get the error below. My approach uses Python's multiprocessing module: I spawn 50 processes and each one writes to HDFS. Small files are written in parallel without problems, but when the file data gets large the job fails. I understand this is because of the limited driver memory, but I want to check whether there is an alternative approach to achieve this goal.
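For reference, here is a simplified sketch of the kind of parallel write I am attempting. To keep it self-contained it uses a thread pool from the multiprocessing module instead of 50 separate processes, so all workers share the single driver SparkSession; the paths and the list of targets are placeholders, not the real job.

# Simplified sketch (placeholder paths): write many outputs to HDFS in parallel
# by submitting each write from its own thread in the driver.
from multiprocessing.pool import ThreadPool

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-hdfs-writes").getOrCreate()

# Placeholder: the real job has about 50 (source, destination) pairs.
targets = [
    ("/data/source/part_{}".format(i), "/data/output/part_{}".format(i))
    for i in range(50)
]

def write_one(src_dst):
    src, dst = src_dst
    # Each call submits an independent Spark job; the jobs run concurrently
    # because they are submitted from different threads of the same driver.
    df = spark.read.parquet(src)
    df.write.mode("overwrite").parquet(dst)
    return dst

# A bounded pool size limits how many writes run at once instead of
# launching all 50 at the same time and overloading the driver.
pool = ThreadPool(8)
try:
    for finished in pool.imap_unordered(write_one, targets):
        print("finished", finished)
finally:
    pool.close()
    pool.join()

Even with this approach, all the concurrent jobs still share the one driver, so the pool size (8 above rather than 50) is what keeps the driver's workload bounded.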

 

Error details

Container [pid=28341,containerID=container_e130_1583424819028_11130_01_000001] is running beyond physical memory limits. Current usage: 3.4 GB of 2.5 GB physical memory used; 36.1 GB of 5.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_e130_1583424819028_11130_01_000001 :

 

 
