I have a Spark job in Python that writes data to HDFS. Currently the job runs serially: if it has to write 50 files, it writes them one by one. I am trying to parallelize it so that all 50 files are written at the same time, using Python's multiprocessing module: I spawn 50 processes and each one writes its file to HDFS. Small files get written in parallel without issue, but once the file data gets large the job fails with the error below, due to the driver memory limit. I understand the failure is caused by the limited driver memory, but I am asking whether there is an alternate approach to meet this goal.
Container [pid=28341,containerID=container_e130_1583424819028_11130_01_000001] is running beyond physical memory limits. Current usage: 3.4 GB of 2.5 GB physical memory used; 36.1 GB of 5.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_e130_1583424819028_11130_01_000001 :