I am facing a scenario where I receive 2,500 files (in parquet format) with the same structure on a daily basis. I have to process all of these files in parallel in PySpark. What is the best approach to make sure that all the files are processed in parallel?
By default, your Spark job will spawn roughly one task per file, so it will already be highly parallel.

However, this can also be inefficient: every task carries scheduling overhead, and the cluster may not have enough executors to run 2,500 tasks (or 2500×X tasks, where X is the number of days) at the same time. Two ways to reduce the task count:
1. Try writing a CombineParquetFileInputFormat (see http://bytepadding.com/big-data/spark/combineparquetfileinputformat/) so that one task can read multiple files located on the same host or rack.
2. Run a merge (compaction) job that rewrites the small files into a few larger ones before the main job reads them.