How to process a large number of files with the same structure in Spark in parallel


I receive 2500 files (in Parquet format) with the same structure every day, and I have to process all of them in parallel in PySpark. What is the best approach to make sure that all the files are processed in parallel?


By default, your Spark job will spawn one task for each file, so it will already be highly parallel.
It can also be inefficient, though: every task takes time to spawn, and if there are not enough executors to run 2500 tasks in parallel (or 2500X tasks, where X is the number of days), the tasks run in waves and the per-task overhead adds up.

Various Approaches
1. Try writing a CombineParquetFileInputFormat (a custom input format built on Hadoop's CombineFileInputFormat) so that one task can read multiple files located on the same host or rack.
2. Run a merge job before reading the files.
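
A minimal sketch of approach 2 in PySpark, under some assumptions: the input and output paths passed to `merge_parquet` are up to you, the 128 MB target file size is an arbitrary choice rather than a recommendation from this thread, and `target_file_count` and `merge_parquet` are illustrative helper names, not part of any Spark API.

```python
import math


def target_file_count(total_bytes, target_file_bytes=128 * 1024 * 1024):
    """Return how many output files to write so each is roughly target_file_bytes.

    The 128 MB default is an assumption; tune it for your cluster.
    """
    return max(1, math.ceil(total_bytes / target_file_bytes))


def merge_parquet(spark, input_path, output_path, num_files):
    """Read every Parquet file under input_path and rewrite it as num_files files.

    spark is an existing SparkSession; both paths are hypothetical placeholders.
    """
    df = spark.read.parquet(input_path)
    # coalesce() lowers the partition count without a full shuffle,
    # so each write task produces one larger output file.
    df.coalesce(num_files).write.mode("overwrite").parquet(output_path)
```

For example, if each of the 2500 daily files were about 4 MB (a made-up size), `target_file_count(2500 * 4 * 1024 * 1024)` would suggest 79 output files, and the downstream job would then read those 79 files instead of 2500. If the merge step itself needs more parallelism than the output file count allows, `repartition(num_files)` can be used in place of `coalesce(num_files)` at the cost of a shuffle.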
