
How to process a large number of files of the same structure in Spark in parallel


Hi,

I am facing a scenario where I receive 2500 files (in Parquet format) with the same structure on a daily basis. I have to process all of these files in parallel in PySpark. What is the best approach to make sure all the files are processed in parallel?

1 REPLY

Re: How to process a large number of files of the same structure in Spark in parallel

By default, your Spark job will spawn one task per file, so the read is already highly parallel.
This can also be inefficient: each task takes time to spawn, and there may not be enough executor slots to run 2500 tasks in parallel (or 2500 * X tasks, where X is the number of days). A minimal read over the whole directory looks like the sketch below.
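A minimal PySpark sketch of the default behaviour, assuming the daily files land under a date-partitioned path such as /data/incoming/2019-01-01/ (the path is hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-daily-parquet").getOrCreate()

    # A single read over the whole directory; Spark plans the file scans itself
    # and runs them in parallel across the available executor cores.
    df = spark.read.parquet("/data/incoming/2019-01-01/")

    # Roughly one task per file (or per file split) in the scan stage.
    print(df.rdd.getNumPartitions())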

Various approaches:
1. Write a CombineParquetFileInputFormat (see http://bytepadding.com/big-data/spark/combineparquetfileinputformat/) so that one task can read multiple files located on the same host or rack.
2. Run a merge job before reading the files (see the sketch after this list).
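A hedged sketch of approach 2: a small compaction job that merges the 2500 small files into a handful of larger Parquet files before the main job reads them. The paths and the target file count (50) are assumptions, not fixed values.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-compaction").getOrCreate()

    # Read every small file for the day in one pass.
    daily = spark.read.parquet("/data/incoming/2019-01-01/")

    # coalesce() reduces the number of output files without a full shuffle;
    # use repartition() instead if the data is skewed and needs rebalancing.
    daily.coalesce(50).write.mode("overwrite").parquet("/data/merged/2019-01-01/")

The downstream job then reads the merged path instead of the original 2500 files, so the scan stage spawns far fewer tasks.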
