New Contributor
Posts: 1
Registered: 11-21-2017

Large data set processing

Running environment: 12 executors (8 GB each) with 8 cores. Roughly 20 GB of data (in Parquet) needs to be processed. Currently the first stage of each job takes 20-30 minutes, which is not acceptable. The first stage includes a projection, a union, and a local aggregate. Each task in the first stage takes nearly 7 minutes (input size: 23.9 MB / 519,401 records). Is there any better practice I can use to reduce the time per task? Thanks
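For context, the environment described above could be launched roughly as follows. This is only a hypothetical sketch, not the poster's actual command: the application script name (`app.py`) and the shuffle-partition value are placeholders I am assuming, while `--num-executors`, `--executor-memory`, and `--executor-cores` mirror the numbers given in the question.

```shell
# Hypothetical spark-submit matching the setup described in the post
# (12 executors x 8 GB x 8 cores = 96 cores total).
# "app.py" and the shuffle-partition count are placeholders, not from the post.
spark-submit \
  --num-executors 12 \
  --executor-memory 8g \
  --executor-cores 8 \
  --conf spark.sql.shuffle.partitions=192 \
  app.py
```

With 96 cores available, a task spending ~7 minutes on only ~24 MB of input suggests the per-task work (e.g., the aggregate or decompression) rather than raw I/O may dominate; checking task-level metrics and data skew in the Spark UI, or repartitioning the input, are common starting points rather than guaranteed fixes.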