large data set processing

Running environment: 12 executors (8 GB each) with 8 cores. Roughly 20 GB of data (in Parquet) needs to be processed. Currently the first stage of each job takes 20-30 minutes, which is not acceptable. The first stage includes projection, union, and local aggregation. Each task in the first stage takes nearly 7 minutes (input size / records: 23.9 MB / 519401). Is there any better practice I can use to reduce the time per task? Thanks.
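
For reference, the figures above can be turned into a rough back-of-envelope estimate. This is only a sketch: it assumes "8 cores" means 8 cores per executor, that 20 GB means 20 × 1024 MB, and that every task reads about the same 23.9 MB. The variable names are illustrative and not taken from any Spark API.

```python
import math

# Figures quoted in the question
total_input_mb = 20 * 1024   # ~20 GB of Parquet (assumed 1 GB = 1024 MB)
task_input_mb = 23.9         # input size per first-stage task
task_records = 519_401       # records per first-stage task
task_minutes = 7             # observed duration of one task
executors = 12
cores_per_executor = 8       # assumption: 8 cores per executor, not 8 in total

# Approximate number of first-stage tasks if all tasks read ~23.9 MB
num_tasks = math.ceil(total_input_mb / task_input_mb)

# Tasks that can run at the same time, and how many rounds that implies
task_slots = executors * cores_per_executor
waves = math.ceil(num_tasks / task_slots)

# Per-task throughput, useful for judging whether a task is CPU-bound
records_per_second = task_records / (task_minutes * 60)

print(f"tasks: {num_tasks}, slots: {task_slots}, waves: {waves}")
print(f"per-task throughput: {records_per_second:.0f} records/s")
```

Under these assumptions each task processes on the order of 1,200 records per second, so the time is more likely going into per-record work (the projection, union, and aggregation) than into reading 23.9 MB of input.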