
What are the best parameters for tuning Spark when the number of unique keys generated after the map step is in the billions?

What are the best parameters for tuning Spark when the number of unique keys generated after the map function is in the billions? I have a 5-node cluster where each node has an 8-core i7 processor and 8 GB of RAM. My input data is 10.2 GB. After the map function runs, it generates about 40 GB of intermediate data containing roughly 45 million unique keys. I basically need to count the number of occurrences of every unique key.
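For reference, a minimal sketch of the counting step in Scala, assuming the input is a text file and that some map step (`extractKeys` here is a placeholder) produces the keys; the configuration values are illustrative guesses sized for a 5-node, 8-core, 8 GB cluster, not recommendations from Spark documentation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object KeyCounter {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("KeyCounter")
      // Illustrative settings for nodes with 8 cores and 8 GB RAM:
      .set("spark.executor.memory", "6g")   // leave headroom for the OS
      .set("spark.executor.cores", "8")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.shuffle.compress", "true")

    val sc = new SparkContext(conf)

    // ~40 GB of intermediate data / ~128 MB per partition ≈ 300+ partitions.
    val numPartitions = 320

    val keys = sc.textFile(args(0))
      .flatMap(extractKeys)   // placeholder for the actual map function

    // reduceByKey combines counts map-side before the shuffle, so only one
    // (key, partial count) pair per key per partition crosses the network.
    // groupByKey would ship every occurrence and likely exhaust memory here.
    val counts = keys
      .map(k => (k, 1L))
      .reduceByKey(_ + _, numPartitions)

    counts.saveAsTextFile(args(1))
    sc.stop()
  }

  // Hypothetical stand-in for whatever the real map function emits.
  def extractKeys(line: String): Seq[String] = line.split("\\s+").toSeq
}
```

The key design point is that per-key counting is associative, so the map-side combine in reduceByKey keeps the shuffle volume proportional to the number of unique keys per partition rather than the full 40 GB of intermediate records.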
