
Any calculation to use number of mappers and containers?

Expert Contributor

1. To process a 5 GB file, how many mappers are required? Is there a calculation for determining the number of mappers, reducers, and containers?

2. How can the performance of distcp be improved?

1 ACCEPTED SOLUTION

Super Collaborator

@kavitha velaga

1. The number of mappers depends on the InputSplits of the file; Hadoop launches as many mappers as required. Users do not have direct control over the number of mappers via a property.

2. To control the number of mappers, you have to control the number of InputSplits, which is not necessary unless you require custom logic.

3. Users can control the number of reducers for an MR job by setting this property: job.setNumReduceTasks(numOfReducer);

numOfReducer can be 0 or any positive integer.

If you choose 0, the MR job will be a mapper-only job (no reducers means no aggregation).

There are some use cases where a reducer is not necessary, so setting numOfReducer = 0 makes the MR job finish more quickly, as the job avoids the shuffle and sort phases.

4. Container size depends on how much memory your program generally requires.

5. Distcp - the ticket https://issues.apache.org/jira/browse/HDFS-7535 improved distcp performance. To make distcp run more quickly we can disable post-copy checks such as the checksum comparison, but that trades reliability for speed.
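The split arithmetic behind points 1 and 2 can be sketched in a few lines. This is a hypothetical standalone re-implementation for illustration, not Hadoop API code; the real logic lives in FileInputFormat, where splitSize = max(minSize, min(maxSize, blockSize)) and one mapper is launched per split.

```java
// Standalone sketch (not the Hadoop API) of how the mapper count for the
// 5 GB file in the question falls out of the input split size.
public class MapperCount {

    // Mirrors FileInputFormat's rule: splitSize = max(minSize, min(maxSize, blockSize)).
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // One mapper per split, so round the file size up to whole splits.
    static long mappers(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 5 * 1024 * mb;                    // the 5 GB file
        long defaults = splitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println(mappers(fileSize, defaults));  // 40 with a 128 MB block size
        System.out.println(mappers(fileSize, 1024 * mb)); // 5 with 1 GB splits
    }
}
```

With default min/max split sizes the split size collapses to the block size, which is why a 128 MB block size yields 40 mappers for a 5 GB file, while forcing 1 GB splits yields 5.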

Hope this helps


2 REPLIES

Super Collaborator

Hi,

1) The number of mappers depends on various factors, primarily the number of splits, which is governed by mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.

So a 5 GB file configured with a max and min split size of 1 GB will have 5 mappers. This is just an illustration.

See this for recommended values: https://community.hortonworks.com/questions/2179/recommended-config-mapreduceinputfileinputformatsp....

2) The number of containers depends on the container size. Read this for the container size calculation:

http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

3) Distcp - read this: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_Sys_Admin_Guides/content/ref-7dbacce5-26...
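As a rough sketch of the container arithmetic the linked post walks through: the per-node container count is bounded by cores, disks, and memory, and the per-container RAM follows from that count. The multipliers below (2 containers per core, 1.8 per disk) and the 2 GB minimum container size are illustrative assumptions; take the recommended values for your hardware from the blog post itself.

```java
// Hypothetical sketch of a per-node YARN container estimate in the spirit of
// the linked tuning guide. The constants (2 per core, 1.8 per disk, minimum
// container size) are assumptions for illustration, not authoritative values.
public class ContainerEstimate {

    static long estimateContainers(int cores, int disks, long totalRamMb, long minContainerMb) {
        long byCores = 2L * cores;                    // assumed: ~2 containers per core
        long byDisks = (long) Math.ceil(1.8 * disks); // assumed: ~1.8 containers per disk
        long byRam   = totalRamMb / minContainerMb;   // minimum-size containers that fit in RAM
        return Math.min(byCores, Math.min(byDisks, byRam));
    }

    static long ramPerContainerMb(long totalRamMb, long containers, long minContainerMb) {
        return Math.max(minContainerMb, totalRamMb / containers);
    }

    public static void main(String[] args) {
        // Example node: 12 cores, 12 disks, 48 GB RAM for YARN, 2 GB minimum container.
        long containers = estimateContainers(12, 12, 48 * 1024, 2048);
        long ramMb = ramPerContainerMb(48 * 1024, containers, 2048);
        System.out.println(containers + " containers, " + ramMb + " MB each");
    }
}
```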

Regards

Pranay Vyas
