Any calculation to determine the number of mappers and containers?
Labels: Apache Hadoop
Created 05-08-2016 04:32 PM
1. To process a 5 GB file, how many mappers are required? Is there any calculation for the number of mappers, reducers, and containers?
2. How can the performance of distcp be improved?
Created 05-19-2016 09:55 AM
1. The number of mappers depends on the InputSplits of the file; Hadoop launches as many mappers as required. There is no property that directly sets the number of mappers.
2. To control the number of mappers, you have to control the number of InputSplits, which is not necessary unless you need custom logic.
3. You can control the number of reducers for an MR job by calling job.setNumReduceTasks(numOfReducer). numOfReducer can be 0 or any positive integer. If you choose 0, the MR job will be a map-only job (no reducer means no aggregation). There are use cases where a reducer is not necessary, and setting numOfReducer=0 makes the MR job finish more quickly, since it skips the shuffle and sort phases. See the sketch after this list.
4. Container size depends on how much memory your program requires in general.
5. Distcp: the ticket https://issues.apache.org/jira/browse/HDFS-7535 improved distcp performance. To make distcp run quicker, you can disable post-copy checks such as the checksum comparison, but that trades speed against reliability; see the command sketch after this list.
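A minimal Java sketch of item 3, assuming a standard MapReduce job setup (the job name is a placeholder):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "reducer-count-demo"); // placeholder name

        // 0 reducers => map-only job: mapper output is written directly,
        // and the shuffle and sort phases are skipped entirely.
        job.setNumReduceTasks(0);

        // Any positive integer enables a normal shuffle/sort/reduce phase:
        // job.setNumReduceTasks(4);
    }
}
```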
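And a hedged command-line sketch for item 5 (host names and paths are placeholders; note that -skipcrccheck is only valid together with -update):

```sh
# -m caps the number of copy mappers; -skipcrccheck skips the post-copy
# CRC comparison, trading reliability for speed. URIs below are placeholders.
hadoop distcp -m 20 -update -skipcrccheck \
  hdfs://source-nn:8020/data \
  hdfs://dest-nn:8020/data
```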
Hope this helps
Created 05-08-2016 05:57 PM
Hi,
1) The number of mappers depends on various factors, primarily the number of input splits, which is bounded by mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize. So a 5 GB file configured with a min and max split size of 1 GB will get 5 mappers. This is just an illustration; see the sketch after this list. See this for recommended values: https://community.hortonworks.com/questions/2179/recommended-config-mapreduceinputfileinputformatsp....
2) The number of containers depends on the container size; see the configuration sketch after this list. Read this for the calculation of container size: http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
3) Distcp: read this https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_Sys_Admin_Guides/content/ref-7dbacce5-26...
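A minimal Java sketch of item 1, assuming the default FileInputFormat behavior where split size = max(minSize, min(maxSize, blockSize)):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo"); // placeholder name

        // Pin both bounds to 1 GiB: split size = max(min, min(max, blockSize))
        // = 1 GiB, so a 5 GB input yields ~5 splits and hence ~5 mappers.
        long oneGiB = 1024L * 1024 * 1024;
        FileInputFormat.setMinInputSplitSize(job, oneGiB);
        FileInputFormat.setMaxInputSplitSize(job, oneGiB);
    }
}
```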
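And for item 2, a sketch of the memory settings that determine container size; the values below are illustrative, not recommendations:

```java
import org.apache.hadoop.conf.Configuration;

public class ContainerSizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Memory requested per map/reduce container, in MB. YARN rounds each
        // request up to a multiple of yarn.scheduler.minimum-allocation-mb.
        conf.set("mapreduce.map.memory.mb", "2048");
        conf.set("mapreduce.reduce.memory.mb", "4096");

        // JVM heap is conventionally ~80% of the container size, leaving
        // headroom for non-heap memory.
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");

        // Roughly: concurrent containers per node ≈
        //   yarn.nodemanager.resource.memory-mb / per-container memory,
        // e.g. 40960 MB of NodeManager memory / 2048 MB maps ≈ 20 containers.
    }
}
```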
Regards
Pranay Vyas
