Created 05-08-2016 04:32 PM
1. To process a 5 GB file, how many mappers are required? Is there any calculation for the number of mappers, reducers, and containers?
2. How can the performance of distcp be improved?
Created 05-19-2016 09:55 AM
1. The number of mappers depends on the InputSplits of the file; Hadoop launches as many mappers as required. The user does not have direct control over the number of mappers via a property.
2. To control the number of mappers, the user has to control the number of InputSplits, which is not necessary unless there is a requirement for custom logic.
3. The user can control the number of reducers for an MR job by setting this property: job.setNumReduceTasks(numOfReducer);
numOfReducer can be 0 or any positive integer.
If you choose 0, the MR job will be a mapper-only job (no reducer means no aggregation).
There are some use cases where a reducer is not necessary, so setting numOfReducer=0 makes the MR job finish more quickly (the job avoids the shuffle and sort phases).
4. Container size depends on how much memory your program requires in general.
5. Distcp - This ticket https://issues.apache.org/jira/browse/HDFS-7535 has improved distcp performance. To make distcp run faster, we can disable post-copy checks such as the checksum comparison, but then we trade off reliability.
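A sketch of what tuning distcp for speed might look like on the command line. The paths, mapper count, and bandwidth cap are hypothetical examples; only the option names are standard distcp flags (note that -skipcrccheck requires -update):

```shell
# Hypothetical source/destination paths; tune -m and -bandwidth to your cluster.
hadoop distcp \
  -m 20 \
  -bandwidth 100 \
  -update \
  -skipcrccheck \
  hdfs://source-nn:8020/data hdfs://dest-nn:8020/data
```

-m raises copy parallelism (more map tasks), -bandwidth caps per-map throughput in MB/s so the copy does not starve other jobs, and -skipcrccheck skips the post-copy checksum comparison, which is the reliability trade-off mentioned above.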
Hope this helps
Created 05-08-2016 05:57 PM
Hi,
1) The number of mappers depends on various factors, primarily the number of splits, which is governed by mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.
So a 5 GB file configured with a max split size and min split size of 1 GB will have 5 mappers. This is just an illustration.
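The arithmetic behind that illustration can be sketched in plain Java. This assumes FileInputFormat's usual rule, splitSize = max(minSize, min(maxSize, blockSize)), and the class and method names here are illustrative, not Hadoop APIs:

```java
// Sketch: deriving the mapper count from split sizes and the block size,
// assuming FileInputFormat's rule: splitSize = max(minSize, min(maxSize, blockSize)).
public class SplitCount {
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static long numSplits(long fileSize, long splitSize) {
        // Ceiling division: the last split may be smaller than splitSize.
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long GB = 1024L * 1024 * 1024;
        long MB = 1024L * 1024;
        // 5 GB file, 128 MB block size, default min/max split sizes -> 40 mappers
        System.out.println(numSplits(5 * GB, splitSize(1, Long.MAX_VALUE, 128 * MB)));
        // Same file with min and max split size forced to 1 GB -> 5 mappers
        System.out.println(numSplits(5 * GB, splitSize(1 * GB, 1 * GB, 128 * MB)));
    }
}
```

So with the cluster's default 128 MB blocks you would get 40 mappers for the same file; forcing the split size up to 1 GB brings it down to 5.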
See this for Recommended values -> https://community.hortonworks.com/questions/2179/recommended-config-mapreduceinputfileinputformatsp....
2) The number of containers depends on the container size. Read this for the container-size calculation:
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
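Roughly, the linked blog post sizes containers from cores, disks, and available RAM. A minimal sketch of that calculation, assuming its formula containers = min(2 × cores, 1.8 × disks, availableRAM / minContainerSize); the node specs below are hypothetical:

```java
// Rough sketch of the container-count calculation described in the linked
// YARN tuning blog post. All hardware numbers here are made-up examples.
public class ContainerCalc {
    static int numContainers(int cores, int disks, double availRamGB, double minContainerGB) {
        return (int) Math.min(2.0 * cores,
                Math.min(1.8 * disks, availRamGB / minContainerGB));
    }

    static double ramPerContainerGB(double availRamGB, int containers, double minContainerGB) {
        return Math.max(minContainerGB, availRamGB / containers);
    }

    public static void main(String[] args) {
        // Example node: 12 cores, 12 disks, 48 GB RAM with ~6 GB reserved
        // for the OS and daemons, minimum container size of 2 GB.
        int containers = numContainers(12, 12, 42.0, 2.0);
        System.out.println(containers);                          // container count
        System.out.println(ramPerContainerGB(42.0, containers, 2.0)); // GB per container
    }
}
```

The resulting values feed yarn.nodemanager.resource.memory-mb and the scheduler's minimum/maximum allocation settings; see the blog post for the full mapping.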
3) Distcp - read this https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_Sys_Admin_Guides/content/ref-7dbacce5-26...
Regards
Pranay Vyas