
Any calculation to use number of mappers and containers?

Expert Contributor

1. To process a 5 GB file, how many mappers are required? Is there a calculation for determining the number of mappers, reducers, and containers?

2. How can the performance of distcp be improved?

1 ACCEPTED SOLUTION

Super Collaborator

@kavitha velaga

1. The number of mappers depends on the InputSplits of the file; Hadoop launches as many mappers as required. Users do not have direct control over the number of mappers via a property.

2. To control the number of mappers, you have to control the number of InputSplits, which is not necessary unless you require custom logic.

3. Users can control the number of reducers for an MR job by setting this property: job.setNumReduceTasks(numOfReducer);

numOfReducer can be 0 or any positive integer.

If you choose 0, the MR job will be a mapper-only job (no reducers means no aggregation).

There are some use cases where a reducer is not necessary, so setting numOfReducer = 0 makes the MR job finish more quickly, as the job avoids the shuffle and sort phases.

4. Container size depends on how much memory your program generally requires.

5. Distcp - the ticket https://issues.apache.org/jira/browse/HDFS-7535 improved distcp performance. To make distcp run more quickly we can disable post-copy checks such as the checksum comparison, but that trades reliability for speed.
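The split arithmetic behind points 1 and 2 can be sketched in a few lines. This is a hypothetical standalone re-implementation for illustration, not Hadoop API code; the real logic lives in FileInputFormat, where splitSize = max(minSize, min(maxSize, blockSize)) and one mapper is launched per split.

```java
// Standalone sketch (not the Hadoop API) of how the mapper count for the
// 5 GB file in the question falls out of the input split size.
public class MapperCount {

    // Mirrors FileInputFormat's rule: splitSize = max(minSize, min(maxSize, blockSize)).
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // One mapper per split, so round the file size up to whole splits.
    static long mappers(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileSize = 5 * 1024 * mb;                    // the 5 GB file
        long defaults = splitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println(mappers(fileSize, defaults));  // 40 with a 128 MB block size
        System.out.println(mappers(fileSize, 1024 * mb)); // 5 with 1 GB splits
    }
}
```

With default min/max split sizes the split size collapses to the block size, which is why a 128 MB block size yields 40 mappers for a 5 GB file, while forcing 1 GB splits yields 5.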

Hope this helps


2 REPLIES

Super Collaborator

Hi,

1) The number of mappers depends on various factors, primarily the number of splits, which is governed by mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize.

So a 5 GB file configured with a max and min split size of 1 GB will have 5 mappers. This is just an illustration.

See this for recommended values: https://community.hortonworks.com/questions/2179/recommended-config-mapreduceinputfileinputformatsp....

2) The number of containers depends on the container size. Read this for the container size calculation:

http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/

3) Distcp - read this: https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_Sys_Admin_Guides/content/ref-7dbacce5-26...
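As a rough sketch of the container arithmetic the linked post walks through: the per-node container count is bounded by cores, disks, and memory, and the per-container RAM follows from that count. The multipliers below (2 containers per core, 1.8 per disk) and the 2 GB minimum container size are illustrative assumptions; take the recommended values for your hardware from the blog post itself.

```java
// Hypothetical sketch of a per-node YARN container estimate in the spirit of
// the linked tuning guide. The constants (2 per core, 1.8 per disk, minimum
// container size) are assumptions for illustration, not authoritative values.
public class ContainerEstimate {

    static long estimateContainers(int cores, int disks, long totalRamMb, long minContainerMb) {
        long byCores = 2L * cores;                    // assumed: ~2 containers per core
        long byDisks = (long) Math.ceil(1.8 * disks); // assumed: ~1.8 containers per disk
        long byRam   = totalRamMb / minContainerMb;   // minimum-size containers that fit in RAM
        return Math.min(byCores, Math.min(byDisks, byRam));
    }

    static long ramPerContainerMb(long totalRamMb, long containers, long minContainerMb) {
        return Math.max(minContainerMb, totalRamMb / containers);
    }

    public static void main(String[] args) {
        // Example node: 12 cores, 12 disks, 48 GB RAM for YARN, 2 GB minimum container.
        long containers = estimateContainers(12, 12, 48 * 1024, 2048);
        long ramMb = ramPerContainerMb(48 * 1024, containers, 2048);
        System.out.println(containers + " containers, " + ramMb + " MB each");
    }
}
```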

Regards

Pranay Vyas
