Cloudera Community

Solved

balavignesh_nag

Guru

On what basis is a key-value pair generated?

How do I decide the number of mappers and the associated reducers?

1 ACCEPTED SOLUTION

Contributor

On what basis is a key-value pair generated?

It depends on the data set and the required output.

In general, key-value pairs have to be specified in four places: map input, map output, reduce input, and reduce output.

Map-Input:

By default, the key is the byte offset of the line within the file and the value is the content of the line, as Text.

We can change this by using a custom input format.
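To make the default map input concrete, here is a minimal plain-Python sketch (not Hadoop code) of how a line-oriented input turns into (offset, line) pairs. The function name `text_input_records` and the sample `department,employee,salary` data are made up for illustration.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the offset of the first byte of the line;
        # the value is the line without its trailing newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


# Hypothetical input: department,employee,salary
sample = b"sales,alice,5000\nsales,bob,7000\nhr,carol,6500\n"
records = list(text_input_records(sample))
for key, value in records:
    print(key, value)
```

The keys here come out as 0, 17 and 32, which are byte positions rather than line numbers. That is why most map functions ignore the input key and only parse the value.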

Map-Output:

The basic responsibility of the map is to filter the data and prepare it for grouping by key.

Key: the field, text, or object on which the data has to be grouped and aggregated on the reducer side.

For example, if you want to find the maximum salary for each department,

all the values of the same department have to be grouped and sent to the same reduce call. So department_name or department_id can be selected as the key.

Value: the fields, text, or object to be handled within each individual reduce call.

In the above example, we have to find the maximum salary for each department.

In each reduce call, all the salaries related to a specific key are available as an Iterable. We can find the maximum of those values because all of them belong to one department.
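The maximum-salary example can be sketched end to end in plain Python. This is only a simulation of the map, shuffle, and reduce phases, not the Hadoop API; the record layout `department,employee,salary` and all function names are assumptions for illustration.

```python
from collections import defaultdict


def map_record(offset, line):
    """Map: emit (department, salary); the input offset key is ignored."""
    department, _name, salary = line.split(",")
    yield department, int(salary)


def shuffle(pairs):
    """Group all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_max(department, salaries):
    """Reduce: called once per key with an iterable of that key's values."""
    yield department, max(salaries)


# Hypothetical map input: (byte offset, line) pairs.
lines = [(0, "sales,alice,5000"), (17, "sales,bob,7000"), (32, "hr,carol,6500")]
map_output = [pair for offset, line in lines for pair in map_record(offset, line)]
reduce_output = [
    result
    for department, salaries in sorted(shuffle(map_output).items())
    for result in reduce_max(department, iter(salaries))
]
print(reduce_output)
```

The department is the map output key because it is the thing being grouped on. The salary is the value because it is the only field the reduce step needs.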

Reduce-Input: It is the same as Map-Output, because the output of map is the input for reduce.

Reduce-Output:

These key-value pairs depend on the required output.

In our example, if the required output looks like: department name salary

then the reducer's output key can be its input key, because that is the department name, and the value is the salary calculated in the reduce logic.

If the required output looks like: department name - salary

then the key can be null and the value can be the concatenation department name + "-" + salary.
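The two output layouts can be compared with a small plain-Python sketch. It assumes a text-style output that writes key, separator, value on one line, and writes only the value when there is no key. The tab separator and the helper name `format_output_line` are assumptions of this sketch, not something stated in the answer.

```python
def format_output_line(key, value, separator="\t"):
    """Render one reduce output record as a line of text.

    Assumption: this mimics a text output format that writes
    key + separator + value, and only the value when the key is None.
    """
    if key is None:
        return str(value)
    return f"{key}{separator}{value}"


department, max_salary = "sales", 7000

# Layout 1: keep the department as the output key and the salary as the value.
line_with_key = format_output_line(department, max_salary)

# Layout 2: emit no key and put the whole "department-salary" string in the value.
line_without_key = format_output_line(None, f"{department}-{max_salary}")

print(repr(line_with_key))
print(repr(line_without_key))
```

Either layout carries the same information. The choice mostly depends on what the next consumer of the file expects to parse.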

How do I decide the number of mappers and the associated reducers?

In general, one mapper is created for each input split. For example, if your data is less than 128 MB and the split size is also 128 MB, one mapper is created; if your data is 200 MB, 2 mappers are created.
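The split arithmetic above can be checked with a rough sketch. It assumes a single, splittable, non-empty file and a 128 MB split size. The `slop` parameter models my understanding that file-based input formats let the last split run about 10% over the split size instead of creating a tiny extra split; treat that factor, and the name `count_splits`, as assumptions.

```python
MB = 1024 * 1024


def count_splits(file_size, split_size=128 * MB, slop=1.1):
    """Return how many input splits (and so map tasks) one file would produce.

    Assumption: the last split may be up to `slop` times the split size,
    so a small remainder is folded into the final split instead of
    getting a mapper of its own.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    splits = 0
    remaining = file_size
    while remaining / split_size > slop:
        splits += 1
        remaining -= split_size
    # Whatever is left (always non-zero here) becomes the final split.
    return splits + 1


print(count_splits(100 * MB))  # data smaller than one split
print(count_splits(200 * MB))  # one full split plus a 72 MB remainder
```

With these assumptions the 100 MB and 200 MB cases give 1 and 2 mappers, matching the numbers in the answer. With many input files the counts add up per file, so lots of small files means lots of mappers.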

The number of reducers can be specified by the programmer, based on how many output files are to be created and how many partitions are used within the program.
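The link between reducers, partitions, and output files can be sketched as follows. Each map output key is assigned to one partition by hashing it modulo the number of reducers, and each reducer writes one output file. This is plain Python, not Hadoop code. `zlib.crc32` is only a stand-in for the key's hash code (Python's built-in string hash changes between runs), and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Pick the reducer for a key: a stable hash of the key modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def assign_to_reducers(pairs, num_reducers):
    """Return one list of (key, value) pairs per reducer."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return [partitions[i] for i in range(num_reducers)]


# Hypothetical map output for the department example.
map_output = [("sales", 5000), ("hr", 6500), ("sales", 7000), ("it", 8000)]
reducers = assign_to_reducers(map_output, num_reducers=3)
for index, pairs in enumerate(reducers):
    print(f"reducer {index}: {pairs}")
```

Whatever the hash values turn out to be, every pair with the same key lands in the same partition. So more reducers spread different keys apart but never divide the values of a single key.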

3 REPLIES

balavignesh_nag

Guru

Thanks, Venkata Naga Balarama Murthy Pelluri. That gave me a clear understanding of the functionality I was asking about. I have a few more questions, though. Suppose my file is 1 TB and a large number of values fall under the same key. In that case, will one reducer be heavily loaded? Is there a way to split this work across reducers?

Contributor

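You can use a combiner in this situation: it pre-aggregates the values on the map side, so fewer records per key reach the reducer.

Increasing the number of reducers is another option, since it spreads different keys over more reduce tasks.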
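As a rough illustration of the combiner idea, here is a plain-Python sketch (not the Hadoop API) for the maximum-salary case. Each mapper's output is collapsed to one pair per key before the shuffle. The data and function names are made up. The approach relies on max being safe to apply in stages, which is not true of every aggregation (an average, for example, needs sums and counts instead).

```python
from collections import defaultdict


def combine_max(map_output):
    """Combiner: collapse one mapper's output to a single (key, max) pair per key."""
    best = {}
    for key, value in map_output:
        best[key] = max(value, best.get(key, value))
    return list(best.items())


def reduce_max(all_pairs):
    """Reducer: take the maximum of whatever values arrive for each key."""
    grouped = defaultdict(list)
    for key, value in all_pairs:
        grouped[key].append(value)
    return {key: max(values) for key, values in grouped.items()}


# Hypothetical output of two mappers, dominated by one hot key.
mapper_outputs = [
    [("sales", 5000), ("sales", 7000), ("sales", 6200), ("hr", 6500)],
    [("sales", 7100), ("sales", 4800), ("hr", 6100)],
]

without_combiner = [pair for output in mapper_outputs for pair in output]
with_combiner = [pair for output in mapper_outputs for pair in combine_max(output)]

print(len(without_combiner), len(with_combiner))
print(reduce_max(without_combiner) == reduce_max(with_combiner))
```

The reducer for the hot key still handles that key alone, but it now receives at most one value per mapper instead of every raw record. The result should not depend on the combiner running, because the framework is free to apply it zero or more times.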
