
Generation of Key value pair


On what basis is a key-value pair generated?

How do I decide the number of mappers and their associated reducers?

1 ACCEPTED SOLUTION


Re: Generation of Key value pair

On what basis is a key-value pair generated?

It depends on the data set and the required output.

In general, key-value pairs have to be specified in four places: Map Input, Map Output, Reduce Input and Reduce Output.

Map-Input:

By default (with TextInputFormat), the key is the byte offset of the line within the file and the value is the content of the line as Text.

We can change this by using a custom input format.
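To make the default map input concrete, here is a minimal plain-Python sketch (not Hadoop code) of how a text file turns into (offset, line) pairs. The function name `line_records` and the sample data are made up for illustration.

```python
def line_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # Key: byte offset where the line starts.
        # Value: the line content without its trailing newline.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


# Hypothetical input: department,name,salary
sample = b"sales,alice,5000\nsales,bob,7000\nhr,carol,4000\n"
records = list(line_records(sample))
for key, value in records:
    print(key, value)
```

Each pair printed here is what a single map call would receive as its input key and value.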

Map-Output:

The basic responsibility of the map is to filter the data and prepare it for grouping by key.

Key: the field / text / object on which the data has to be grouped and aggregated on the reducer side.

For example, if you want to find the maximum salary for each department, all the values of the same department have to be grouped and sent to the same reduce call. So department_name or department_id can be selected as the key.

Value: the fields / text / object that are to be handled within each individual reduce call.

In the above example, the value is the salary, because we have to find the maximum salary for each department.

In each reduce call, all the salaries for a specific key are available as an Iterable. We can find the maximum of those values because they all belong to one department.

Reduce-Input: It is the same as Map-Output, because the output of map is the input for reduce.
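The maximum-salary example can be sketched in plain Python to show how the map output key decides the grouping. This only simulates the map, group and reduce steps in memory. The record layout (department,name,salary) and all function names are assumptions for illustration, not Hadoop API.

```python
from collections import defaultdict


def map_record(line):
    """Map step: emit (department, salary) from a 'department,name,salary' line."""
    department, _name, salary = line.split(",")
    yield department, int(salary)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_max(department, salaries):
    """Reduce step: called once per key with every value for that key."""
    return department, max(salaries)


lines = ["sales,alice,5000", "sales,bob,7000", "hr,carol,4000"]
map_output = [pair for line in lines for pair in map_record(line)]
result = dict(reduce_max(k, v) for k, v in shuffle(map_output).items())
print(result)
```

Because the department is the map output key, each reduce call only ever sees salaries from one department.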

Reduce-Output:

These key-value pairs depend on the required output.

In our example, if the required output looks like: department name salary

then the reducer's output key can be its input key, because that is the department name, and the value will be the salary calculated in the reduce logic.

If the required output looks like: department name - salary

then the key can be null and the value can be the concatenation department name + "-" + salary.
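The two output shapes above can be compared with a small plain-Python sketch. The helper names are hypothetical, and the `render` function (tab separator, key skipped when it is None) is an assumption meant to mimic how a text output is typically written, not a copy of any Hadoop class.

```python
def as_key_value(department, salary):
    """Output form 1: department is the key, salary is the value."""
    return department, salary


def as_value_only(department, salary):
    """Output form 2: no key; the value is 'department-salary'."""
    return None, f"{department}-{salary}"


def render(key, value):
    """Write one record as a line of text (assumed: tab separator, None key omitted)."""
    return str(value) if key is None else f"{key}\t{value}"


line1 = render(*as_key_value("sales", 7000))
line2 = render(*as_value_only("sales", 7000))
print(repr(line1))
print(repr(line2))
```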

How do I decide the number of mappers and their associated reducers?

In general, one mapper is created for each input split. If your data is less than 128 MB and the split size is also 128 MB, then one mapper will be created. If your data is 200 MB, then 2 mappers will be created.
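The arithmetic behind those numbers is roughly a ceiling division of the data size by the split size. The sketch below is an approximation for a single file. The real framework computes splits per file and may let the last split be slightly larger than the split size, so treat `estimate_mappers` (a made-up name) as an estimate only.

```python
MB = 1024 * 1024


def estimate_mappers(file_size_bytes, split_size_bytes=128 * MB):
    """Approximate mapper count for one file: ceil(file size / split size)."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division, so large sizes stay exact.
    return (file_size_bytes + split_size_bytes - 1) // split_size_bytes


for size_mb in (100, 128, 200):
    print(size_mb, "MB ->", estimate_mappers(size_mb * MB), "mapper(s)")
```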

The number of reducers can be specified by the programmer, based on how many output files are to be created and how many partitions we are using within our program.
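The link between reducers and partitions can be illustrated like this: each map output key is assigned to one partition, and each partition is handled by one reducer. The sketch uses a hash of the key modulo the reducer count, which is the general idea behind hash partitioning. `zlib.crc32` is only a stand-in hash chosen here because it gives the same result on every run. It is not the hash Hadoop uses.

```python
import zlib


def partition_for(key, num_reducers):
    """Pick a reducer index for a key: hash the key, then take it modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


keys = ["sales", "hr", "finance", "sales", "hr"]
assignment = [(k, partition_for(k, 3)) for k in keys]
for key, partition in assignment:
    print(key, "-> reducer", partition)
```

Note that the same key always lands on the same reducer, whatever the number of reducers is.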


Re: Generation of Key value pair

Thanks, Venkata Naga Balarama Murthy Pelluri. That gave me a clear understanding of the functionality I was asking about. I have a few more questions, though. If my file is one TB in size and a single key has a very large number of values, will one reducer be heavily loaded? Is there a way to split this work across reducers?


Re: Generation of Key value pair

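You can use combiners in this situation. A combiner pre-aggregates the map output on the mapper side, so far fewer values per key reach the reducer.

Increasing the number of reducers is another option, although all values for a single key still go to the same reducer, so this mainly helps when the load is spread over many keys.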
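As a rough illustration of what a combiner buys you for the maximum-salary case, here is a plain-Python sketch with two simulated mappers. All names and numbers are made up. It works for max because taking a maximum of partial maximums gives the same answer. An operation without that property (a plain average, for example) cannot be combined this simply.

```python
def combine_max(pairs):
    """Combiner: keep only the local maximum per key from one mapper's output."""
    local = {}
    for key, value in pairs:
        local[key] = max(value, local.get(key, value))
    return list(local.items())


def reduce_max(pairs):
    """Reducer side: final maximum per key over everything that was shuffled."""
    final = {}
    for key, value in pairs:
        final[key] = max(value, final.get(key, value))
    return final


# Hypothetical output of two mappers.
mapper_outputs = [
    [("sales", 5000), ("sales", 7000), ("hr", 4000)],
    [("sales", 6500), ("sales", 3000), ("hr", 4500)],
]

without_combiner = [p for out in mapper_outputs for p in out]
with_combiner = [p for out in mapper_outputs for p in combine_max(out)]

print(len(without_combiner), "records shuffled without a combiner")
print(len(with_combiner), "records shuffled with a combiner")
print(reduce_max(with_combiner))
```

The reducer for a hot key still handles that key alone, but it receives at most one value per mapper instead of every raw record.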
