Firstly lets understand why we need partitioning inMapReduceFramework:
As we know that Map task take inputsplit as input and produces key,value pair as output. This key-value pairs are then feed to reduce task. But before the reduce phase , one more phase know as partitioning phase runs. This phase partition the map output based on key and keeps the record of the same key into the same partitions.
Lets take an example of Employee Analysis:
We want to find the highest paid Female and male employee from the data set.
Data Set:
Name Age Dept Gender Salary
A 23 IT Male 35
B 35 Finance Female 50
C 29 IT Male 40
Considering two map tasks gives following <k,v> as output:
Map1 o/p:
Key Value
Gender Value
Male A 23 IT Male 35
Female B 35 Finance Female 50
Map2 o/p
Key Value
Gender Value
Male C 29 IT Male 40
So , lets understand how to implement custom partitioner:
Our custom partitioner will send all <K,V> by Gender Male to one partition and <K,V> with Female to other partition .
here is the code:
public static class MyPartitioner extends Partitioner<Text,Text>{
public int getPartition(Text key, Text value, int numReduceTasks){
if(numReduceTasks==0)
return 0;
if(key.equals(new Text("Male")) )
return 0;
if(key.equals(new Text("Female")))
return 1;
}
}
Here , the getPartition() will return 0 if the key is Male and 1 if key is Female.
We can check our output in two files:
part-r-0000 and part-r-0001.