How to get the count of the last key-value pair in a MapReduce wordcount programme

Explorer

Hi,

I have been trying to write a word count programme that emits only one key-value pair, i.e. the last key-value pair that the standard wordcount MapReduce programme would produce for the input.

Here is the content of the input file in a directory :

a.txt :
====

f f g h
i i j k
l l m r
f f h h

Content of b.txt
========
r r g h
h h m m
c c b b
d d r f

O/p should be :
r 4


Here is my sample mapper and reducer code for a simple word count. Can anyone tell me what changes I should make to get the output shown above?

Mapper code:
--------------------
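import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WcMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    // Emits (token, 1) for every whitespace-separated token in the input line.
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString());
        while (st.hasMoreTokens()) {
            word.set(st.nextToken());
            context.write(word, one);
        }
    }
}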


Reducer code :
---------------------

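import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WcReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Sums all the counts received for one key and emits (key, sum).
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}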

 

Driver code:
-----------------------
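import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WcDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int status = ToolRunner.run(new WcDriver(), args);
        System.exit(status);
    }

    @Override
    public int run(String[] args) throws Exception {
        // Use the configuration prepared by ToolRunner so that -D options are honoured.
        Configuration c1 = getConf();

        // Job.getInstance replaces the deprecated "new Job(conf, name)" constructor.
        Job j1 = Job.getInstance(c1, "woc");
        j1.setJarByClass(WcDriver.class);
        j1.setMapperClass(WcMapper.class);
        j1.setReducerClass(WcReducer.class);
        j1.setInputFormatClass(TextInputFormat.class);
        j1.setOutputFormatClass(TextOutputFormat.class);
        j1.setOutputKeyClass(Text.class);
        j1.setOutputValueClass(IntWritable.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.addInputPath(j1, in);
        FileOutputFormat.setOutputPath(j1, out);

        // Remove any previous output directory so that the job can be rerun.
        FileSystem fs = out.getFileSystem(c1);
        if (fs.exists(out)) {
            fs.delete(out, true);
        }

        return j1.waitForCompletion(true) ? 0 : 1;
    }
}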


I would appreciate any help.

1 ACCEPTED SOLUTION

Mentor
Thank you for expanding on the process - it was unclear from the word "last". What you meant was "largest", given the sorting involved.

What you are looking to do is emit only the largest key (by value), i.e. something like the MAX(…) behaviour in SQL.

This is simple to perform:

1. In the Mapper's setup(…) call, initialise a zero-valued string (the lowest value in ASCII order) as the base key, along with a zeroed counter.
2. Across all map(…) calls, keep track of whether the current candidate key is greater than the previously encountered key (beginning with the base key set above). Don't emit anything just yet - just reassign the base key whenever the candidate is greater than the existing one (and reset the counter to 1). If it is equal, increment the counter.
3. In the cleanup(…) method, emit just the base key and its counter.
4. Since this is a MAX-like operation, configure a single reducer, and perform the very same max-tracking and final emit within the setup(…), reduce(…) and cleanup(…) methods of the Reducer implementation, but take care to aggregate the counts before the compare, so that you get the real count.
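
The steps above can be sketched in plain Java, without the Hadoop classes, to check the logic against the sample input. This is only an illustration under stated assumptions: MaxKeyTracker, LastKeyDemo, mapTask and reduceTask are hypothetical names, not part of Hadoop, and String.compareTo stands in for the byte-wise ordering of Text (the two agree for ASCII keys such as these). In a real job the tracker would live in the Mapper and Reducer fields, be fed from map(…) and reduce(…), and be written out from cleanup(…), with the driver calling j1.setNumReduceTasks(1).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

/** Hypothetical helper: remembers the largest key seen so far and its count. */
final class MaxKeyTracker {
    private String maxKey = ""; // base key: sorts before every non-empty token
    private int count = 0;      // zeroed counter

    /** Keep the larger key; on a tie, add to the running count. */
    void offer(String key, int n) {
        int cmp = key.compareTo(maxKey);
        if (cmp > 0) {
            maxKey = key;
            count = n;
        } else if (cmp == 0) {
            count += n;
        }
    }

    String key() { return maxKey; }
    int count() { return count; }
}

final class LastKeyDemo {
    /** Stands in for one map task: setup(), map() per line, then cleanup(). */
    static Map.Entry<String, Integer> mapTask(List<String> lines) {
        MaxKeyTracker tracker = new MaxKeyTracker();            // setup()
        for (String line : lines) {                             // map()
            StringTokenizer st = new StringTokenizer(line);
            while (st.hasMoreTokens()) {
                tracker.offer(st.nextToken(), 1);
            }
        }
        return new SimpleEntry<>(tracker.key(), tracker.count()); // cleanup()
    }

    /** Stands in for the single reducer: partial counts are added before comparing. */
    static Map.Entry<String, Integer> reduceTask(List<Map.Entry<String, Integer>> mapOutputs) {
        MaxKeyTracker tracker = new MaxKeyTracker();
        for (Map.Entry<String, Integer> e : mapOutputs) {
            tracker.offer(e.getKey(), e.getValue());
        }
        return new SimpleEntry<>(tracker.key(), tracker.count());
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("f f g h", "i i j k", "l l m r", "f f h h");
        List<String> b = Arrays.asList("r r g h", "h h m m", "c c b b", "d d r f");
        Map.Entry<String, Integer> result =
                reduceTask(Arrays.asList(mapTask(a), mapTask(b)));
        System.out.println(result.getKey() + " " + result.getValue()); // prints "r 4"
    }
}
```

Each map task emits at most one pair, so the single reducer only has to combine one record per input split, which keeps the shuffle small.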

4 REPLIES

Mentor
I am not sure I entirely follow your question - could you clarify how you expect to arrive at a simple output of "r 4" from all of that input?

The phrase "last key value pair" doesn't quite make sense to me. Could you please elaborate?

Explorer

Hi,

Normally, for the input I mentioned, we would get the output as:

f 4
g 2
h 6
...
...
r 4

But I need only the last key and its sum as the output, i.e. 'r' and its sum, 4.

How can we achieve this? Is there any way to get only the last key and its count as the output?

Explorer
Hi Harish,
Thanks for your reply.

I have another question: how can we determine the number of mappers in the wordcount programme above? Can we determine it using only the two input files, a.txt and b.txt? Or is it mandatory to know the file size and the block size as well?

Please help.
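
As a rough illustration of the question above, here is a sketch of how the number of map tasks follows from file size and block size. It assumes the default FileInputFormat behaviour: splits never span files, the split size is max(minSize, min(maxSize, blockSize)), and the last split of a file may be up to 10% larger than the split size. SplitEstimator and its methods are hypothetical names, and the 128 MB block size and 32-byte file lengths are assumed values, not taken from the thread.

```java
/** Hypothetical model of FileInputFormat split sizing; one map task runs per split. */
final class SplitEstimator {
    // The last split of a file is allowed to be up to 10% larger than the split size.
    private static final double SPLIT_SLOP = 1.1;

    /** Split size is the block size, clamped between the configured min and max. */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits for one non-empty, splittable file. */
    static int splitsForFile(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // whatever is left forms the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long size = splitSize(128 * mb, 1L, Long.MAX_VALUE); // assumed 128 MB blocks
        long[] fileLengths = {32L, 32L};                     // assumed sizes of a.txt and b.txt
        int mappers = 0;
        for (long len : fileLengths) {
            mappers += splitsForFile(len, size);
        }
        System.out.println("estimated map tasks: " + mappers); // prints "estimated map tasks: 2"
    }
}
```

Under these assumptions each small file yields exactly one split, so two input files give two map tasks; block size only starts to matter once a file grows beyond roughly one split in size.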