I need some help understanding the behaviour of the reduce-side join code (the reduce method of ReduceSideJoinDemoReducer) that I have attached. I have also attached the console output.


I am new to Hadoop and am trying to write a reduce-side join over two input files, customer data and transaction data, that prints each customer's record along with that customer's transaction count.


In the reduce method, I iterate through the values for a key. When the value is a customer record, I set it as the output key. When the value is a transaction record, I increment the count.


In the reducer output I expect the customer record to be printed along with the count of transactions. Instead, for certain keys I get a transaction record.


On analysis I found that the last value in the Iterable passed to the reduce method ends up in keyOut instead of the customer record (even though I assign keyOut only when the value is a customer record). In the example below, a transaction record is the last value, and it becomes the output key (even though I never assigned it).

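Below is a snapshot of the log from processing one key and its list of values in the reduce method. I expect the customer record (the CustRec value) to be written as the output key, not the last transaction record.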

values.type :class org.apache.hadoop.mapreduce.task.ReduceContextImpl$ValueIterable

Reducer input key :4000009
value : CustRec Malcolm,Wagner,39,Artist
is a Cust Rec
value : TransRec 00000011,06-18-2011,4000009,121.39,Outdoor Play Equipment,Swing Sets,Columbus,Ohio,credit
is a Trans Rec
value : TransRec 00000025,10-14-2011,4000009,144.20,Indoor Games,Darts,Phoenix,Arizona,credit
is a Trans Rec
value : TransRec 00000022,10-10-2011,4000009,019.64,Water Sports,Kitesurfing,Saint Paul,Minnesota,credit
is a Trans Rec
value : TransRec 00000023,05-02-2011,4000009,099.50,Gymnastics,Gymnastics Rings,Springfield,Illinois,credit
is a Trans Rec
value : TransRec 00000026,10-11-2011,4000009,031.58,Combat Sports,Wrestling,Orange,California,credit
is a Trans Rec
value : TransRec 00000012,02-08-2011,4000009,041.52,Indoor Games,Bowling,San Francisco,California,credit
is a Trans Rec

Values that will be written into Reducer out is: TransRec 00000012,02-08-2011,4000009,041.52,Indoor Games,Bowling,San Francisco,California,credit 6



import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class ReduceSideJoinDemoReducer extends Reducer<Text, Text, Text, Text>{
	private Text keyOut;    // customer record to emit as the output key
	private int valueOut;   // number of transaction records seen for the current key
	public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{
		valueOut = 0;
		System.out.println("\nvalues.type :"+ values.getClass() );
		System.out.println("\nReducer input key :"+ key );
		for (Text value : values ){
			System.out.println("\tvalue :"+ value);
			//System.out.println("value.hashCode()" + (Object)value.hashCode());
			if (value.toString().matches(".*\tCustRec\t(.)*")) {
				System.out.println("\t\tis a Cust Rec");
				keyOut = value;  // To be analyzed: how does keyOut end up referring to the last value seen in the for loop?
				//keyOut = new Text(value); // If I create a new Text object, the code works as expected.
			}else if (value.toString().matches(".*\tTransRec\t(.)*")){
				System.out.println("\t\tis a Trans Rec");
				valueOut += 1;
			}
		}
		System.out.println("\nValues that will be written into Reducer out is:" + keyOut.toString() + "\t" + new Text(valueOut+""));
		context.write(keyOut, new Text(valueOut+""));
	}
}

Note that when I create a new Text object, the code works as I expect (that is, using keyOut = new Text(value); instead of keyOut = value; at line 109 of the attached code).
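The difference between the two assignments can be reproduced without Hadoop. The sketch below is only a simulation under one assumption: that the values Iterable hands back the same object on every iteration and merely refills it, which is what the Reducer documentation says the framework does with key and value objects. MutableText, reusingIterable, keepReference and keepCopy are hypothetical names made up for this illustration; they are not Hadoop classes or methods. The records are taken from the log above.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ValueReuseDemo {
    /** Hypothetical stand-in for a mutable, reusable value object such as Text. */
    public static final class MutableText {
        private String content = "";
        public MutableText() {}
        /** Copy constructor: takes a snapshot of the other object's current content. */
        public MutableText(MutableText other) { this.content = other.content; }
        public void set(String s) { content = s; }
        @Override public String toString() { return content; }
    }

    /** Assumption: one shared instance is refilled on every call to next(). */
    public static Iterable<MutableText> reusingIterable(final List<String> records) {
        final MutableText shared = new MutableText();
        return () -> {
            final Iterator<String> it = records.iterator();
            return new Iterator<MutableText>() {
                @Override public boolean hasNext() { return it.hasNext(); }
                @Override public MutableText next() { shared.set(it.next()); return shared; }
            };
        };
    }

    /** Mirrors "keyOut = value": stores a reference to the shared object. */
    public static String keepReference(Iterable<MutableText> values) {
        MutableText keyOut = null;
        for (MutableText value : values) {
            if (value.toString().startsWith("CustRec")) {
                keyOut = value;
            }
        }
        return String.valueOf(keyOut);
    }

    /** Mirrors "keyOut = new Text(value)": stores an independent copy. */
    public static String keepCopy(Iterable<MutableText> values) {
        MutableText keyOut = null;
        for (MutableText value : values) {
            if (value.toString().startsWith("CustRec")) {
                keyOut = new MutableText(value);
            }
        }
        return String.valueOf(keyOut);
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList(
            "CustRec\tMalcolm,Wagner,39,Artist",
            "TransRec\t00000011,06-18-2011,4000009,121.39,Outdoor Play Equipment,Swing Sets,Columbus,Ohio,credit",
            "TransRec\t00000012,02-08-2011,4000009,041.52,Indoor Games,Bowling,San Francisco,California,credit");
        // The stored reference now shows whatever was loaded last: the final TransRec.
        System.out.println("reference: " + keepReference(reusingIterable(records)));
        // The copy still holds the CustRec content captured at assignment time.
        System.out.println("copy:      " + keepCopy(reusingIterable(records)));
    }
}
```

If that assumption holds, keyOut = value never copies the customer record. It only remembers the shared object, whose content has been overwritten by the time the loop ends, which would match the log above.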