Incorrect UDAF output

Hi all,


After a brief search through the archives, I didn't find any reference to the problem I'm describing below, so I opened this new topic. We are coming across a very peculiar bug, involving incorrect output when using a UDAF. The bug shows up non-deterministically (for the most part), which is what makes it weirder.


Our codebase for this task consists of a Java class that implements the UDAF (I'm aware that the UDAF class is deprecated) and a corresponding HQL query. Our UDAF looks like the one below:

 

public class ProblemClassUDAF extends UDAF {

  public static class Info implements Serializable{

    String x1 = null;
    String x2 = null;
    String x3 = null;

    public Info(){}

    @Override
    public int hashCode(){
      return this.x2.hashCode();
    }

    @Override
    public boolean equals( Object obj ){
      if ( this == obj )
        return true;

      if ( obj instanceof Info ){
        Info othInfo = (Info)obj;
        return this.x1.equals( othInfo.x1 ) && this.x2.equals( othInfo.x2 ) && this.x3.equals( othInfo.x3 );
      }
      return false;
    }
  }

  public static class InternalEvaluator implements UDAFEvaluator{

    private Set<Info> details = null;

    public InternalEvaluator(){
      super();
      this.init();
    }

    @Override
    public void init(){
      this.details = new HashSet<Info>();
    }

    public boolean iterate( String x1, String x2, String x3 ){
      Info info = new Info();
      info.x1 = x1;
      info.x2 = x2;
      info.x3 = x3;
      return this.details.add( info );
    }

    public Set<Info> terminatePartial(){
      return this.details;
    }

    public boolean merge( List<Info> otherInfo ){
      return this.details.addAll( otherInfo );
    }

    public String terminate(){
      StringBuilder sb = new StringBuilder();
      for ( Info info : this.details )
        sb.append( info.x1 ).append( "," ).append( info.x2 ).append( "," ).append( info.x3 ).append( "\t" );
      return sb.toString();
    }
  }
}


 

As you can see, I have overridden both the hashCode() and the equals() methods of the internal Info class. The variable used for hashing is never changed anywhere in my code. It is initialized only once, within the iterate() method, and the object is added directly to the details set. I only reference those variables again when terminate() is called, to create the final output.


The query that I have looks like the following:

 

CREATE TEMPORARY FUNCTION foo AS 'ProblemClassUDAF';
SELECT doc_id, foo( name, email, address )
FROM mytable
GROUP BY doc_id;


For the most part, the query returns the correct output. However, for some doc_id's some details will be missing, while in other cases, the same information of (name, email, address) will appear multiple times for a single doc_id. Note that this should not be happening, as the internal implementation of foo() uses a HashSet<> (that's a java.util.HashSet ) which does not allow duplicates. I will also point out (again), that both hashCode() and equals() methods have been implemented for the Info class shown above.


By logging additional information to figure out what's going on, I am seeing that the problem starts occurring within the merge() function.


Let's say that the details set contains the following 2 entries:

a) (x1="alice", x2="alice@gmail.com", x3="alice address")

b) (x1="bob", x2="bob@gmail.com", x3="bob address")


At one point, most likely through a different partial result, the merge() method is called with the provided list containing the same entries, but let's say in reverse order

1) (x1="bob", x2="bob@gmail.com", x3="bob address")

2) (x1="alice", x2="alice@gmail.com", x3="alice address")

Given the code I've provided above, the details set should not be adding those two entries, because they already exist. However, it is adding them and this is exactly what I do not understand!


Using further logging, I see that the hash codes for each entry are correctly computed (based on the email address). The surprising part is, though, that each item of the list is checked against the wrong entry in the set. In particular, let's say that "bob@gmail.com".hashCode() = 100 and "alice@gmail.com".hashCode() = 500. In the merge() method, before the addAll() method is called, the hashCodes are correct for both the old and the new values. Logging information in the equals() method though shows that the performed checks are the following:


"alice" != "bob" && "alice@gmail.com" != "bob@gmail.com" && "alice address" != "bob address"

"bob" != "alice" && "bob@gmail.com" != "alice@gmail.com" && "bob address" != "alice address"


which, of course, does not make sense as the hash codes should have routed the objects differently. As a result, my output for this particular instance contains 4 entries, when it should contain only 2. I have checked multiple times and there are no additional whitespaces that could alter the result. The situation is as described above.


Note: If, within the merge() method, I first add everything to another list and then pass them to the details set, then the final result correctly contains 2 items, not 4.


Is it possible that something very wrong is happening with the serialization of the objects (including the Set<> I'm using)?

The fact that adding everything from scratch works fine would point in that direction, but I can't put my finger on it.

Is this some weird bug that I've come across? I've noticed that others have mentioned something similar for Spark, but not for Hive.


Your help and feedback are very much appreciated.


Thanks!

George

1 ACCEPTED SOLUTION

Maybe I've missed the documentation where this is discussed, but the problem seems to be that Hive maintains a reference to the objects that it passes to the merge() function. I'm posting the solution here in case someone else comes across this error.

 

As a result, the Info items in the List<Info> that is passed to merge() may be overwritten (that is, their contents may change) after they've been passed to the method. This, of course, creates all kinds of problems with the Set<>: the contents of an element change, but the object itself is never removed from and reinserted into the set, so it stays filed under its old hash code.
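
To make the effect concrete, here is a minimal, Hive-free sketch of what seems to be going on. Everything in it is an assumption made for illustration: Entry is a cut-down stand-in for Info, and the in-place overwrite is done by hand rather than by Hive. It only shows how java.util.HashSet behaves when an element changes after insertion, which would also explain the odd equals() comparisons seen in the logs.

```java
import java.util.HashSet;
import java.util.Set;

class StaleHashDemo {

    // Simplified stand-in for Info: hashes on the email only, as in the question.
    static class Entry {
        String name;
        String email;

        Entry(String name, String email) {
            this.name = name;
            this.email = email;
        }

        @Override
        public int hashCode() {
            return email.hashCode();
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof Entry)) return false;
            Entry other = (Entry) obj;
            return name.equals(other.name) && email.equals(other.email);
        }
    }

    // Mutates an element after insertion and returns the resulting set size.
    static int sizeAfterInPlaceOverwrite() {
        Set<Entry> details = new HashSet<>();

        Entry reused = new Entry("bob", "bob@gmail.com");
        details.add(reused); // filed under the hash of "bob@gmail.com"

        // Assumption: this mimics the caller overwriting an object it handed over earlier.
        reused.name = "alice";
        reused.email = "alice@gmail.com";

        // Same hash as the stale entry, so equals() runs, but it now compares
        // "bob" against "alice" and fails, so this entry is added.
        details.add(new Entry("bob", "bob@gmail.com"));

        // Nothing is filed under the hash of "alice@gmail.com", so this entry is
        // added too, although an element with the same contents already exists.
        details.add(new Entry("alice", "alice@gmail.com"));

        return details.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterInPlaceOverwrite()); // prints 3
    }
}
```

The set ends up with three elements, two of which carry alice's values, which matches the duplicated rows in the query output.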

 

The proper way of handling this is by creating a new object and adding that to the set.

The merge() method should therefore be as follows:

 

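    public boolean merge( List<Info> otherInfo ){
      boolean changed = false;
      for ( Info i : otherInfo ){
        // Copy the fields so the set never holds an object that Hive may overwrite later
        Info copy = new Info();
        copy.x1 = i.x1;
        copy.x2 = i.x2;
        copy.x3 = i.x3;
        changed |= this.details.add( copy );
      }
      return changed;
    }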

The rest of the code doesn't have to change.
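
For anyone who wants to try the copying approach outside of Hive, the following is a rough, self-contained sketch. The copy constructor, the mergeByCopy() and fill() helpers and the recycled object are all hypothetical names introduced for this example; the recycled object is overwritten by hand to imitate the reuse described above, so this is a simulation under that assumption and not a reproduction of Hive's internals.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DefensiveCopyDemo {

    static class Info {
        String x1, x2, x3;

        Info() {}

        // Copy constructor: an addition for this sketch, not part of the original class.
        Info(Info other) {
            this.x1 = other.x1;
            this.x2 = other.x2;
            this.x3 = other.x3;
        }

        @Override
        public int hashCode() {
            return x2.hashCode();
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj) return true;
            if (!(obj instanceof Info)) return false;
            Info o = (Info) obj;
            return x1.equals(o.x1) && x2.equals(o.x2) && x3.equals(o.x3);
        }
    }

    // Adds a private copy of every incoming item; returns true if the set changed.
    static boolean mergeByCopy(Set<Info> details, List<Info> otherInfo) {
        boolean changed = false;
        for (Info i : otherInfo) {
            changed |= details.add(new Info(i));
        }
        return changed;
    }

    // Overwrites an existing object in place, imitating a recycled buffer.
    static void fill(Info target, String x1, String x2, String x3) {
        target.x1 = x1;
        target.x2 = x2;
        target.x3 = x3;
    }

    static int sizeAfterReusedMerges() {
        Set<Info> details = new HashSet<>();
        Info recycled = new Info(); // the single object that keeps being overwritten
        List<Info> partial = new ArrayList<>();
        partial.add(recycled);

        fill(recycled, "bob", "bob@gmail.com", "bob address");
        mergeByCopy(details, partial);
        fill(recycled, "alice", "alice@gmail.com", "alice address");
        mergeByCopy(details, partial);
        fill(recycled, "bob", "bob@gmail.com", "bob address"); // bob arrives again
        mergeByCopy(details, partial);

        return details.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterReusedMerges()); // prints 2
    }
}
```

Because the set only ever holds its own copies, later overwrites of the recycled object cannot disturb it, and the repeated bob entry is rejected as a duplicate.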

 

Hope this helps and others don't need to waste as much time.

 

Cheers,

George

2 REPLIES 2

As additional information to the problem, if I set the number of reducers to 1, then the result is correct.

 

Given that, I'm leaning towards serialization being the root cause of the problem. Should my class ( Info ) implement or extend a specific interface / class? It's already implementing the Serializable interface and I've tried the Writable interface too, but this didn't fix anything.

 

Thanks in advance!

George
