
How to compare integer value between two huge datasets



I have two datasets, shown below.

Dataset-1:

ABC | 1 | 1.4

DEF | 1 | 2.5

GHI | 1 | 3.5

Dataset-2:

JKL | 0 | 1.5

MNI | 0 | 3

OPI | 0 | 7

For every record in the first dataset, I want to compare its final field with the final field of the records in the second dataset and pick the record from the second dataset whose value is closest, i.e. the one for which the absolute difference is minimal (ideally 0). The output should therefore be:

ABC | 1 | 1.4 | JKL | 0 | 1.5

DEF | 1 | 2.5 | MNI | 0 | 3

GHI | 1 | 3.5 | MNI | 0 | 3

Pig or Spark is fine.


Re: How to compare integer value between two huge datasets


I already tried a cross product, but since both datasets are huge (4 million records each), the cross product fails.

I am looking for ideas other than a cross product. Thanks.
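
One possible alternative to the cross product, sketched here in plain Python rather than Pig or Spark: sort the second dataset once by its final field, then binary-search it for each record of the first dataset. Only the two neighbours around the insertion point can be the closest match, so the work drops from n × m comparisons to roughly (n + m) log m. The function name `nearest_matches` and the tuple layout are assumptions made for illustration, and ties are broken in favour of the smaller value, which the question does not specify.

```python
from bisect import bisect_left


def nearest_matches(left, right):
    """For each record in `left`, append the record in `right` whose
    final field is closest to the final field of the left record."""
    # Sort the second dataset once by its final field.
    right_sorted = sorted(right, key=lambda rec: rec[-1])
    keys = [rec[-1] for rec in right_sorted]
    if not keys:
        return []

    result = []
    for rec in left:
        value = rec[-1]
        # Index of the first key that is >= value.
        i = bisect_left(keys, value)
        # Only the neighbour just below and the one at/above can be closest.
        candidates = []
        if i > 0:
            candidates.append(right_sorted[i - 1])
        if i < len(keys):
            candidates.append(right_sorted[i])
        # min() keeps the first candidate on a tie, i.e. the smaller value.
        best = min(candidates, key=lambda c: abs(c[-1] - value))
        result.append(rec + best)
    return result


# Sample data from the question, as tuples.
dataset_1 = [("ABC", 1, 1.4), ("DEF", 1, 2.5), ("GHI", 1, 3.5)]
dataset_2 = [("JKL", 0, 1.5), ("MNI", 0, 3), ("OPI", 0, 7)]

matches = nearest_matches(dataset_1, dataset_2)
for row in matches:
    print(" | ".join(str(field) for field in row))
```

In Spark, the same idea might be applied by collecting and sorting the second dataset, sharing it with the executors as a broadcast variable, and running the binary search inside a map over the first dataset. That assumes 4 million small records fit in memory on each executor, which would need to be checked against the actual record size.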