raghavendran_c · ‎04-29-2016

Hi,

We have an HDP 2.3.2 cluster (around 50 nodes). We run many jobs that process millions of records every day (sometimes as many as a billion records a day). We need to assign a unique ID (UUID) to each of these records and are looking to use java.util.UUID.randomUUID() for this. From the documentation and Wikipedia we see that randomUUID is good, but there is a very small chance that duplicates can be generated.

I checked the available entropy on our machines and it is >150.

While we can be sure that randomUUID will work for now, is there guidance on when *not* to use randomUUID?

We don't want to go to a centralized service for ID generation as that will create bottlenecks.

Are there any other alternatives for generating UUIDs in the Hadoop cluster? We have looked at Snowflake, Flake and FauxFlake, but are not yet convinced they will work for us.

Any pointers on this will be appreciated.

thanks,

Raga

bleonhardi · ‎04-29-2016

It's a good question. Assuming the source of entropy is good, the chances of a duplicate are essentially 0 (randomUUID has 122 random bits, so 2^122 possible values, which is roughly 5.3 × 10^36).
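To put a number on "essentially 0", here is a rough sketch using the usual birthday-bound approximation p ≈ n² / (2 · 2^122). The class and method names are made up for illustration, and the one-billion-a-day volume is just the worst case mentioned in the question.

```java
import java.util.UUID;

class UuidCollisionEstimate {
    // A version 4 UUID has 128 bits, of which 6 are fixed (version + variant).
    static final int RANDOM_BITS = 122;

    // Birthday-bound approximation: p ~ n^2 / (2 * 2^bits).
    // Only meaningful while the result is much smaller than 1.
    static double collisionProbability(double n, int bits) {
        return (n * n) / (2.0 * Math.pow(2.0, bits));
    }

    public static void main(String[] args) {
        double perDay = 1e9;           // assumed worst case from the question
        double perYear = perDay * 365; // IDs generated in one year

        System.out.printf("possible values: %.3e%n", Math.pow(2.0, RANDOM_BITS));
        System.out.printf("p(any collision) after one year: %.3e%n",
                collisionProbability(perYear, RANDOM_BITS));

        // randomUUID() always produces a version 4 (random) UUID.
        System.out.println("UUID version: " + UUID.randomUUID().version());
    }
}
```

With these assumptions the probability of even a single collision after a year comes out on the order of 10^-14, which only holds as long as the random source on every node is actually good.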

There are other ways too, however. I assume there are some ready-made solutions out there, but how about using some old-fashioned MapReduce:

Just one way: assuming you can create all the IDs in one go and the data is stored in a delimited text format, you could create a unique key based on the long offset that TextInputFormat provides for each line.

TextInputFormat provides each line of text together with a long offset (the position in bytes from the start of the file, derived from the split offsets), so you could just add this to a starting number (for example, a batch ID that is steadily increased) and create a unique number that way.
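A minimal sketch of the offset idea, outside of any Hadoop code so it can be run on its own. The bit layout (23 bits of batch ID, 40 bits of offset) and all names are my own assumptions, not something Hadoop prescribes. It also assumes each batch ID covers exactly one input file, because the byte offset is only unique within a file.

```java
class OffsetIds {
    // Hypothetical layout: low 40 bits hold the byte offset (files up to 1 TiB),
    // the 23 bits above hold the batch id, and the sign bit stays clear.
    static final int OFFSET_BITS = 40;
    static final long MAX_OFFSET = (1L << OFFSET_BITS) - 1;
    static final long MAX_BATCH = (1L << (63 - OFFSET_BITS)) - 1;

    // Packs a batch id and the line's byte offset into one non-negative long.
    static long makeId(long batchId, long byteOffset) {
        if (batchId < 0 || batchId > MAX_BATCH) {
            throw new IllegalArgumentException("batchId out of range: " + batchId);
        }
        if (byteOffset < 0 || byteOffset > MAX_OFFSET) {
            throw new IllegalArgumentException("byteOffset out of range: " + byteOffset);
        }
        return (batchId << OFFSET_BITS) | byteOffset;
    }

    public static void main(String[] args) {
        // In a mapper, byteOffset would be the LongWritable key of the record.
        System.out.println(makeId(1, 0));
        System.out.println(makeId(1, 57));
        System.out.println(makeId(2, 0));
    }
}
```

Packing the two parts into separate bit ranges instead of literally adding them keeps IDs from different batches from overlapping, as long as both values stay inside their ranges.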

There are definitely other ways to do it too, for example combining the MapReduce job ID + task ID + row number within the split.
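The composite variant could look roughly like this. It is again a plain Java sketch with hypothetical names: in a real mapper the job ID and task ID would come from the task context and the row counter would be kept by the mapper itself, whereas here they are just parameters.

```java
class CompositeIds {
    // Builds "<jobId>-<taskId>-<rowInSplit>" with zero-padded numeric parts,
    // so the ids have a fixed width for a given job id.
    static String makeId(String jobId, int taskId, long rowInSplit) {
        if (taskId < 0 || rowInSplit < 0) {
            throw new IllegalArgumentException("taskId and rowInSplit must be >= 0");
        }
        return String.format("%s-%06d-%012d", jobId, taskId, rowInSplit);
    }

    public static void main(String[] args) {
        String jobId = "job_1"; // placeholder, not a real Hadoop job id
        long row = 0;           // per-task counter, incremented for every record
        System.out.println(makeId(jobId, 3, row++));
        System.out.println(makeId(jobId, 3, row++));
    }
}
```

One design point: it is probably safer to use the task ID rather than the task attempt ID, so that a retried or speculative attempt regenerates the same IDs for the same rows instead of producing new ones.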

bleonhardi · ‎04-29-2016