Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Using java.util.UUID.randomUUID() for UUID generation

New Member

Hi,

We have an HDP 2.3.2 cluster (around 50 nodes). We have many jobs that process millions of records every day (sometimes as many as a billion records a day). We need to assign a unique ID (UUID) to each of these records and are looking to use java.util.UUID.randomUUID() for this. From the documentation and Wikipedia we see that randomUUID is good, but there is a very small chance that duplicates can be generated.

I checked the available entropy on our machines and it is >150.

While we can be sure that randomUUID will work for now, is there guidance on when *not* to use randomUUID?

We don't want to go to a centralized service for ID generation as that will create bottlenecks.

Are there any other alternatives for generating UUIDs in the Hadoop cluster? We have looked at Snowflake, Flake & FauxFlake, but are not yet convinced they will work for us.

Any pointers on this will be appreciated.

thanks,

Raga

1 ACCEPTED SOLUTION

Master Guru

It's a good question. Assuming the source of entropy is good, the chance of a duplicate is essentially 0: randomUUID() returns a version 4 UUID with 122 random bits, so there are 2^122 (about 5.3 x 10^36) possible values.
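To put a number on "essentially 0", here is a rough sketch of the usual birthday-bound approximation, p ~ 1 - exp(-n^2 / (2 * 2^122)), for n randomly generated IDs. The class and method names are made up for illustration, and the estimate assumes all 122 random bits are uniformly random, which is exactly the "good entropy" assumption above.

```java
public class UuidCollisionEstimate {

    // Approximate probability of at least one duplicate among n random
    // version 4 UUIDs (122 random bits), using the birthday bound.
    // Assumes the random bits are uniform and independent.
    static double collisionProbability(double n) {
        double space = Math.pow(2, 122);
        // -expm1(-x) computes 1 - e^(-x) without losing precision for tiny x
        return -Math.expm1(-(n * n) / (2.0 * space));
    }

    public static void main(String[] args) {
        double perDay = 1e9;                 // one billion records a day
        double tenYears = perDay * 365 * 10; // total IDs after ten years
        System.out.println("one day:   " + collisionProbability(perDay));
        System.out.println("ten years: " + collisionProbability(tenYears));
    }
}
```

At a billion IDs a day for ten years this comes out on the order of 10^-12, so in practice the quality of the random source matters far more than the size of the ID space.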

There are other ways too, however. I assume there are some ready-made solutions out there, but how about using some old-fashioned MapReduce:

Just one way: assuming you could create all the IDs in one go and you had the data stored in a delimited format, you could create a unique key based on the long offset that TextInputFormat provides for each line.

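TextInputFormat provides each line of text together with a long key, which is the byte offset of that line from the start of the file. So you could just add this to a starting number (for example, a batch ID that is steadily increased) and create a unique number that way. For the result to stay unique, the starting numbers have to be spaced further apart than the largest file, and each input file needs its own starting number, since the offsets restart at 0 in every file.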
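A minimal sketch of that idea in plain Java, without the Hadoop classes. The bit layout (23 bits of batch ID, 40 bits of byte offset) and the names OffsetIds, makeId and lineOffsets are assumptions for illustration only. lineOffsets just mimics the keys a mapper would receive for a small in-memory "file", and packing the two values into separate bit ranges is one way to keep the starting numbers far enough apart.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetIds {

    // Hypothetical layout: low 40 bits hold the byte offset (files up to 1 TiB),
    // the next 23 bits hold the batch ID, and the sign bit stays 0.
    static final int OFFSET_BITS = 40;
    static final int BATCH_BITS = 23;

    static long makeId(long batchId, long byteOffset) {
        if (batchId < 0 || batchId >= (1L << BATCH_BITS)) {
            throw new IllegalArgumentException("batchId out of range: " + batchId);
        }
        if (byteOffset < 0 || byteOffset >= (1L << OFFSET_BITS)) {
            throw new IllegalArgumentException("byteOffset out of range: " + byteOffset);
        }
        return (batchId << OFFSET_BITS) | byteOffset;
    }

    // Stand-in for the keys a line-oriented input format would supply:
    // the byte offset at which each newline-terminated line starts.
    static List<Long> lineOffsets(String data) {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        for (String line : data.split("\n")) {
            offsets.add(pos);
            pos += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return offsets;
    }

    public static void main(String[] args) {
        long batchId = 7; // would be a distinct value for every input file
        for (long offset : lineOffsets("alice,1\nbob,2\ncarol,3\n")) {
            System.out.println(makeId(batchId, offset));
        }
    }
}
```

In a real job the batch ID would have to be assigned per input file (or derived from the file name), because two files in the same job both start at offset 0.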

There are definitely other ways to do it too, for example by combining the MapReduce job ID, the task ID and the row number within the split.
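A rough sketch of the job ID + task ID + row counter variant, again in plain Java so it runs without a cluster. TaskLocalIdGenerator is a hypothetical helper, not a Hadoop class. In a real mapper the job and task identifiers would come from the task context (for example via context.getTaskAttemptID()), and the counter would be incremented once per map() call.

```java
public class TaskLocalIdGenerator {

    private final String prefix;
    private long rowInSplit = 0; // rows seen so far by this task

    // jobId and taskId are passed in as plain values here; in a real job
    // they would be read from the task context during setup.
    public TaskLocalIdGenerator(String jobId, int taskId) {
        this.prefix = jobId + "-" + taskId + "-";
    }

    // Returns an ID that is unique as long as (jobId, taskId) is unique
    // per task and each task numbers its own rows.
    public String nextId() {
        return prefix + (rowInSplit++);
    }

    public static void main(String[] args) {
        TaskLocalIdGenerator gen = new TaskLocalIdGenerator("job_201601010000_0001", 3);
        System.out.println(gen.nextId());
        System.out.println(gen.nextId());
    }
}
```

One design point worth considering: building the prefix from the task ID rather than the task attempt ID means a retried or speculative attempt regenerates the same IDs for the same rows, provided the input is read in the same order.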

