Created 04-29-2016 02:28 PM
Hi,
We have an HDP 2.3.2 cluster (around 50 nodes). We have many jobs that process millions of records every day (sometimes as many as a billion records a day). We need to assign a unique ID (UUID) to each of these records and are looking at java.util.UUID.randomUUID() for this. From the documentation and Wikipedia we see that randomUUID is a good fit, but there is a very small chance that duplicates can be generated.
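For context, the usage we have in mind is roughly the following (a simplified sketch; the RecordTagger class and the tab-delimited output are just placeholders, not our actual job code):

import java.util.UUID;

// Simplified sketch: tag each record with a random (type 4) UUID.
public class RecordTagger {
    public static String tag(String record) {
        // UUID.randomUUID() draws 122 random bits from a cryptographically strong PRNG.
        String id = UUID.randomUUID().toString();
        return id + "\t" + record;
    }
}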
I checked the available entropy on our machines and it is above 150.
While we can be sure that randomUUID will work for now, is there guidance on when *not* to use randomUUID?
We don't want to go to a centralized service for ID generation as that will create bottlenecks.
Are there any other alternatives for generating unique IDs in the Hadoop cluster? We have looked at Snowflake, Flake & FauxFlake, but are not yet convinced they will work for us.
Any pointers on this will be appreciated.
thanks,
Raga
Created 04-29-2016 05:22 PM
It's a good question. Assuming the source of entropy is good, the chance of a duplicate is essentially zero: randomUUID has 2^122 possible values, so by the birthday bound you would need to generate on the order of 2^61 UUIDs before a collision becomes at all likely.
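To put a rough number on it, here is a back-of-the-envelope birthday estimate (just a sketch; the one-billion-records-per-day figure is taken from your question and extrapolated to a year):

// Rough birthday-bound estimate, not production code.
// p(collision) ~= 1 - exp(-n^2 / 2^123) for n random type-4 UUIDs (122 random bits).
public class UuidCollisionEstimate {
    public static void main(String[] args) {
        double n = 1e9 * 365;                    // one billion records/day for a year
        double space = Math.pow(2, 122);         // number of distinct type-4 UUIDs
        double p = 1 - Math.exp(-(n * n) / (2 * space));
        System.out.printf("Approximate collision probability: %.2e%n", p);
        // Prints a value on the order of 1e-14 -- negligible for this workload.
    }
}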
There are other ways too, however. I assume there are some ready-made solutions out there, but how about using some old-fashioned MapReduce?
Just one way: assuming you can create all the IDs in one go and the data is stored in a delimited format, you can create a unique key based on the long offset that TextInputFormat provides for each line.
TextInputFormat provides each line of text together with a long key that is the line's byte offset from the start of the file, so you can add that offset to a starting number (for example a batch ID that is steadily increased) and create a unique number that way, as in the sketch below.
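A minimal sketch of that idea (assuming one record per line and a single input file per batch, since the offset is only unique within one file; the "batch.base" property name is made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assigns id = batchBase + byte offset of the line within the input file.
public class OffsetIdMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private long batchBase;

    @Override
    protected void setup(Context context) {
        // The caller must pick a base larger than any previous batch's offset range.
        batchBase = context.getConfiguration().getLong("batch.base", 0L);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key.get() is the byte offset of this line from the start of the file,
        // which is unique within a single input file.
        long id = batchBase + key.get();
        context.write(new LongWritable(id), value);
    }
}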
There are definitely other ways to do it too, for example combining the MapReduce job ID, task ID, and a row-in-split counter.
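Another rough sketch along those lines, using the task attempt ID (which already embeds the job ID) plus a per-task counter; the attempt ID string shown in the comment is made up:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Builds ids of the form <taskAttemptId>_<rowInSplit>, unique across the whole job.
public class TaskIdMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String taskPrefix;
    private long rowInSplit = 0;

    @Override
    protected void setup(Context context) {
        // e.g. "attempt_1461939200000_0042_m_000007_0"
        taskPrefix = context.getTaskAttemptID().toString();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String id = taskPrefix + "_" + rowInSplit++;
        context.write(new Text(id), value);
    }
}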