Uniqueness of UUID substring

We track an internal entity with java.util generated UUID. New requirement is to pass this object to a third party who requires a unique identifier with a max character limit of 11. In lieu of generating, tracking and mapping an entirely new unique ID we are wondering if it is viable to use a substring of the UUID as a calculated field. The number of records is at most 10 million.
java.util.UUID.randomUUID().toString() // code used to generate
Quotes from other resources (incl. SOF):
"....only after generating 1 billion UUIDs every second for approximately 100 years would the probability of creating a single duplicate reach 50%."
"Also be careful with generating longer UUIDs and substring-ing them, since some parts of the ID may contain fixed bytes (e.g. this is the case with MAC, DCE and MD5 UUIDs)."
We will check out existing IDs' substrings for duplicates. What are the chances the substring would generate a duplicate?

This is an instance of the Birthday Problem. One formulation of the B.P.: Given n possible values sampled uniformly at random with replacement, how many values can we sample before some value has been seen at least twice with probability p?
For the classic instance of the problem,
p = 0.5, n = the 365 days of the year
and the answer is 23. In other words, the odds are 50% that two people share the same birthday when you are surveying 23 people.
You can plug in
n = the number of possible UUIDs
instead to get the kind of cosmically large sample size required for a 50% probability of a collision, something like the billion-per-second figure quoted above. It is
n = 16^32
for a 32-character string drawn from the 16 case-insensitive hex digits.
The B.P. is a relatively expensive problem to compute exactly, as there is no known simple closed-form formula for it. In fact, I just tried it for your 11-character substring (n = 16^11) on Wolfram Alpha Pro, and it timed out.
However, I found an efficient implementation of a closed-form estimate here. And here's my adaptation of that Python code.
import math

def find(p, n):
    return math.ceil(math.sqrt(2 * n * math.log(1/(1-p))))
If I plug in the classic B.P. numbers, I get an answer of 23, which is right. For the full UUID numbers,
find(.5, math.pow(16, 32)) / 365 / 24 / 60 / 60 / 100
my result is actually close to 7 billion UUIDs per second for 100 years! Maybe this estimate is too coarse for large numbers, though I don't know what method your source used.
For the 11-character string? You only have to generate about 5 million IDs total to reach a 50% chance of a collision. For a 1% chance, it's only about 600,000 total. That probably still overestimates safety compared to your source's method, and we are already being optimistic by assuming the substring is effectively random.
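For reference, here is how those figures fall out of the estimate (the find() function is repeated so the snippet runs standalone):

import math

def find(p, n):
    # approximate number of random samples from n possible values before
    # a collision occurs with probability p: sqrt(2 * n * ln(1/(1-p)))
    return math.ceil(math.sqrt(2 * n * math.log(1/(1-p))))

n = 16 ** 11                # possible 11-hex-character substrings
print(find(0.5, n))         # ~4.9 million IDs for a 50% collision chance
print(find(0.01, n))        # ~600,000 IDs for a 1% collision chance

With the stated maximum of 10 million records, that is already well past the 50% point.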
My engineering advice: Do you really need the guarantees that UUIDs provide aside from uniqueness, such as non-enumerability, and assurance against collisions in a distributed context? If not, then just use a sequential ID, and avoid these complications.

Related

Don't you get a random number after doing modulo on a hashed number?

I'm trying to understand hash tables, and from what I've seen the modulo operator is used to select which bucket a key will be placed in. I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation. Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
I'm sure I'm misunderstanding this. Can someone explain?
Don't you get a random number after doing modulo on a hashed number?
It depends on the hash function.
Say you have an identity hash for numbers - h(n) = n - then if the keys being hashed are generally incrementing numbers (perhaps with an occasional omission), then after hashing they'll still generally hit successive buckets (wrapping at some point from the last bucket back to the first), with low collision rates overall. Not very random, but it works out well enough. If the keys are random, it still works out pretty well - see the discussion of random-but-repeatable hashing below. The problem is when the keys are neither roughly-incrementing nor close-to-random - then an identity hash can provide terrible collision rates. (You might think "this is a crazy bad example hash function, nobody would do this"; actually, most C++ Standard Library implementations' hash functions for integers are identity hashes.)
On the other hand, if you have a hash function that say takes the address of the object being hashed, and they're all 8 byte aligned, then if you take the mod and the bucket count is also a multiple of 8, you'll only ever hash to every 8th bucket, having 8 times more collisions than you might expect. Not very random, and doesn't work out well. But, if the number of buckets is a prime, then the addresses will tend to scatter much more randomly over the buckets, and things will work out much better. This is the reason the GNU C++ Standard Library tends to use prime numbers of buckets (Visual C++ uses power-of-two sized buckets so it can utilise a bitwise AND for mapping hash values to buckets, as AND takes one CPU cycle and MOD can take e.g. 30-40 cycles - depending on your exact CPU - see here).
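To see the alignment effect concretely, here is a small sketch (the "addresses" are just made-up 8-byte-aligned values pushed through an identity hash and then the modulo):

# hypothetical 8-byte-aligned addresses used directly as hash values
addresses = [0x1000 + 8 * i for i in range(1000)]

def buckets_used(bucket_count):
    # count how many distinct buckets the identity-hashed addresses land in
    return len({addr % bucket_count for addr in addresses})

print(buckets_used(64))   # power of two: only 64/8 = 8 distinct buckets ever used
print(buckets_used(61))   # prime: the same addresses scatter across all 61 buckets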
When all the inputs are known at compile time, and there's not too many of them, then it's generally possible to create a perfect hash function (GNU gperf software is designed specifically for this), which means it will work out a number of buckets you'll need and a hash function that avoids any collisions, but the hash function may take longer to run than a general purpose function.
People often have a fanciful notion - also seen in the question - that a "perfect hash function", or at least one with very few collisions, over some large numerical hashed-to range will provide minimal collisions in actual usage in a hash table; indeed, this Stack Overflow question is about coming to grips with the falsehood of that notion. It's just not true if there are still patterns and probabilities in the way the keys map into that large hashed-to range.
The gold standard for a general purpose high-quality hash function for runtime inputs is to have a quality that you might call "random but repeatable", even before the modulo operation, as that quality will apply to the bucket selection as well (even using the dumber and less forgiving AND bit-masking approach to bucket selection).
As you've noticed, this does mean you'll see collisions in the table. If you can exploit patterns in the keys to get fewer collisions than this random-but-repeatable quality would give you, then by all means make the most of that. If not, the beauty of hashing is that with random-but-repeatable hashing your collisions are statistically related to your load factor (the number of stored elements divided by the number of buckets).
As an example, for separate chaining - when your load factor is 1.0, 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) have one element, 1/(2e) or ~18.4% two elements, 1/(3!e) about 6.1% three elements, 1/(4!e) or ~1.5% four elements, 1/(5!e) ~.3% have five, etc. - the average chain length from non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant time operations.
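Those percentages follow from the Poisson distribution with mean equal to the load factor; a quick check for a load factor of 1.0:

import math

load_factor = 1.0
# P(a bucket holds exactly k elements) under uniform random hashing ~ Poisson(load_factor)
for k in range(6):
    p = math.exp(-load_factor) * load_factor ** k / math.factorial(k)
    print(f"{k} elements: {p:.1%}")

# average chain length measured over non-empty buckets only
print(load_factor / (1 - math.exp(-load_factor)))   # ~1.58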
I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation.
This is still true post-modulo. Minimising "the same result for different inputs" means each post-modulo value has (about) the same number of keys mapping to it. We're particularly concerned with the keys actually stored in the table, especially if there's a non-uniform statistical distribution to which keys get used. With a hash function that exhibits the random-but-repeatable quality, there will be random variation in the post-modulo mapping, but overall the buckets will be close enough to evenly balanced for most practical purposes.
Just to recap, let me address this directly:
Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
So:
random is good: if you get something like the random-but-repeatable hash quality, then your average hash collisions will statistically be capped at low levels, and in practice you're unlikely to ever see a particularly horrible collision chain, provided you keep the load factor reasonable (e.g. <= 1.0)
that said, your "near-perfect hash function...between 0 and 100,000" may or may not be high quality, depending on whether the distribution of values has patterns in it that would produce collisions. When in doubt about such patterns, use a hash function with the random-but-repeatable quality.
What would happen if you took a random number instead of using a hash function? Then doing the modulo on it? If you call rand() twice you can get the same number - a proper hash function doesn't do that I guess, or does it? Even hash functions can output the same value for different input.
This comment shows you grappling with the desirability of randomness - hopefully with earlier parts of my answer you're now clear on this, but anyway the point is that randomness is good, but it has to be repeatable: the same key has to produce the same pre-modulo hash so the post-modulo value tells you the bucket it should be in.
As an example of random-but-repeatable hashing, imagine you used rand() to populate a uint32_t a[256][8] array; you could then hash any 8-byte key (e.g. a double) by XORing together one table entry per key byte:
#include <cstdint>
#include <cstring>
uint32_t a[256][8]; // filled once with rand() values at startup
uint32_t h(double d) {
    uint8_t i[8];
    std::memcpy(i, &d, 8); // copy the 8 key bytes
    return a[i[0]][0] ^ a[i[1]][1] ^ a[i[2]][2] ^ a[i[3]][3] ^
           a[i[4]][4] ^ a[i[5]][5] ^ a[i[6]][6] ^ a[i[7]][7];
}
This would produce a near-ideal (rand() isn't a great quality pseudo-random number generator) random-but-repeatable hash, but having a hash function that needs to consult largish chunks of memory can easily be slowed down by cache misses.
Following on from what [Mureinik] said, assuming you have a perfect hash function, say your array/buckets are 75% full, then doing modulo on the hashed function will probably result in a 75% collision probability. If that's true, I thought they were much better. Though I'm only learning about how they work now.
The 75%/75% thing is correct for a high quality hash function, assuming:
closed hashing / open addressing, where collisions are handled by finding an alternative bucket, or
separate chaining when 75% of buckets have one or more elements linked therefrom (which is very likely to mean the load factor (which many people may think of when you talk about how "full" the table is) is already significantly more than 75%)
Regarding "I thought they were much better." - that's actually quite ok, as evidenced by the percentages of colliding chain lengths mentioned earlier in my answer.
I think you have the right understanding of the situation.
Both the hash function and the number of buckets affect the chance of collisions. Consider, for example, the worst possible hash function - one that returns a constant value. No matter how many buckets you have, all the entries will be lumped to the same bucket, and you'd have a 100% chance of collision.
On the other hand, if you have a (near) perfect hash function, the number of buckets would be the main factor in the chance of collision. If your hash table has only 20 buckets, the minimal chance of collision will indeed be 1 in 20 (over time). If the hash values weren't uniformly spread, you'd have a much higher chance of collision in at least one of the buckets. The more buckets you have, the lower the chance of collision. On the other hand, having too many buckets will take up more memory (even if they are empty), and ultimately reduce performance, even if there are fewer collisions.
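A tiny sketch of both extremes (random 32-bit keys stand in for a near-perfect hash, with the 20 buckets from the question):

import random
from collections import Counter

buckets = 20
keys = [random.getrandbits(32) for _ in range(1000)]

worst = Counter(0 for _ in keys)              # constant hash: everything lands in bucket 0
spread = Counter(k % buckets for k in keys)   # near-uniform hash values, then modulo

print(worst)                  # Counter({0: 1000}): every entry piles into one bucket
print(max(spread.values()))   # around 1000/20 = 50 per bucket, give or take statistical noise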

How can I calculate the impact on collision probability when truncating a hash?

I'd like to reduce an MD5 digest from 32 characters down to, ideally closer to 16. I'll be using this as a database key to retrieve a set of (public) user-defined parameters. I'm expecting the number of unique "IDs" to eventually exceed 10,000. Collisions are undesirable but not the end of the world.
I'd like to understand the viability of a naive truncation of the MD5 digest to achieve a shorter key. But I'm having trouble digging up a formula that I can understand (given I have a limited Math background), let alone use to determine the impact on collision probability that truncating the hash would have.
The shorter the better, within reason. I feel there must be a simple formula, but I'd rather have a definitive answer than do my own guesswork cobbled together from bits and pieces I have read around the web.
You can calculate the chance of collisions with this formula:
chance of collision = 1 - e^(-n^2 / (2 * d))
Where n is the number of messages, d is the number of possibilities, and e is the constant e (2.718281828...).
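For example, with illustrative numbers close to the question's (10,000 IDs and a truncation to 16 hex characters, so d = 16^16):

import math

n = 10_000       # number of IDs ("messages")
d = 16 ** 16     # possibilities left after truncating to 16 hex characters
print(1 - math.exp(-n ** 2 / (2 * d)))   # ~2.7e-12, i.e. a collision is very unlikely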
#mypetition's answer is great.
I found a few other equations that are more-or-less accurate and/or simplified here, along with a great explanation and a handy comparison of real-world probabilities:
1−e^((−k(k−1))/2N) - sample plot here
(k(k-1))/2N - sample plot here
k^2/2N - sample plot here
...where k is the number of ID's you'll be generating (the "messages") and N is the largest number that can be produced by the hash digest or the largest number that your truncated hexadecimal number could represent (technically + 1, to account for 0).
A bit more about "N"
If your original hash is, for example, "38BF05A71DDFB28A504AFB083C29D037" (32 hex chars), and you truncate it down to, say, 12 hex chars (e.g. "38BF05A71DDF"), the largest number you could produce in hexadecimal is 0xFFFFFFFFFFFF (281,474,976,710,655, which is 16^12 - 1, or 256^6 - 1 if you prefer to think in terms of bytes). But since "0" itself counts as one of the numbers you could theoretically produce, you add back that 1, which leaves you simply with 16^12.
So you can think of N as 16 ^ (numberOfHexDigits).
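Putting N together with the first approximation above, a short sketch (collision_chance is a made-up helper name) comparing a few truncation lengths for, say, 10,000 IDs:

import math

def collision_chance(hex_digits, k):
    # 1 - e^(-k(k-1)/2N) with N = 16^hex_digits, as defined above
    N = 16 ** hex_digits
    return 1 - math.exp(-k * (k - 1) / (2 * N))

for digits in (8, 12, 16):
    print(digits, collision_chance(digits, 10_000))
# 8 hex chars: ~1.2%; 12 hex chars: ~1.8e-7; 16 hex chars: ~2.7e-12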

What is the shortest human-readable hash without collision?

I have a total number of W workers with long worker IDs. They work in groups, with a maximum of M members in each group.
To generate a unique group name for each worker combination, concatenating the IDs is not feasible. I am thinking of doing an MD5() on the flattened, sorted worker id list. I am not sure how many digits I should keep for it to be memorable to humans while staying safe from collisions.
Will log((26+10), W^M) be enough? How many redundant chars should I keep? Is there any other specialized hash function that works better for this scenario?
The total number of combinations of 500 objects taken by up to 10 would be approximately 2.5091E+20, which would fit in 68 bits (about 13 characters in base36), but I don't see an easy algorithm to assign each combination a number. An easier algorithm would be like this: if you assign each person a 9-bit number (0 to 511) and concatenate up to 10 numbers, you would get 90 bits. To encode those in base36, you would need 18 characters.
If you want to use a hash with just 6 characters in base36 (about 31 bits), the probability of a collision depends on the total number of groups used during the lifetime of the application. If we assume that each day there are 10 new groups (that were not encountered before) and that the application will be used for 10 years, we would get 36,500 groups. Using the calculator provided by Nick Barnes shows that there is a 27% chance of a collision in this case. You can adjust the assumptions to your particular situation and then change the hash length to fit your desired maximum chance of a collision.
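Both figures are easy to reproduce (a sketch using the assumptions above: 500 workers, groups of up to 10, a 6-character base36 hash, and 36,500 lifetime groups):

import math

# total worker combinations: groups of 1 to 10 drawn from 500 workers
combos = sum(math.comb(500, k) for k in range(1, 11))
print(f"{combos:.4e}", combos.bit_length())   # ~2.5091e+20, fits in 68 bits

# collision chance for a 6-character base36 hash over the application's lifetime
groups = 10 * 365 * 10       # 10 new groups per day for 10 years
space = 36 ** 6              # distinct 6-character base36 values
print(f"{1 - math.exp(-groups * (groups - 1) / (2 * space)):.0%}")   # roughly 26-27%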

The hash table probability

I'm still confused about how to find the hash table collision probability. I have a hash table of size 20 with open addressing that uses the hash function
hash(int x) = x % 20
How many elements need to be inserted in the hash table so that the probability of the next element hitting a collision exceeds 50%?
I used the birthday paradox https://en.wikipedia.org/wiki/Birthday_problem to find it and seem to get an incorrect answer. Where is my mistake?
calculating
1/2=1-e^(-n^2/(2*20))
ln(1/2)=ln(e)*(-n^2/40)
-0.69314718=-n^2/40
n=sqrt(27.725887)=5.265538
How many elements need to be inserted in the hash table so that the probability of the next element hitting a collision exceeds 50%.
Well, it depends on a few things.
The simple case is that you've already performed 11 inserts with distinct and effectively random integer keys, such that 11 of the buckets are in use, and your next insertion uses another distinct and effectively random key so it will hash to any bucket with equal probability: clearly there's only a 9/20 chance of that bucket being unused which means your chance of a collision during that 12th insertion exceeds 50% for the first time. This is the answer most formulas, textbooks, people etc. will give you, as it's the most meaningful for situations where hash tables are used with strong hash functions and/or prime numbers of buckets etc. - the scenarios where hash tables shine and are particularly elegant.
Another not-uncommon scenario is that you're putting say customer ids for a business into the hash table, and you're assigning the customers incrementing id numbers starting at 1. Even if you've already inserted customers with ids 1 to 19, you know they're in buckets [1] to [19] with no collisions - your hash just passes the keys through without the mod kicking in. You can then insert customer 20 into bucket [0] (after the mod operation) without a collision. Then, the 21st customer has 100% chance of a collision. (But, if your data's like this, please use an array and index directly using the customer id, or customer_id - 1 if you don't want to waste bucket [0].)
There are many other possible patterns in the keys that can affect when you exceed a 50% probability of a collision: e.g. all the keys being odd or multiples of some value, or being say ages or heights with a particular distribution curve.
The mistake with your use of the Birthday Paradox is thinking it answers your question. When you put "1/2" and "20" into the formula, it's telling you the point at which your cumulative probability of a collision reaches 1/2, but your question is about when "the probability of the next element hitting a collision exceeds 50%" (emphasis mine).
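A quick sketch of the difference (assuming effectively random keys and open addressing, so that n inserts always occupy exactly n buckets):

buckets = 20
no_collision_yet = 1.0
for n in range(1, 13):
    next_insert_collision = (n - 1) / buckets          # n-1 buckets already occupied
    no_collision_yet *= 1 - next_insert_collision
    print(n, round(next_insert_collision, 3), round(1 - no_collision_yet, 3))
# the cumulative probability (third column) first exceeds 50% at the 6th insert,
# but the next-insert probability (second column) only exceeds 50% at the 12th (11/20 = 55%)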

Specified Length Unique ID Generation

I need to create unique and random alphanumeric ID's of a set length. Ideally I would store a counter in my database starting at 0, and every time I need a unique ID I would get the counter value (0), run it through this hashing function giving it a set length (Probably 4-6 characters) [ID = Hash(Counter, 4);], it would return my new ID (ex. 7HU9), and then I would increment my counter (0++ = 1).
I need to keep the ID's short so they can be remembered or shared easily. Security isn't a big issue, so I'm not worried about people trying random ID's, but I don't want the ID's to be predictable, so there can't be an opportunity for a user to notice that the ID's increment by 3 every time allowing them to just work their way backwards through the ID's and download the ID data one-by-one (ex. A5F9, A5F6, A5F3, A5F0 == BAD).
I don't want to just loop through random strings checking for uniqueness, since this would increase database load over time as keys are used up. The intention is that hashing a unique incrementing counter would guarantee ID uniqueness up to a certain counter value, at which point the length of the generated IDs would be increased by one and the counter reset, and this pattern would continue forever.
Does anybody know of any hashing functions which would suit this need, or have any other ideas?
Edit: I do not need to be able to reverse the function to get the counter value back.
The tough part, as you realize, is guaranteeing a collision-free sequence.
If "not obvious" is the standard you need for guessing the algorithm, a simple mixed congruential RNG of full period - or rather a sequence of them with increasing modulus to satisfy the requirement for growth over time - might be what you want. This is not the hash approach you're asking for, but it ought to work.
This presentation covers the basics of MCRNGs and sufficient conditions for full period in a very concise form. There are many others.
You'd first use the lowest modulus MCRNG starting with an arbitrary seed until you've "used up" its cycle and then advance to the next largest modulus.
You will want to "step" the moduli to ensure uniqueness. For example, if your first IDs are 12 bits, you have a modulus M1 <= 2^12 (but not much less than that). When you advance to 16 bits, you'd want to pick a second modulus M2 <= 2^16 - M1, so the second tier of IDs would be M1 + x_i, where x_i is the i'th output of the second RNG. A 32-bit third tier would have modulus M3 <= 2^32 - (M1 + M2) and its IDs would be M1 + M2 + y_i, where y_i is its output, and so on.
The only persistent storage required will be the last ID generated and the index of the MCRNG in the sequence.
Someone with time on their hands could guess this algorithm without too much trouble. But a casual user would be unlikely to do so.
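For illustration only, here is a minimal sketch of one full-period mixed congruential generator for a 12-bit first tier (the constants are arbitrary choices that satisfy the full-period conditions in the presentation, not a recommendation):

def mcrng(modulus, a, c, seed):
    # mixed congruential generator; when (a, c, modulus) satisfy the full-period
    # conditions it visits every value in [0, modulus) exactly once per cycle
    x = seed
    for _ in range(modulus):
        x = (a * x + c) % modulus
        yield x

# 12-bit tier: modulus 2^12, c odd, a - 1 divisible by 4 => full period
ids = list(mcrng(4096, 1229, 2039, seed=42))
assert len(set(ids)) == 4096   # every 12-bit value is issued exactly once, in a non-obvious order
print(ids[:5])
# a 16-bit second tier would then emit M1 + x_i from another full-period generator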
Let's say that your counter ranges from 1 to 10000. Slice [1, 10000] into 10 small units, each containing 1000 numbers. Each unit will keep track of the last ID it issued.
unit-1: [1, 1000], unit-2: [1001, 2000], ... , unit-10: [9001, 10000]
When you need an ID, just randomly select one of units 1-10 and take that unit's next ID.
e.g.
The first time, your counter is 1; the random selection is unit-2, so you get ID = 1001.
The second time, your counter is 2; the random selection is unit-1, so you get ID = 1.
The third time, your counter is 3; the random selection is unit-2, so you get ID = 1002.
...and so on.
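A rough sketch of that scheme (a dictionary of per-unit counters stands in for whatever the database would store; exhausting a unit is not handled):

import random

UNIT_SIZE = 1000
# next unissued ID per unit: unit 0 covers [1, 1000], unit 1 covers [1001, 2000], ...
next_id = {u: u * UNIT_SIZE + 1 for u in range(10)}

def new_id():
    unit = random.randrange(10)   # pick a unit at random
    issued = next_id[unit]
    next_id[unit] += 1            # that unit's counter advances independently
    return issued

print([new_id() for _ in range(5)])   # e.g. [3001, 1, 7001, 3002, 9001]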
(This was a while ago but I should write up what I ended up doing...)
The idea I came up with was actually pretty simple. I wanted alphanumeric pins, so that works out to 36 possible values for each character, and I wanted to start with 4-character pins, so that works out to 36^4 = 1,679,616 possible pins. I realized that all I wanted to do was take all of these possible pins and throw away a percentage of them in a random way, such that a human being had a low chance of randomly finding one. So I divide 1,679,616 by 100, then multiply my counter by a random number between 1 and 100, and then encode that number as my alphanumeric pin. Problem solved!
By guessing a random combination of 4 letters and numbers you have a 1 in 100 chance of actually guessing a real in-use pin, which is all I really wanted. In my implementation I increment the pin length once the available pin space is exhausted, and everything worked perfectly! Been running for about 2 years now!
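The encoding step - turning the (spread-out) counter value into a fixed-length alphanumeric pin - could look something like this sketch (encode_pin is a made-up name):

import string

ALPHABET = string.digits + string.ascii_uppercase   # 36 characters: 0-9 then A-Z

def encode_pin(value, length=4):
    # encode a non-negative integer as a fixed-length base-36 alphanumeric pin
    chars = []
    for _ in range(length):
        value, digit = divmod(value, 36)
        chars.append(ALPHABET[digit])
    if value:
        raise ValueError("value does not fit in the requested pin length")
    return "".join(reversed(chars))

print(encode_pin(1_000_000))   # 'LFLS'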