We currently generate hashes for user IDs in our system (using MD5) and bucket them into ranges (the overall range is 0 to 100; the buckets can be 0-30, 31-70, 71-100). The process works like this: we calculate the hash for a user as an int value, take the first three digits, and translate that to a percentage. This percentage decides the bucket for the user. This works fine for now, and the percentages are random over the range.
Now we also want the bucketing to factor in another key for the user (their city). If I add the city as a salt to the hash, will the hash generate random buckets at the city level? For example, suppose we use hash strings like SEATTLE + user_id1 for users in Seattle and NEWYORK + user_id2 for users in New York, and calculate percentages the same way as before. Will this lead to almost-random percentages at the city level (i.e., all SEATTLE users distributed randomly in buckets from 0 to 100)?
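For concreteness, here is roughly what I have in mind (a Python sketch; the "first three digits to a percentage" step is only approximated here with a modulo, since the exact translation isn't the point):

import hashlib

def bucket_percentage(user_id, city=""):
    key = city + user_id                                      # e.g. "SEATTLE" + user_id1
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)   # hash as an int value
    first_three = int(str(digest)[:3])                        # first three decimal digits
    return first_three % 101                                  # rough translation to a 0-100 percentage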
Related
We track an internal entity with a java.util-generated UUID. A new requirement is to pass this object to a third party who requires a unique identifier with a maximum length of 11 characters. Rather than generating, tracking, and mapping an entirely new unique ID, we are wondering if it is viable to use a substring of the UUID as a calculated field. The number of records is at most 10 million.
java.util.UUID.randomUUID().toString() // code used to generate
Quotes from other resources (incl. Stack Overflow):
"....only after generating 1 billion UUIDs every second for approximately 100 years would the probability of creating a single duplicate reach 50%."
"Also be careful with generating longer UUIDs and substring-ing them, since some parts of the ID may contain fixed bytes (e.g. this is the case with MAC, DCE and MD5 UUIDs)."
We will check out existing IDs' substrings for duplicates. What are the chances the substring would generate a duplicate?
This is an instance of the Birthday Problem. One formulation of the B.P.: given a pool of n possible values sampled randomly with replacement, how many samples can we draw before some value has been seen at least twice with probability p?
For the classic instance of the problem,
p = 0.5, n = the 365 days of the year
and the answer is 23. In other words, the odds are 50% that at least two people share the same birthday when you are surveying 23 people.
You can plug in
n = the number of possible UUIDs
instead to get that kind of cosmically large sample size required for a 50% probability of a collision — something like the billion-per-second figure. It is
n = 16^32
for a 32-character string of hex digits (16 possible case-insensitive values per character).
The B.P. is a relatively expensive problem to compute, as there is no known closed-form formula for it. In fact, I just tried it for your 11-character substring (n = 16^11) on Wolfram Alpha Pro, and it timed out.
However, I found an efficient closed-form estimate implemented here. And here's my adaptation of the Python.
import math

def find(p, n):
    # Approximate number of samples from n possible values before a collision occurs with probability p.
    return math.ceil(math.sqrt(2 * n * math.log(1 / (1 - p))))
If I plug in the classic B.P. numbers, I get an answer of 23, which is right. For the full UUID numbers,
find(.5, math.pow(16, 32)) / 365 / 24 / 60 / 60 / 100   # convert the sample count into UUIDs per second, sustained for 100 years
my result is actually close to 7 billion UUIDs per second for 100 years! Maybe this estimate is too coarse for large numbers, though I don't know what method your source used.
For the 11-character string? You only have to generate about 5 million IDs in total to reach a 50% chance of a collision. For a 1% chance, it's only about 600,000 in total. And that probably still overestimates safety compared to your source - an overestimate we are already guilty of by assuming the substring is random.
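For reference, those two figures come straight from the same helper above (values rounded):

find(0.5, 16 ** 11)    # about 4.9 million IDs for a 50% chance of a collision
find(0.01, 16 ** 11)   # about 600,000 IDs for a 1% chance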
My engineering advice: Do you really need the guarantees that UUIDs provide aside from uniqueness, such as non-enumerability, and assurance against collisions in a distributed context? If not, then just use a sequential ID, and avoid these complications.
I'm just starting out with Tableau and would like to do a count-if in a loop.
I have the following variables:
City
User
Round: takes values of either A or B
Amount
I would like to have a countif function that shows how many users received any positive amount in both round A and round B in a given city.
In my dashboard, each row represents a city, and I would like to have a column that shows the total number of users in each city that received amounts in both rounds.
Thanks!
You can go for a simple solution that works.
Create a calculated field called "Positive Rounds per User" using the below formula:
// counts the number of unique rounds that had positive amounts per user in a city
{ FIXED [User], [City]: COUNTD(IIF([Amount]>0, [Round], NULL))}
You can use the above to create another calculated field called "unique users":
// unique number of users that have 2 in "Positive Rounds per User" field
COUNTD(IIF([Positive Rounds per User]=2, [User], NULL))
You can combine calculations 1 and 2 into a single field, but it gets harder to read, so it's better to split them up.
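If it helps to see the logic outside Tableau, here is a plain-Python sketch of the same two steps on made-up toy data (the field names mirror the ones above):

from collections import defaultdict

rows = [
    {"City": "Austin", "User": "u1", "Round": "A", "Amount": 10},
    {"City": "Austin", "User": "u1", "Round": "B", "Amount": 5},
    {"City": "Austin", "User": "u2", "Round": "A", "Amount": 3},
]

# Step 1: per (City, User), collect the distinct rounds that had a positive amount
positive_rounds = defaultdict(set)
for r in rows:
    if r["Amount"] > 0:
        positive_rounds[(r["City"], r["User"])].add(r["Round"])

# Step 2: per city, count the users whose set covers both rounds
users_with_both = defaultdict(int)
for (city, user), rounds in positive_rounds.items():
    if len(rounds) == 2:
        users_with_both[city] += 1

print(dict(users_with_both))   # {'Austin': 1}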
I have a total number of W workers with long worker IDs. They work in groups, with a maximum of M members in each group.
To generate a unique group name for each worker combination, concatenating the IDs is not feasible. I am thinking of doing an MD5() on the flattened, sorted worker ID list. I am not sure how many digits I should keep for it to be memorable to humans while safe from collisions.
Will log base (26+10) of W^M be enough? How many redundant chars should I keep? Is there any other specialized hash function that works better for this scenario?
The total number of combinations of 500 objects taken up to 10 at a time would be approximately 2.5091E+20, which would fit in 68 bits (about 13 characters in base36), but I don't see an easy algorithm to assign each combination a number. An easier algorithm would be like this: if you assign each person a 9-bit number (0 to 511) and concatenate up to 10 such numbers, you get 90 bits. To encode those in base36, you would need 18 characters.
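Here is a rough Python sketch of that concatenation scheme, assuming each worker has already been assigned an index from 0 to 499 (that assignment itself isn't shown):

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"

def group_name(indices, max_members=10):
    slots = sorted(i + 1 for i in indices)         # 1..500 fits in 9 bits; 0 marks an empty slot
    slots += [0] * (max_members - len(slots))      # pad so every name covers 10 slots = 90 bits
    value = 0
    for s in slots:
        value = (value << 9) | s                   # concatenate the 9-bit numbers
    out = ""
    for _ in range(18):                            # 18 base36 characters are enough for 90 bits
        value, r = divmod(value, 36)
        out = DIGITS[r] + out
    return out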
If you want to use a hash with just 6 characters in base36 (about 31 bits), the probability of a collision depends on the total number of groups used during the lifetime of the application. If we assume that each day there are 10 new groups (that were not encountered before) and that the application will be used for 10 years, we get 36500 groups. Using the calculator provided by Nick Barnes shows that there is a 27% chance of a collision in this case. You can adjust the assumptions to your particular situation and then change the hash length to fit your desired maximum chance of a collision.
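And here is a sketch of the truncated-MD5 variant together with the collision estimate, so you can plug in your own numbers (the helper names are mine):

import hashlib, math

BASE36 = "0123456789abcdefghijklmnopqrstuvwxyz"

def short_group_name(worker_ids, length=6):
    key = ",".join(sorted(str(w) for w in worker_ids))   # flattened, sorted worker ID list
    value = int(hashlib.md5(key.encode()).hexdigest(), 16)
    out = ""
    for _ in range(length):
        value, r = divmod(value, 36)
        out = BASE36[r] + out
    return out

def collision_probability(groups, length=6):
    # Birthday-problem approximation for `groups` names drawn from 36**length possibilities.
    return 1 - math.exp(-groups * (groups - 1) / (2 * 36 ** length))

print(collision_probability(36500))   # roughly 0.26, in line with the ~27% figure above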
I am still confused about how to find hash table collision probabilities. I have a hash table of size 20 with open addressing that uses the hash function
hash(int x) = x % 20
How many elements need to be inserted in the hash table so that the probability of the next element hitting a collision exceeds 50%?
I used the birthday paradox (https://en.wikipedia.org/wiki/Birthday_problem) to find it and seem to get an incorrect answer. Where is my mistake?
Calculating:
1/2 = 1 - e^(-n^2/(2*20))
ln(1/2) = ln(e) * (-n^2/40)
-0.69314718 = -n^2/40
n = sqrt(27.725887) = 5.265538
How many elements need to be inserted in the hash table so that the probability of the next element hitting a collision exceeds 50%?
Well, it depends on a few things.
The simple case is that you've already performed 11 inserts with distinct and effectively random integer keys, such that 11 of the buckets are in use, and your next insertion uses another distinct and effectively random key, so it will hash to any bucket with equal probability. Clearly there's only a 9/20 chance of that bucket being unused, which means your chance of a collision during that 12th insertion exceeds 50% for the first time. This is the answer most formulas, textbooks, people etc. will give you, as it's the most meaningful for situations where hash tables are used with strong hash functions and/or prime numbers of buckets etc. - the scenarios where hash tables shine and are particularly elegant.
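Under that "effectively random keys" assumption, the threshold is easy to check directly (a quick sketch):

# For each insertion, the chance of a collision equals (occupied buckets) / 20.
BUCKETS = 20
for occupied in range(BUCKETS + 1):
    if occupied / BUCKETS > 0.5:
        print(f"insertion #{occupied + 1}: collision probability {occupied}/{BUCKETS} > 50%")
        break
# prints: insertion #12: collision probability 11/20 > 50%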
Another not-uncommon scenario is that you're putting say customer ids for a business into the hash table, and you're assigning the customers incrementing id numbers starting at 1. Even if you've already inserted customers with ids 1 to 19, you know they're in buckets [1] to [19] with no collisions - your hash just passes the keys through without the mod kicking in. You can then insert customer 20 into bucket [0] (after the mod operation) without a collision. Then, the 21st customer has 100% chance of a collision. (But, if your data's like this, please use an array and index directly using the customer id, or customer_id - 1 if you don't want to waste bucket [0].)
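That sequential-id scenario is just as easy to verify with a toy sketch:

# Insert customer ids 1, 2, 3, ... into 20 buckets using hash(x) = x % 20 and report the first collision.
buckets = [None] * 20
for customer_id in range(1, 40):
    b = customer_id % 20
    if buckets[b] is not None:
        print(f"first collision at customer {customer_id} (bucket {b} already holds {buckets[b]})")
        break
    buckets[b] = customer_id
# prints: first collision at customer 21 (bucket 1 already holds 1)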
There are many other possible patterns in the keys that can affect when you exceed a 50% probability of a collision: e.g. all the keys being odd or multiples of some value, or being say ages or heights with a particular distribution curve.
The mistake with your use of the Birthday Paradox is thinking it answers your question. When you put "1/2" and "20" into the formula, it tells you the point at which your cumulative probability of a collision reaches 1/2, but your question is "the probability of the *next* element hitting a collision exceeds 50%" (emphasis mine).
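To see the difference numerically, here is a quick sketch of the cumulative quantity that your Birthday-Paradox formula estimates:

def p_any_collision(n, buckets=20):
    # Probability that at least one collision has occurred after n inserts with random keys.
    p_none = 1.0
    for i in range(n):
        p_none *= (buckets - i) / buckets
    return 1 - p_none

print(p_any_collision(5))   # ~0.42
print(p_any_collision(6))   # ~0.56 -> the cumulative probability crosses 50% between the 5th and 6th insert, consistent with your 5.27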
The "quality" of a hash is defined as the total number of comparisons needed to access every element once, relative to the expected number needed for a random hash. The value can go over 100%.
The total number of comparisons is equal to the sum of the squares of the number of entries in each bucket. For a random hash of n keys into k buckets, the expected value is:
n + n(n-1) / (2k)
What exactly is the quality of a hash?
It is a measure for how "evenly distributed" the hash is. Ideally, the hash function would place everything into its own bucket, but that does not happen because you cannot have that many buckets (and even then there are hash collisions, so that distinct values still end up in the same bucket).
The performance of the hash (ideally just going to a bucket and looking at the single element in there) degrades when you have buckets with many elements in them: if that happens, you have to linearly go through all of them.
A quality of 100% is what you would expect for a hash filled with random data. In that case, all buckets should be equally full. If you have more than 100%, your data is unevenly hashed, and lookups take more time.
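Here is a small sketch of how you could compute the metric yourself; I'm reading "comparisons needed to access every element once" as 1 + 2 + ... + b per bucket of size b, which is the reading under which the quoted expected value n + n(n-1)/(2k) is exact:

def hash_quality(bucket_sizes, k):
    n = sum(bucket_sizes)
    observed = sum(b * (b + 1) // 2 for b in bucket_sizes)   # linear-search cost to touch every element once
    expected = n + n * (n - 1) / (2 * k)                     # expected cost for a random hash
    return 100.0 * observed / expected                       # as a percentage; above 100% means unevenly hashed

print(hash_quality([1] * 10 + [0] * 10, 20))   # no collisions at all -> about 82%, better than random
print(hash_quality([10] + [0] * 19, 20))       # everything in one bucket -> about 449%, much worse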