Specified Length Unique ID Generation - hash

I need to create unique and random alphanumeric ID's of a set length. Ideally I would store a counter in my database starting at 0, and every time I need a unique ID I would get the counter value (0), run it through this hashing function giving it a set length (Probably 4-6 characters) [ID = Hash(Counter, 4);], it would return my new ID (ex. 7HU9), and then I would increment my counter (0++ = 1).
I need to keep the ID's short so they can be remembered or shared easily. Security isn't a big issue, so I'm not worried about people trying random ID's, but I don't want the ID's to be predictable, so there can't be an opportunity for a user to notice that the ID's increment by 3 every time allowing them to just work their way backwards through the ID's and download the ID data one-by-one (ex. A5F9, A5F6, A5F3, A5F0 == BAD).
I don't want to just loop through random strings checking for uniqueness since this would increase database load over time as key's are used up. The intention is that hashing a unique incrementing counter would guarantee ID uniqueness up to a certain counter value, at which point the length of the generated ID's would be increased by one and the counter reset, and continue this pattern forever.
Does anybody know of any hashing functions which would suit this need, or have any other ideas?
Edit: I do not need to be able to reverse the function to get the counter value back.

The tough part, as you realize, is getting to a no-collision sequence guaranteed.
If "not obvious" is the standard you need for guessing the algorithm, a simple mixed congruential RNG of full period - or rather a sequence of them with increasing modulus to satisfy the requirement for growth over time - might be what you want. This is not the hash approach you're asking for, but it ought to work.
This presentation covers the basics of MCRNGs and sufficient conditions for full period in a very concise form. There are many others.
You'd first use the lowest modulus MCRNG starting with an arbitrary seed until you've "used up" its cycle and then advance to the next largest modulus.
You will want to "step" the moduli to ensure uniqueness. For example if your first IDs are 12 bits and so you have a modulus M1 <= 2^12 (but not much less than), then you advance to 16 bits, you'd want to pick the second modulus M2 <= 2^16 - M1. So the second tier of id's would be M1+x_i where x_i is the i'th output of the second rng. A 32-bit third tier would have modulus 2^32-M2 and its output would be be M2+y_i, where y_i is its output, etc.
The only persistent storage required will be the last ID generated and the index of the MCRNG in the sequence.
Someone with time on their hands could guess this algorithm without too much trouble. But a casual user would be unlikely to do so.

Let's say that your counter is range from 1 to 10000. Slice [1, 10000] to 10 small unit, each unit contain 1000 number.These small unit will keep track of their last id.
unit-1 unit-2 unit-10
[1 1000], [1001, 2000], ... ,[9000, 10000]
When you need a ID, just random select from unit 1-10, and get the unit's newest ID.
e.g
At first, your counter is 1, random selection is unit-2, than you will get the ID=1001;
Second time, your counter is 2, random selection is unit-1, than you will get the ID=1;
Third time, your counter is 3, random selection is unit-2, than you will get the ID=1002;
...and so on.

(This was a while ago but I should write up what I ended up doing...)
The idea I came up with was actually pretty simple. I wanted alphanumeric pins, so that works out to 36 potential characters for each character, and I wanted to start with 4 character pins so that works out to 36^4 = 1,679,616 possible pins. I realized that all I wanted to do was take all of these possible pins and throw away a percentage of them in a random way such that a human being had a low chance of randomly finding one. So I divide 1,679,616 by 100 and then multiply my counter by a random number between 1 and 100 and then encode that number as my alphanumeric pin. Problem solved!
By guessing a random combination of 4 letters and numbers you have a 1 in 100 chance of actually guessing a real in-use pin, which is all I really wanted. In my implementation I increment the pin length once the available pin space is exhausted, and everything worked perfectly! Been running for about 2 years now!

Related

Uniqueness of UUID substring

We track an internal entity with java.util generated UUID. New requirement is to pass this object to a third party who requires a unique identifier with a max character limit of 11. In lieu of generating, tracking and mapping an entirely new unique ID we are wondering if it is viable to use a substring of the UUID as a calculated field. The number of records is at most 10 million.
java.util.UUID.randomUUID().toString() // code used to generate
Quotes from other resources (incl. SOF):
"....only after generating 1 billion UUIDs every second for approximately 100 years would the probability of creating a single duplicate reach 50%."
"Also be careful with generating longer UUIDs and substring-ing them, since some parts of the ID may contain fixed bytes (e.g. this is the case with MAC, DCE and MD5 UUIDs)."
We will check out existing IDs' substrings for duplicates. What are the chances the substring would generate a duplicate?
This is an instance of the Birthday Problem. One formulation of B.P.: Given a choice of n values sampled randomly with replacement, how many values can we sample before the same value will be seen at least twice with probability p?
For the classic instance of the problem,
p = 0.5, n = the 365 days of the year
and the answer is 23. In other words, the odds are 50% that two people share the same birthday when you are surveying 23 people.
You can plug in
n = the number of possible UUIDs
instead to get that kind of cosmically large sample size required for a 50% probability of a collision — something like the billion-per-second figure. It is
n = 16^32
for a 32-character string of 16 case-insensitive hex digits.
B.P. a relatively expensive problem to compute, as there is no known closed-form formula for it. In fact, I just tried it for your 11-character substring (n = 16^11) on Wolfram Alpha Pro, and it timed out.
However, I found an efficient implementation of a closed-form estimate here. And here's my adaptation of the Python.
import math
def find(p, n):
return math.ceil(math.sqrt(2 * n * math.log(1/(1-p))))
If I plug in the classic B.P. numbers, I get an answer of 23, which is right. For the full UUID numbers,
find(.5, math.pow(16, 32)) / 365 / 24 / 60 / 60 / 100
my result is actually close to 7 billion UUID per second for 100 years! Maybe this estimate is too coarse for large numbers, though I don't know what method your source used.
For the 11-character string? You only have to generate about 5 million IDs total to reach the 50% chance of a collision. For 1%, it's only about 600,000 total. And that's probably overestimating safety, compared to your source (and which we are already guilty of by assuming the substring is random).
My engineering advice: Do you really need the guarantees that UUIDs provide aside from uniqueness, such as non-enumerability, and assurance against collisions in a distributed context? If not, then just use a sequential ID, and avoid these complications.

Calculate pseudo random number based on an increasing value

I need to calculate a pseudo random number in a given range (e.g. 0-150) based on another, strictly increasing number. Is there a mathematical way to solve this?
I am given one number x, which increases by 1 every day. Based on this number, I need to - somehow - calculate a number in a given range, which seems to be random.
I have a feeling that there is an easy mathematical solution for this, but sadly I am not able to find it. So any help would be appreciated. Thanks!
One sound way to do that is to hash the number x (either its binary representation or in text form) and then to use the hash to produce the 'random' number in the desired range (say by taking the first 32 bits of the hash and extracting by any known method the desired value). A cryptographic hash can be used like Sha256, but this is not necessary, MurmurHash is possibly a good one for your application.
Normally when you generate a random number, a seed value is used so that the same sequence of psuedorandom numbers isn't repeated. When a seed isn't explicitly given, many systems will use the time as a seed value.
Perhaps you could use x as a seed.
Here's an article explaining seeding: https://www.statisticshowto.com/random-seed-definition/

What is the shortest human-readable hash without collision?

I have a total number of W workers with long worker IDs. They work in groups, with a maximum of M members in each group.
To generate a unique group name for each worker combination, concantating the IDs is not feasible. I am think of doing a MD5() on the flattened sorted worker id list. I am not sure how many digits should I keep for it to be memorable to humans while safe from collision.
Will log( (26+10), W^M ) be enough ? How many redundent chars should I keep ? I there any other specialized hash function that works better for this scenario ?
The total number of combinations of 500 objects taken by up to 10 would be approximately 2.5091E+20, which would fit in 68 bits (about 13 characters in base36), but I don't see an easy algorithm to assign each combination a number. An easier algorithm would be like this: if you assign each person a 9-bit number (0 to 511) and concatenate up to 10 numbers, you would get 90 bits. To encode those in base36, you would need 18 characters.
If you want to use a hash that with just 6 characters in base36 (about 31 bits), the probability of a collision depends on the total number of groups used during the lifetime of the application. If we assume that each day there are 10 new groups (that were not encountered before) and that the application will be used for 10 years, we would get 36500 groups. Using the calculator provided by Nick Barnes shows that there is a 27% chance of a collision in this case. You can adjust the assumptions to your particular situation and then change the hash length to fit your desired maximum chance of a collision.

The hash table probability

I still confuse how to find hash table probability. I have hash table of size 20 with open addressing uses the hash function
hash(int x) = x % 20
How many elements need to be inserted in the hash table so that the probability of the next element hitting a collision exceeds 50%.
I use birthday paradox concerns to find it https://en.wikipedia.org/wiki/Birthday_problem and seems get an incorrect answer. Where is my mistake?
calculating
1/2=1-e^(-n^2/(2*20))
ln(1/2)=ln(e)*(-n^2/40)
-0.69314718=-n^2/40
n=scr(27.725887)=5.265538
How many elements need to be inserted in the hash table so that the probability of the next element hitting a collision exceeds 50%.
Well, it depends on a few things.
The simple case is that you've already performed 11 inserts with distinct and effectively random integer keys, such that 11 of the buckets are in use, and your next insertion uses another distinct and effectively random key so it will hash to any bucket with equal probability: clearly there's only a 9/20 chance of that bucket being unused which means your chance of a collision during that 12th insertion exceeds 50% for the first time. This is the answer most formulas, textbooks, people etc. will give you, as it's the most meaningful for situations where hash tables are used with strong hash functions and/or prime numbers of buckets etc. - the scenarios where hash tables shine and are particularly elegant.
Another not-uncommon scenario is that you're putting say customer ids for a business into the hash table, and you're assigning the customers incrementing id numbers starting at 1. Even if you've already inserted customers with ids 1 to 19, you know they're in buckets [1] to [19] with no collisions - your hash just passes the keys through without the mod kicking in. You can then insert customer 20 into bucket [0] (after the mod operation) without a collision. Then, the 21st customer has 100% chance of a collision. (But, if your data's like this, please use an array and index directly using the customer id, or customer_id - 1 if you don't want to waste bucket [0].)
There are many other possible patterns in the keys that can affect when you exceed a 50% probability of a collision: e.g. all the keys being odd or multiples of some value, or being say ages or heights with a particular distribution curve.
The mistake with your use of the Birthday Paradox is thinking it answers your question. When you put "1/2" and "20" into the formula, it's telling you that the point at which your cumulative probability of a collision reaches 1/2, but your question is "the probability of the next element hitting a collision exceeds 50%" (emphasis mine).

How to check analog values to see if they have varied more than 1V in the last 5 min?

I have an AB PLC where I am trying to read analog values to see if the values vary more than 1V in 5 minutes? I have 10 sets of values I need to read. What would the easiest way to implement this? I can think of creating arrays to save the values each time I read them but the part I am having trouble with is, how to keep a running average of the values and compare against each time I read them.
Any help with this would be greatly appreciated!!
If I understand correctly all you want to do is see if your analog input is more or less than 1V from your set value? Just check if your value is greater than (set value + 1V) or less than (set value - 1V) every plc scan then set a bool value to true. That should be it.
I think finding an average of the analog input is not the way to go for this. But if you did want to find an average of an analog input over time you would need 3 things. Sample time, interval time, and total intervals. You would set up a sample time of, lets say 12 seconds. You will get the analog value every 12 seconds. After 60 seconds you would take the total and divide by (60/12 == 5). You would then add that value to the previous value average value that you totaled up and divide by the total number of intervals times (total intervals) you have accumulated. Hope I didn't make that to complicated.
What i understood from you question is you want check whether input voltage changed or not using the analog value you got, in my case i'm using 0 to 10v. Just simple store the analog value at max input i mean at 10v and just do the same for 0v and you can simply calculate the value for 1v. All you have do is compare the value with +/- 1v value you got from the calculation. you can do this dynamically with n-number of analog inputs(n= max analog inputs supported by your PLC.)
Have a look at FFL and FFU. They are First-In-First-Out buffers. You specify the length of the buffer you want and use FFL and FFU in pairs on the same buffer. Running averages are not that difficult to compute, and there are a number of ways to best implement depending on the platform (SLC vs CLX). The simplest method that would work on both platforms is to use a counter.ACC as a value to indirectly reference the element number of the FIFO for an addition function, then divide by the number of elements in your FIFO. This can all be done in a single multi-branch rung.
1. Load your value into FIFO buffer at some timer interval using FFL.
2. If you don't need the FIFO values 'Popped out' for use elsewhere, just set .POS to 0 when the FIFO is full and let it continue to update with new values, the values aren't cleared so they are still readable for your Running Average. But you MUST either use FFU to step the .POS back or use a MOV function to change the .POS once it's full or it will stop taking values.
3. Create a counter with a .PRE equal to the .LEN of your FIFO
4. On a parallel Rung, with each increment of the counter.ACC use an ADD function. Here's an example assuming CLX. If you're using SLC you can do the same thing but obviously you can't use tag names:
ADD
Value1: AllValues
Value2: FIFO[IndexCounter.ACC]
Destination: AllValues
5. When your counter.DN bit is set, divide AllValues by FIFO.LEN and store in a RunningAverage Tag, then reset the counter. Have your counter step once for each scan or put it all in a Periodic Function to execute the routine.