1) I am going to be receiving sets of unique user ids, with between 1 and 300 entries in each set.
2) Each unique user id is 6-10 characters long, and all lower case a-z ASCII
3) I need to generate a unique identifier for each set, max length 256 characters in the space
4) That unique identifier has, AFAIK, no bounds beyond "valid ASCII".
5) If two different sets generate the same unique group id, people in two unrelated groups will be able to see each other's information and I get to explain to Compliance why I thought that was a good idea. Let's call this the hashcode problem.
6) If two identical sets don't produce the same id, a bunch of information will suddenly vanish for those users, and I will get to explain to Business why I thought that was a good idea. Let's call this the random guid problem.
Is there an identity hash that will meet my needs? Something like SHA-256 seems like it would be "safe enough", but it feels like the limited input space and the fact that there's no problem with the hash being reversible (compression not cryptography), would mean there's a compression function that could do the job.
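A minimal sketch of the SHA-256 route, assuming the ids arrive as plain Python strings. The function name `set_id` and the comma separator are illustrative choices, not part of the question. Sorting and de-duplicating first makes identical sets produce identical output (the "random guid problem"), while the digest keeps collisions between different sets computationally negligible, though not mathematically impossible (the "hashcode problem"). A truly lossless encoding is not available at the top end, since 300 ids of 6 to 10 letters carry far more information than 256 ASCII characters can hold.

```python
import hashlib


def set_id(user_ids):
    """Return a 64-character hex identifier for a set of user ids."""
    # Sort and de-duplicate so the same set always serializes the same way.
    canonical = ",".join(sorted(set(user_ids)))
    # The comma cannot occur inside an a-z id, so the joined string is unambiguous.
    return hashlib.sha256(canonical.encode("ascii")).hexdigest()
```

The result is 64 lowercase hex characters, well inside the 256-character limit; input order does not matter.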
Related
I am planning to implement my own very simple "hashing" formula to add a layer of security to an app with multiple users. My current plan is as follows:
User creates an account at which point an ID is generated on the backend. The ID is run through a formula (let's say ID * 57 + 8926 - 36 * 7, or something equally random). I then send back to the frontend the new user ID and the new "hashed" number and store them in localStorage.
User tries to access a secured area (let's say a settings page so they can change their own settings).
I send the backend two values: their ID and the hashed number. I run the ID through the same formula to check it matches the hashed value I've received. If the check passes, they can get in. So if someone has tried, say, changing their ID in localStorage to get access to another user's settings page, the only way they could achieve that is if they guess what the formula was. They could easily guess a user ID, but guessing that the corresponding number is the result of ID * 57 + 8926 - 36 * 7 seems pretty unlikely.
I'm doing this because it would be quicker/cheaper than a db lookup for an actual hashed value... I think? Would it make more sense to use a package to create some kind of primary key/uuid instead of "hashing" my own value and doing a db lookup each time?
Tech stack: React on FE, Python on BE, SQL db.
I see a lot of posts saying "don't roll your own" -- is this absolute?
Yes, it is. Whenever a non-cryptographer tries their hand at developing their own algorithms, they invariably fall into a multitude of pitfalls that render the algorithm's security next to useless.
Your particular scheme, for example, can be trivially broken given two consecutive ID and "hash" pairs. (It's a simple arithmetic sequence; deriving the formula of an arithmetic sequence from two consecutive values is roughly grade 6 math.)
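To make the attack concrete, here is a small sketch of how an attacker could recover the formula from two observed pairs. `secret_formula` reproduces the formula from the question; `recover_linear` is a hypothetical helper name, and it assumes the attacker already suspects the mapping is linear.

```python
def secret_formula(user_id):
    # The "hash" from the question: a straight line in disguise.
    return user_id * 57 + 8926 - 36 * 7


def recover_linear(pair_a, pair_b):
    """Recover (slope, intercept) of a linear formula from two (id, hash) pairs."""
    (id_a, hash_a), (id_b, hash_b) = pair_a, pair_b
    slope = (hash_b - hash_a) // (id_b - id_a)
    intercept = hash_a - slope * id_a
    return slope, intercept


# An attacker who owns two accounts sees two pairs and can forge any other.
slope, intercept = recover_linear((101, secret_formula(101)),
                                  (102, secret_formula(102)))
forged = slope * 500 + intercept  # matches secret_formula(500)
```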
I'm doing this because it would be quicker/cheaper than a db lookup for an actual hashed value...
The performance difference would probably be negligible. Don't worry about it.
If the information is not particularly sensitive, just assign each user a randomly generated 128 bit number. The chances of someone guessing a valid user's number are practically zero.
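A minimal sketch of that suggestion, assuming a Python backend as in the question. `new_user_token` and `tokens_match` are hypothetical names; the server would still need to store the token next to the user record and look it up on each request.

```python
import hmac
import secrets


def new_user_token():
    """Return a random 128-bit token as 32 hex characters."""
    return secrets.token_hex(16)  # 16 bytes = 128 bits from the OS CSPRNG


def tokens_match(stored_token, presented_token):
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(stored_token, presented_token)
```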
Two properties of real hashes that you are missing with this are:
a simple change in the input causes a large change in the output
all hashes have the same length
This could be a problem if a user somehow knows their own ID and hash.
With your self-made hash I could easily find out the hash of any other user by reverse engineering the formula.
I'd like to be able to create an algorithm that generates a 6 character confirmation code (e.g. A1JU2Z) that will be unique for a given (user, code) pair. The reason is, I'd like to keep the code at 6 characters, but using a trimmed set of alphanumerics (to avoid confusion with 1 and I, etc) only allows for ~300 million codes before collisions occur. Sure I may never need 300 million codes, but if I do, it will be a huge pain to go back and fix this.
So is there a way to use something about the user, say their username, to generate unique codes, such that if the same user wants to generate another code, it's guaranteed to be unique for them? (This of course assumes a single user doesn't generate over 300 million codes.)
Thanks!
If the ID only needs to be unique for the current user, you can just generate each character of the ID randomly. As long as the user is not expected to generate a large number of such IDs, you will have a reasonable chance of not generating the same ID more than once (you need to do some math to get exact numbers for the expected chance of collision as the number of generated IDs grows).
If you must avoid collisions at all costs, you need to either keep all previously generated IDs and compare each new one against them, or keep a count of the generated IDs (this requires a scheme where the ID generation is deterministic based on the count, but also unique -- a very simple case would be {ID=count; ++count;})
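The "some math" is the birthday problem. Below is a rough sketch, assuming codes are drawn uniformly and independently; `collision_probability` is an illustrative name, and the 300 million figure is the code space quoted in the question.

```python
def collision_probability(n_codes, space_size):
    """Chance that at least two of n_codes uniform random codes coincide."""
    if n_codes > space_size:
        return 1.0  # pigeonhole: a repeat is certain
    p_all_unique = 1.0
    for already_taken in range(n_codes):
        # The next code must avoid every code generated so far.
        p_all_unique *= (space_size - already_taken) / space_size
    return 1.0 - p_all_unique


# Example: one user generating 1,000 codes out of about 300 million.
p = collision_probability(1_000, 300_000_000)
```

The probability grows roughly with the square of the number of codes, so a per-user space stays safe far longer than a single global one.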
I think you can use a simple password generator like this : http://www.webtoolkit.info/php-random-password-generator.html
in combination with a check algorithm to be sure it is not already used.
// Keep generating until we get a code that is not already in the database.
do {
    $pass = generate_password();
} while (find_password($pass));
// Store the unused code.
save_password($user, $code, $pass);
generate_password() is the function referred to in the link.
find_password() is a function you have to write to check already generated codes in a database.
save_password() is a function you have to write to store the generated code in a database.
The code is in PHP, but the logic is here.
The password generator in the link is easy to understand, you can get 6 chars long, with the character rules you want.
How can the Google URL shortener generate a unique hash with five characters without collisions? It seems like there are bound to be collisions, where different URLs generate the same hash.
stackoverflow.com => http://goo.gl/LQysz
What's also interesting is that the same URL generates a completely different hash each time:
stackoverflow.com => http://goo.gl/Dl7sz
So, doing some math, using lower-case characters, upper-case characters, and digits, the total number of combinations is 62^5 = 916,132,832, so collisions are clearly bound to happen.
How does Google do this?
They have a database which tracks all previously generated URLs and the longer URL that each of those maps to. Easy to make sure that newly generated URLs don't already exist in that table. A little tricky to scale out (they surely have multiple servers so each one needs to be assigned a bucket of values from which it can give out to users). If they ever reach the point of having generated 916,132,832 URLs, they'll just add another character.
They have a hash table with hash to url.
Count the number of rows in that table and encrypt it with a stream cipher then encode with base62.
Using a stream cipher instead of a hash will give you a short pseudo random output that doesn't collide with any previous output so you don't need to check the table.
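The idea of turning a row count into a short, scrambled, collision-free code can be sketched as follows. This is only an illustration: instead of a real cipher it uses a multiplicative permutation modulo 62^5, which is one-to-one but not cryptographically secure, so anyone who learns the multiplier can predict the sequence. `MULTIPLIER` and `short_code` are assumed names.

```python
import string
from math import gcd

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase
CODE_LENGTH = 5
SPACE = len(ALPHABET) ** CODE_LENGTH  # 62**5 possible codes
MULTIPLIER = 3 ** 18                  # odd and not a multiple of 31

# The mapping is a bijection only if the multiplier is coprime with SPACE.
assert gcd(MULTIPLIER, SPACE) == 1


def short_code(counter):
    """Map a row counter to a distinct, scrambled five-character code."""
    if not 0 <= counter < SPACE:
        raise ValueError("counter outside the five-character code space")
    value = (counter * MULTIPLIER) % SPACE  # permutes 0..SPACE-1
    chars = []
    for _ in range(CODE_LENGTH):
        value, digit = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))
```

Because distinct counters map to distinct codes, no lookup is needed to rule out collisions until the counter reaches the size of the code space.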
It keeps track of previously used long URLs. This means that, when someone goes to create a short URL, if the place they are pointing to already has a short URL, it will just give them the pre-existing short URL.
Actually, it would be inefficient to have a system dedicated to creating 'hashes' based on a given set of data. Rather, the short URL is simply a random string of five characters drawn from ten digits, 26 lowercase letters and 26 uppercase letters, which gives 62^5 = 916,132,832 permutations (not combinations). Random short URLs are the most efficient way to make it work, and that is why they are always different (though I suppose there could be some other component in the algorithm like the time of day, but I don't think it's worth it....there's no point in making it that complex; spending all of that processing power just to make a silly 5 character string which any monkey could do by pressing a button the right way on a permutation calculator).
We would like to give each of our users an alias so that we can refer to them in discussions while protecting their identity. These aliases should be unique.
The easy way would be to simply use a SERIAL column, but ints aren't memorable. We would like to use real people names so that we can remember the aliases.
The other easy way would be to find a list of first names somewhere, number them, and use a SERIAL to fetch names from the list. When the list runs out, add more names.
But is there some clever way to map ints to names?
We currently have about 2,000 users and are growing, but I doubt we'll ever become Google.
It may sound crazy. But there is an algorithm used in game programming to create meaningless but phonetically unique names like Alveolar, Bilabial, Glottal, Palatal, Velar.
Pick a random name from the Census Bureau's names file.
Have you tried any hash functions? I am not sure whether they are available in Postgres. But yeah, one way to do it is to let the internal hash function take care of it. The output will very likely be unique, although a hash cannot strictly guarantee it, so you would still want a uniqueness check.
Back in "the day" Compuserve (or was it AOL?) used to give out temporary, initial passwords by having two lists of words and taking one word from each list and putting it together, so you would get something like EasyTomato or whatever. Perhaps something like that would work for your user base. If each word list has 256 words, that's 65,536 unique combinations (and notice how easily you can pick the combination by just incrementing a 16-bit integer).
EDIT: Well don't do a straight increment of the integer after all, or the first 256 people will all get the same first word, but the basic idea is still sound. Pick a random, not-yet-used 16-bit number. High 8 bits are your index into the first word list, low 8 bits are your index into the second word list.
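A small sketch of the two-list scheme, under the assumption that each list holds exactly 256 entries. The word lists here are generated placeholders (`Adj0`, `Noun0`, ...) standing in for real words, and `used_numbers` stands in for whatever table records the numbers already handed out.

```python
import secrets

# Placeholder lists; a real deployment would load 256 actual words into each.
FIRST_WORDS = [f"Adj{i}" for i in range(256)]
SECOND_WORDS = [f"Noun{i}" for i in range(256)]


def alias_from_number(number):
    """Split a 16-bit number into two 8-bit list indexes."""
    if not 0 <= number < 65536:
        raise ValueError("number must fit in 16 bits")
    return FIRST_WORDS[number >> 8] + SECOND_WORDS[number & 0xFF]


def new_alias(used_numbers):
    """Pick a random, not-yet-used 16-bit number and return its alias."""
    if len(used_numbers) >= 65536:
        raise RuntimeError("all aliases are taken")
    while True:
        number = secrets.randbelow(65536)
        if number not in used_numbers:
            used_numbers.add(number)
            return alias_from_number(number)
```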
A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question; they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to fewer than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32bit id from the database. We then inserted it into the center bits of a 64bit RANDOM integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9 skipping easily confused characters such as L,l,1,O,0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
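The scheme described above might look roughly like this. The exact bit layout and character table are not given in the post, so the 16/32/16 split and the 54-character `ALPHABET` below are assumptions, and the resulting code length will not necessarily match the nine-character example.

```python
import secrets

# Assumed table: A-Z and a-z without I, L, O (either case), plus digits 2-9.
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"
BASE = len(ALPHABET)  # 54


def encode(value):
    """Base-54 encode a non-negative integer."""
    if value == 0:
        return ALPHABET[0]
    chars = []
    while value:
        value, digit = divmod(value, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def decode(code):
    """Inverse of encode()."""
    value = 0
    for ch in code:
        value = value * BASE + ALPHABET.index(ch)
    return value


def make_code(sequence_id):
    """Place a 32-bit sequential id in the middle of a random 64-bit value."""
    if not 0 <= sequence_id < 2 ** 32:
        raise ValueError("sequence_id must fit in 32 bits")
    high = secrets.randbits(16)  # random top 16 bits
    low = secrets.randbits(16)   # random bottom 16 bits
    return encode((high << 48) | (sequence_id << 16) | low)


def extract_sequence_id(code):
    """Recover the sequential id hidden in the centre bits."""
    return (decode(code) >> 16) & 0xFFFFFFFF
```

Uniqueness comes entirely from the sequential id; the random bits only disguise it.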
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so obscure words are removed and that words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list and concatenate them and add a random 3 digits number.
I can also randomize word selection pattern between verb/nouns like
eatCake778
pickBasket524
rideFlyer113
etc..
the case needn't be camel casing, you can randomize that as well. You can also randomize the placement of the number and the verb/noun.
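A tiny sketch of the verb + noun + three-digit pattern. The word lists are placeholders taken from the examples above, and `friendly_code` is an illustrative name; a real list would come from the filtered dictionary described earlier.

```python
import secrets

# Placeholder word lists; a real system would load filtered dictionary words.
VERBS = ["eat", "pick", "ride"]
NOUNS = ["Cake", "Basket", "Flyer"]


def friendly_code():
    """Return a code such as verb + Noun + three digits."""
    verb = secrets.choice(VERBS)
    noun = secrets.choice(NOUNS)
    number = secrets.randbelow(1000)  # 0..999, zero-padded below
    return f"{verb}{noun}{number:03d}"
```

With lists this small the code space is tiny, so uniqueness would still need a lookup against previously issued codes.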
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to measure how often my algorithm produces collisions. If the collision rate was high, then I'd play with the parameters (number of nouns used, number of verbs used, length of the random number, total number of words, different kinds of casing, etc.)
In .NET you can use the RNGCryptoServiceProvider method GetBytes() which will "fill an array of bytes with a cryptographically strong sequence of random values" (from ms documentation).
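using System.Security.Cryptography;

byte[] randomBytes = new byte[4];
// The provider is IDisposable, so release it once the bytes are filled.
using (RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider())
    rng.GetBytes(randomBytes);
You can increase the length of the byte array and pluck out the character values you want to allow.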
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user friendly, you mean that a user could type the answer in then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error prone string.
If, though, you're looking for a way to encode a random code in the URL string, which is an issue I've dealt with for a while, then what I have done is use Base64-encoded GUIDs.
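For illustration, a GUID can be shortened for use in a URL like this. It is a sketch using Python's standard library rather than the poster's own code; the URL-safe Base64 alphabet and the stripped padding are assumptions about what suits a URL.

```python
import base64
import uuid


def guid_to_token(guid):
    """Encode a GUID's 16 bytes as a 22-character URL-safe string."""
    encoded = base64.urlsafe_b64encode(guid.bytes)  # 24 chars, ends in "=="
    return encoded.rstrip(b"=").decode("ascii")


def token_to_guid(token):
    """Restore the padding and decode back to the original GUID."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(token + "=="))


token = guid_to_token(uuid.uuid4())
```

This trims the usual 36-character GUID text form to 22 characters without losing any bits.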
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.