Picking a check digit algorithm - hash

I am generating random OTP-style strings that serve as a short-term identifier to link two otherwise unrelated systems (which have authentication at each end). These need to be read and re-entered by users, so in order to reduce the error rate and reduce the opportunities for forgery, I'd like to make one of the digits a check digit. At present my random string conforms to the pattern (removing I and O to avoid confusion):
^[ABCDEFGHJKLMNPQRSTUVWXYZ][0-9]{4}$
I want to append one extra decimal digit for the check. So far I've implemented this as a BLAKE2 hash (from libsodium) that's converted to decimal and truncated to 1 char. This gives only 10 possibilities for the check digit, which isn't much. My primary objective is to detect single character errors in the input.
This approach kind of works, but it seems that one digit is not enough to detect single char errors, and undetected errors are quite easy to find, for example K37705 and K36705 are both considered valid.
I do not have a time value baked into this OTP; instead it's purely random and I'm relying on keeping a record of the OTPs that have been generated recently for each user, which are deleted periodically, and I'm reducing opportunities for brute-forcing by rate and attempt-count limiting.
I'm guessing that BLAKE2 isn't a good choice here, but given there are only 10 possibilities for the result, I don't know that others will be better. What would be a better algorithm/approach to use?

Frame challenge
Why do you need a check digit?
It doesn't improve security, and a five digits is trivial for most humans to get correct. Check if server side and return an error message if it's wrong.
Normal TOTP tokens are commonly 6 digits, and actors such as google has determined that people in general manage to get them orrect.

Related

consistent hashing on Multiple machines

I've read the article: http://n00tc0d3r.blogspot.com/ about the idea for consistent hashing, but I'm confused about the method on multiple machines.
The basic process is:
Insert
Hash an input long url into a single integer;
Locate a server on the ring and store the key--longUrl on the server;
Compute the shorten url using base conversion (from 10-base to 62-base) and return it to the user.(How does this step work? In a single machine, there is a auto-increased id to calculate for shorten url, but what is the value to calculate for shorten url on multiple machines? There is no auto-increased id.)
Retrieve
Convert the shorten url back to the key using base conversion (from 62-base to 10-base);
Locate the server containing that key and return the longUrl. (And how can we locate the server containing the key?)
I don't see any clear answer on that page for how the author intended it. I think this is basically an exercise for the reader. Here's some ideas:
Implement it as described, with hash-table style collision resolution. That is, when creating the URL, if it already matches something, deal with that in some way. Rehashing or arithmetic transformation (eg, add 1) are both possibilities. This means, naively, a theoretical worst case of having to hit a server n times trying to find an available key.
There's a lot of ways to take that basic idea and smarten it, eg, just search for another available key on the same server, eg, by rehashing iteratively until you find one that's on the server.
Allow servers to talk to each other, and coordinate on the autoincrement id.
This is probably not a great solution, but it might work well in some situations: give each server (or set of servers) separate namespace, eg, the first 16 bits selects a server. On creation, randomly choose one. Then you just need to figure out how you want that namespace to map. The namespaces only really matter for who is allowed to create what IDs, so if you want to add nodes or rebalance later, it is no big deal.
Let me know if you want more elaboration. I think there's a lot of ways that this one could go. It is annoying that the author didn't elaborate on this point; my experience with these sorts of algorithms is that collision resolution and similar problems tend to be at the very heart of a practical implementation of a distributed system.

How Does The Google URL Shortener Generate A 5 Digit Hash Without Collisions

How can the Google URL shortener generate a unique hash with five characters without collisions. Seems like there are bound to be collisions, where different urls generate the same hash.
stackoverflow.com => http://goo.gl/LQysz
What's also interesting, is the same URL, generates a completely different hash each time:
stackoverflow.com => http://goo.gl/Dl7sz
So, doing some math, using lower-case characters, upper-case characters, and digits, the total number of combinations are 62^5 = 916,132,832 clearly collisions bound to happen.
How does Google do this?
They have a database which tracks all previously generated URLs and the longer URL that each of those maps to. Easy to make sure that newly generated URLs don't already exist in that table. A little tricky to scale out (they surely have multiple servers so each one needs to be assigned a bucket of values from which it can give out to users). If they ever reach the point of having generated 916,132,832 URLs, they'll just add another character.
They have a hash table with hash to url.
Count the number of rows in that table and encrypt it with a stream cipher then encode with base62.
Using a stream cipher instead of a hash will give you a short pseudo random output that doesn't collide with any previous output so you don't need to check the table.
It keeps track of previously used long URLs. This means that, when someone goes to create a short URL, if the place they are pointing to already has a short URL, it will just give them the pre-existing short URL.
Actually, it would be inefficient to have a system dedicated to creating 'hashes' based on a given set of data. Rather, the short URL is simply a random set of characters which has already been identified as ten digits, plus 26 lowercase letters, plus 26 uppercase letters = 916132832 permutations (not combinations). Random short URLs is the most efficient way to make it work, and that is why they are always different (though I suppose there could be some other component in the algorithm like the time of day, but I don't think it's worth it....there's no point in making it that complex; spending all of that processing power just to make a silly 5 character string which any monkey could do by pressing a button the right way on a permutation calculator).

hashing strings

I have streaming strings (text containing words and number).
Taking one line at a time for streaming strings, I would like to assign a unique value to them.
the examples may be:strings with their scores/hash
User1 logged in Comp1 port8087 1109
User2 logged in comp2 1135
user3 logged in port8080 1098
user1 logged in comp2 port8080 1178
these string should be in same cluster. For this what i have thought is mapping(bad type of hashing) the strings such that the small change in the string wont affect the score that much.
One simple way of doing that may be: taking UliCp8, Ulic .... ( i.e. 1st letter of each sentence) and find some way of scoring. After then the similar scored strings are kept in same bucket and later on sub group them.
The improved method would be: lets not take out first word of each word of the string but find some way to take representative value of the word such that the string representation may be quite suitable for mapping with score/hash as i mention.
Considering Levenstein distance or jaccard_index or some similarity distance metrices, all of them require inputting the strings for comparisions. Isn't there any method to hash/score the string as stated without going for comparisions.( POS tagging, comparing looks uneffiecient for my purpose as the data are streaming, huge in number, unstructured)
Hope you understand what i want to achieve and please help me out. Forgot about the comments below and lets restart.
"at least two similar word (not considering length) should have similar hash value"
This is against the most basic requirements for a hash function. With a hash function also minimal changes to the input should produce vehement changes to the bucket the hash falls into.
You are looking for an algorithm that calculates the similarity or distance between two inputs.
As stated you are not looking for a hash function, rather something like the Levenshtein distance which is an algorithm for calculating a metric representing the degree of differences between two sequences of bits. It is commonly used to find out how similar/dissimilar two strings are. Hashing / message digests are good for creating identifiers for unique, distinct values but they will produce entirely different results for "similar" values.
You are interested in the similarity of strings. Here is a nice post that names a few resources that are used for measuring string similarity. Maybe Lucene could help you in your situation.

Chance on generating 2 equal hashes based on time()?

Question is pretty simple, but I couldn't find an answer for this one... Basicly, my application is generating filenames with md5(time());.
What are the chances, if any, that using this technique, I'll have 2 equal results?
P.S. Since my question title says hashes not exact hash, what are the chances, if any, again, of generating equal results for each type of hashes sha1();, sha512(); etc.?
Thanks in advance!
My estimation is it is unsafe due to possible changes in time by humans and other processes such as NTP which FrankH has kindly noted. I highly recommend using a cryptographically secure RNG (random number generator) if your framework allows.
Equal results are unlikely to result from this, you can simply validate that yourself by checking the uniqueness of md5(0) ... md5(INT32_MAX) since that's the total range of a time_t. I don't think there are collisions in that input space for any of the hashes you've named.
Predictable results is another matter, though. By choosing time() as you input supplier, you restrict yourself to, well, one unique hash per second, no more than 86400 per day, ...

How can I generate a unique, small, random, and user-friendly key?

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. Guid's were out of the question, they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to less than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32bit id from the database. We then inserted it into the center bits of a 64bit RANDOM integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9 skipping easily confused characters such as L,l,1,O,0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so obscure words are removed and that words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list and concatenate them and add a random 3 digits number.
I can also randomize word selection pattern between verb/nouns like
eatCake778
pickBasket524
rideFlyer113
etc..
the case needn't be camel casing, you can randomize that as well. You can also randomize the placement of the number and the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to make sure that my algorithms should never collide. If the collision rate was high, then I'd play with the parameters (amount of nouns used, amount of verbs used, length of random number, total number of words, different kinds of casings etc.)
In .NET you can use the RNGCryptoServiceProvider method GetBytes() which will "fill an array of bytes with a cryptographically strong sequence of random values" (from ms documentation).
byte[] randomBytes = new byte[4];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
rng.GetBytes(randomBytes);
You can increase the lengh of the byte array and pluck out the character values you want to allow.
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user friendly, you mean that a user could type the answer in then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error prone string.
If though you're looking for a way to encode a random code in the URL string which is an issue I've dealt with for awhile then I what I have done is use 64-bit encoded GUIDs.
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.