This question already has answers here:
What Exactly is Hash Collision
(4 answers)
Closed 9 months ago.
I've often heard that hash functions cant be reversed with the analogy of :take a number and add all the digits together, ex: 412 => 7 but you cant get your original value (412) back from 7. While this does make sense wouldnt it also imply that there are multiple inputs that give the same output?
A hash function is a versatile one-way cryptographic algorithm that maps an input of any size to a unique output of a fixed length of bits. The resulting output, which is known as a hash digest, hash value, or hash code, is the resulting unique identifier.
So answering your question: Yes, multiple inputs can give the same output if and only if the inputs are exactly the same. The analogy(take a number and add all the digits together) does not satisfy the part of the definition that states, that even if the pattern of characters/numbers is changed the hash created will be completely different.
Even if something tiny changed in an input — you capitalize a letter instead of using one that’s lowercase, or you swap an exclamation mark where there was a period — it’s going to result in the generation of an entirely new hash value and that’s the whole idea here — no matter how big or small a change, a completely different hash value gets created.
Related
I am generating random OTP-style strings that serve as a short-term identifier to link two otherwise unrelated systems (which have authentication at each end). These need to be read and re-entered by users, so in order to reduce the error rate and reduce the opportunities for forgery, I'd like to make one of the digits a check digit. At present my random string conforms to the pattern (removing I and O to avoid confusion):
^[ABCDEFGHJKLMNPQRSTUVWXYZ][0-9]{4}$
I want to append one extra decimal digit for the check. So far I've implemented this as a BLAKE2 hash (from libsodium) that's converted to decimal and truncated to 1 char. This gives only 10 possibilities for the check digit, which isn't much. My primary objective is to detect single character errors in the input.
This approach kind of works, but it seems that one digit is not enough to detect single char errors, and undetected errors are quite easy to find, for example K37705 and K36705 are both considered valid.
I do not have a time value baked into this OTP; instead it's purely random and I'm relying on keeping a record of the OTPs that have been generated recently for each user, which are deleted periodically, and I'm reducing opportunities for brute-forcing by rate and attempt-count limiting.
I'm guessing that BLAKE2 isn't a good choice here, but given there are only 10 possibilities for the result, I don't know that others will be better. What would be a better algorithm/approach to use?
Frame challenge
Why do you need a check digit?
It doesn't improve security, and a five digits is trivial for most humans to get correct. Check if server side and return an error message if it's wrong.
Normal TOTP tokens are commonly 6 digits, and actors such as google has determined that people in general manage to get them orrect.
This question already has answers here:
How do one-way hash functions work? (Edited)
(7 answers)
Is it possible to decrypt MD5 hashes?
(24 answers)
Closed 8 years ago.
I don't know much about cryptography, but have, of course, come across the magic md5() black box on many occasions. Now, after some searching, I have been able to find many questions along the lines of "Why can't md5 be reversed?" with answers along the lines of "Because the output is non-unique." Given that the password (or whatever value) is evaluated against the hash (not the original), knowing the original isn't really even important, all you would really need to gain access to the user's account is any input that will give that hash. So, my question:
Why can't md5 be reverse engineered to generate an input (any input, not necessarily the original) from an output (the actual function is, after all, publicly available)?
Example:
If I have a hash function that takes a string, assigns each character a number value, and then adds the values, I could easily reverse the function by taking the output, factoring it, and reorganizing the factors into a string (it won't be the original string, but it will give the same hash). So why can't md5 be run backwards?
It is a one way formula. Running MD5 or SHA1 on a particular string gives a hash that is always the same. It isn't possible to reverse the function to get back to the original string.
For example:
15 Mod 4 = 3
Even if you know the formula is
x Mod 4
you can't deduce x as it could be 3, 7, 11, 15 etc...
Obviously MD5 and SHA1 are a lot more complex!
In the above example, imputing 15 will always give you the answer of 3, but nobody would be able to deduce the original number. This does lead nicely on to collisions where multiple input strings could give the same hash:
http://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities
Wikipedia has information on the particular algorithm used:
http://en.wikipedia.org/wiki/MD5#Algorithm
The main thing is that multiple inputs can achieve the same output. Also, because the operations are chained together and each stage of the operation has this same behavior, each time you try to reverse "backward", there are multiple possible answers - many of them. That is the basic characteristic that makes it "one way".
My original Text : "sanjay"
SHA-1 Text : "25ecbcb559d14a98e4665d6830ac5c99991d7c25"
Now how can i get original value - "sanjay" from this hash value ?
is there any code or algorithm or method?
No. That's usually the point -- the process of hashing is normally one-way.
This is especially important for hashes designed for passwords or cryptology -- which differ from hashes designed, for say, hash-maps. Also, with an unbounded input length, there is an infinite amount of values which result in the same hash.
One method that can be used is to hash a bunch of values (e.g. brute-force from aaaaaaaa-zzzzzzz) and see which value has the same hash. If you have found this, you have found "the value" (the time is not cheap). "Rainbow tables" work on this idea (but use space instead of time), but are defeated with a nonce salt.
From what I've been taught on the subject, if you were the one that turned your value into a hash value, chances are you have full access to the hash function, and would be able to reverse it in the same way. If you only have the original value and the end value, and don't know what hash function was used, you can't really reverse it without doing what was said above (going over every possibility).
I am learning MD5. I found a term 'hash' in most description of MD5. I googled 'hash', but I could not find exact term of 'hash' in computer programming.
Why are we using 'hash' in computer programming? What is origin of the word??
I would say any answer must be guesswork, so I will make this a community wiki.
Hash, or hash browns, is breakfast food made from cutting potatoes into long thin strips (smaller than french fries, and shorter, but proportionally similar), then frying the mass of strips in animal or vegetable fat until browned, stuck together, and cooked. By analogy, 'hashing' a number meant turning it into another, usually smaller, number using a method which still deterministically depending on the input number.
I believe the term "hash" was first used in the context of "hash table", which was commonly used in the 1960's on mainframe-type machines. In these cases, usually an integer value with a large range is converted to a "hash table index" which is a small integer. It is important for an efficient hash table that the "hash function" be evenly distributed, or "flat."
I don't have a citation, that is how I have understood the analogy since I heard it in the 80's. Someone must have been there when the term was first applied, though.
A hash value (or simply hash), also
called a message digest, is a number
generated from a string of text. The
hash is substantially smaller than the
text itself, and is generated by a
formula in such a way that it is
extremely unlikely that some other
text will produce the same hash value.
You're refering to the "hash function". It is used to generate a unique value for a given set of parameters.
One great use of a hash is password security. Instead of saving a password in a database, you save a hash of the password.
A hash is supposed to be a unique combination of values from 00 to FF (hexadecimal) that represents a certain piece of data, be it a file or a string of bytes. It is used primarily for password storage and verification, and to test if a file is the same as another file (i.e. you hash two files, if they match, they're the same file).
Generally, any of the SHA algorithms are preferred over MD5, due to hash collisions that can occur when using it. See this Wikipedia article.
According to the Wikipedia article on hash functions, Donald Knuth in the Art of Computer Programming was able to trace the concept of hash functions back to an internal IBM memo by Hans Peter Luhn in 1953.
And just for fun, here's a scrap of overheard conversation quoted in Two Women in the Klondike: the Story of a Journey to the Gold Fields of Alaska (1899):
They'll have to keep the hash table going all day long to feed us. 'T will be a short order affair.
the hash function hashes input to a value, requires a salt value and no proof salt is needed to store. Indications are everybody says we must store the salt same time match and new still work. Mathematically related concept is bijection
adding to gabriel1836's answer, one of the important properties of hash function is that it is a one way function, which means you cannot generate the original string from its hash value.
A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. Guid's were out of the question, they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to less than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32bit id from the database. We then inserted it into the center bits of a 64bit RANDOM integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9 skipping easily confused characters such as L,l,1,O,0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so obscure words are removed and that words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list and concatenate them and add a random 3 digits number.
I can also randomize word selection pattern between verb/nouns like
eatCake778
pickBasket524
rideFlyer113
etc..
the case needn't be camel casing, you can randomize that as well. You can also randomize the placement of the number and the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to make sure that my algorithms should never collide. If the collision rate was high, then I'd play with the parameters (amount of nouns used, amount of verbs used, length of random number, total number of words, different kinds of casings etc.)
In .NET you can use the RNGCryptoServiceProvider method GetBytes() which will "fill an array of bytes with a cryptographically strong sequence of random values" (from ms documentation).
byte[] randomBytes = new byte[4];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
rng.GetBytes(randomBytes);
You can increase the lengh of the byte array and pluck out the character values you want to allow.
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user friendly, you mean that a user could type the answer in then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error prone string.
If though you're looking for a way to encode a random code in the URL string which is an issue I've dealt with for awhile then I what I have done is use 64-bit encoded GUIDs.
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.