This question already has answers here:
How do one-way hash functions work? (Edited)
(7 answers)
Is it possible to decrypt MD5 hashes?
(24 answers)
Closed 8 years ago.
I don't know much about cryptography, but have, of course, come across the magic md5() black box on many occasions. Now, after some searching, I have been able to find many questions along the lines of "Why can't md5 be reversed?" with answers along the lines of "Because the output is non-unique." Given that the password (or whatever value) is evaluated against the hash (not the original), knowing the original isn't really even important, all you would really need to gain access to the user's account is any input that will give that hash. So, my question:
Why can't md5 be reverse engineered to generate an input (any input, not necessarily the original) from an output (the actual function is, after all, publicly available)?
Example:
If I have a hash function that takes a string, assigns each character a number value, and then adds the values, I could easily reverse the function by taking the output, factoring it, and reorganizing the factors into a string (it won't be the original string, but it will give the same hash). So why can't md5 be run backwards?
It is a one way formula. Running MD5 or SHA1 on a particular string gives a hash that is always the same. It isn't possible to reverse the function to get back to the original string.
For example:
15 Mod 4 = 3
Even if you know the formula is
x Mod 4
you can't deduce x as it could be 3, 7, 11, 15 etc...
Obviously MD5 and SHA1 are a lot more complex!
In the above example, imputing 15 will always give you the answer of 3, but nobody would be able to deduce the original number. This does lead nicely on to collisions where multiple input strings could give the same hash:
http://en.wikipedia.org/wiki/MD5#Collision_vulnerabilities
Wikipedia has information on the particular algorithm used:
http://en.wikipedia.org/wiki/MD5#Algorithm
The main thing is that multiple inputs can achieve the same output. Also, because the operations are chained together and each stage of the operation has this same behavior, each time you try to reverse "backward", there are multiple possible answers - many of them. That is the basic characteristic that makes it "one way".
Related
This question already has answers here:
What Exactly is Hash Collision
(4 answers)
Closed 9 months ago.
I've often heard that hash functions cant be reversed with the analogy of :take a number and add all the digits together, ex: 412 => 7 but you cant get your original value (412) back from 7. While this does make sense wouldnt it also imply that there are multiple inputs that give the same output?
A hash function is a versatile one-way cryptographic algorithm that maps an input of any size to a unique output of a fixed length of bits. The resulting output, which is known as a hash digest, hash value, or hash code, is the resulting unique identifier.
So answering your question: Yes, multiple inputs can give the same output if and only if the inputs are exactly the same. The analogy(take a number and add all the digits together) does not satisfy the part of the definition that states, that even if the pattern of characters/numbers is changed the hash created will be completely different.
Even if something tiny changed in an input — you capitalize a letter instead of using one that’s lowercase, or you swap an exclamation mark where there was a period — it’s going to result in the generation of an entirely new hash value and that’s the whole idea here — no matter how big or small a change, a completely different hash value gets created.
I am generating random OTP-style strings that serve as a short-term identifier to link two otherwise unrelated systems (which have authentication at each end). These need to be read and re-entered by users, so in order to reduce the error rate and reduce the opportunities for forgery, I'd like to make one of the digits a check digit. At present my random string conforms to the pattern (removing I and O to avoid confusion):
^[ABCDEFGHJKLMNPQRSTUVWXYZ][0-9]{4}$
I want to append one extra decimal digit for the check. So far I've implemented this as a BLAKE2 hash (from libsodium) that's converted to decimal and truncated to 1 char. This gives only 10 possibilities for the check digit, which isn't much. My primary objective is to detect single character errors in the input.
This approach kind of works, but it seems that one digit is not enough to detect single char errors, and undetected errors are quite easy to find, for example K37705 and K36705 are both considered valid.
I do not have a time value baked into this OTP; instead it's purely random and I'm relying on keeping a record of the OTPs that have been generated recently for each user, which are deleted periodically, and I'm reducing opportunities for brute-forcing by rate and attempt-count limiting.
I'm guessing that BLAKE2 isn't a good choice here, but given there are only 10 possibilities for the result, I don't know that others will be better. What would be a better algorithm/approach to use?
Frame challenge
Why do you need a check digit?
It doesn't improve security, and a five digits is trivial for most humans to get correct. Check if server side and return an error message if it's wrong.
Normal TOTP tokens are commonly 6 digits, and actors such as google has determined that people in general manage to get them orrect.
I'm trying to store user passwords in my DB using Argon2 algorithm.
This is what I obtain by using it:
$echo -n "password" | argon2 "smallsalt" -id -t 4 -m 18 -p 4
Type: Argon2id
Iterations: 4
Memory: 262144 KiB
Parallelism: 4
Hash: cb4447d91dd62b085a555e13ebcc6f04f4c666388606b2c401ddf803055f54ac
Encoded: $argon2id$v=19$m=262144,t=4,p=4$c21hbGxzYWx0$y0RH2R3WKwhaVV4T68xvBPTGZjiGBrLEAd34AwVfVKw
1.486 seconds
Verification ok
In this case, what should I store in the DB?
The "encoded" value as shown above?
The "hash" value as shown above?
Neither, but another solution?
Please, could you help me? I'm a newbie with this and I'm a little bit lost.
I'm a bit late to the party, but I disagree with the previous answers.
You should store the field: Encoded
The $argon2id$.... value.
(At least if you are using normal Argon2 libraries having the verify() function.
It does not look like the man-page for argon2 command does this, however.
Only if you are stuck with the command line, you should consider storing each field individually.)
The $argon2id$ encoded hash
The argon2 encoded hash follows the same as its older cousin bcrypt's syntax.
The encoded hash includes all you ever need to verify the hash when the user logs in.
It is most likely more future proof. When a newer and better argon2 comes along: You can upgrade your one column hashed passwords. Just like you could detect bcrypt's $2a$-hashes, and re-hash them as $argon2id$-hashes, next time the user logs in. (If you were moving from bcrypt to agron2.)
TL;DR
Store the $-encoded value encoded_hash in your database.
Use argon2.verify(password, encoded_hash) to verify that the password is correct.
Don't bother about all the values inside the hash. Let the library do that for you. :)
Neither. Save following as a single value:
algorithm ID (e.g. argon2id)
salt
number of iterations (4)
memory usage factor (18)
parallelism (4)
The output of the field "encoded" is misleading because you cannot use it as is for password check (i.e. for hash generation), e.g. m=262144 where as for password check you need the original factor m=18.
Are you going to launch an OS process each time you check password? I would discourage you from doing this. I'd suggest you use a library (C++, Java, ...). They produce a string that contains all these data concatenated and separated with "$".
I'd put the type, iterations, memory, parallelism, hash, salt and corresponding user id into separate columns and leave the encoded bit out, because it's just all the attributes joined together. If they're in separate columns then you can reference the attributes more easily than having to split and index the encoded string.
The other option is to just store the encoded string in 1 column, but as I said its more tedious to look at certain attributes, as you'd have to split the encoded string and then index it.
I had the same question and read this post while gathering some information. Now after some days and thoughts about all this, I'll personally take a different route than the accepted answer and therefore slightly disagree with it. I thought I would share my perspective so that it might help others as well.
I suppose it will depend on everyone's context. I don't think there is a one size fits all answer here. I'm sure there are situations where it is perfectly valid and even better/simpler to store the encoded string ($argon2...).
However, I would argue that depending on the context, storing the encoded string doesn't seem to be the right approach.
First of all, it makes the hashing method very obvious. It is probably not that important but for some reasons it makes me a bit more comfortable not having it ^^. But, more importantly, it means that implementation details are stored in your persistence layer (db or else). At the time of writing, argon2id is the recommended hashing mechanism by OWASP but these things can change (eventually do change...). Some day, it might be considered unsecure, or another function will be considered more secure.
As a result, I would suggest this more function "agnostic" starting point:
The hash (for argon2 -> the hex string)
The salt
The last_modified date
A string with hashing parameters (for argon2, you could put the parameters here in the form of your choosing)
The last_modified allows to know if the hash needs updating or not and the parameters allows to support the verification and update of "old" hashes.
Of course this means that you have to work a bit more in the code and can't simply use every libraries shortcuts straight away. But, I would say that this increased complexity offer more flexibility in other circumstances (like moving away from a given hashing function). As always there are no free lunch.
That's why I suppose it depends on your context and why personally I wouldn't go with the accepted answer in my situation.
PS: I'm no cryptography expert nor some devsecop guru. So feel free to contradict, enrich, agree or disagree. I just like to keep my implementation details contained ;)
I have streaming strings (text containing words and number).
Taking one line at a time for streaming strings, I would like to assign a unique value to them.
the examples may be:strings with their scores/hash
User1 logged in Comp1 port8087 1109
User2 logged in comp2 1135
user3 logged in port8080 1098
user1 logged in comp2 port8080 1178
these string should be in same cluster. For this what i have thought is mapping(bad type of hashing) the strings such that the small change in the string wont affect the score that much.
One simple way of doing that may be: taking UliCp8, Ulic .... ( i.e. 1st letter of each sentence) and find some way of scoring. After then the similar scored strings are kept in same bucket and later on sub group them.
The improved method would be: lets not take out first word of each word of the string but find some way to take representative value of the word such that the string representation may be quite suitable for mapping with score/hash as i mention.
Considering Levenstein distance or jaccard_index or some similarity distance metrices, all of them require inputting the strings for comparisions. Isn't there any method to hash/score the string as stated without going for comparisions.( POS tagging, comparing looks uneffiecient for my purpose as the data are streaming, huge in number, unstructured)
Hope you understand what i want to achieve and please help me out. Forgot about the comments below and lets restart.
"at least two similar word (not considering length) should have similar hash value"
This is against the most basic requirements for a hash function. With a hash function also minimal changes to the input should produce vehement changes to the bucket the hash falls into.
You are looking for an algorithm that calculates the similarity or distance between two inputs.
As stated you are not looking for a hash function, rather something like the Levenshtein distance which is an algorithm for calculating a metric representing the degree of differences between two sequences of bits. It is commonly used to find out how similar/dissimilar two strings are. Hashing / message digests are good for creating identifiers for unique, distinct values but they will produce entirely different results for "similar" values.
You are interested in the similarity of strings. Here is a nice post that names a few resources that are used for measuring string similarity. Maybe Lucene could help you in your situation.
A friend of me ask this, and i was thinking of asking this here too..
"What kind of data are this, how are they encrypted, or decrypted?"
My friend told me he got this from facebook.
d9ca6435295fcd89e85bd56c2fd51ccc
It looks like it could be an md5 hash.
Basically a hash is a one-way function. The idea is that you take some input data and run it through the algorithm to create a value (such as the string above) that has a low probability of collisions (IE, two input values hashing to the same string).
You cannot decrypt a hash because there is not enough information in the resultant string to go back. However, it may be possible for someone to figure out your input values if you use a 'weak' hashing algorithm and do not do proper techniques such as salting a hash, etc.
I don't know how FaceBook uses hashes, but a common use for a hash might be to uniquely identify a page. For example, if you had a private image on a page, you might ask to generate a link to the image that you can email to friends. That link might use a hash as part of the URL since the value can be computed quickly, is reasonably unique, and has a low probability of a third party figuring it out.
This is actually a large topic that I am by no means doing justice to. I suggest googling around for hash, md5, etc to learn more, if you are so inclinded.
It is a sequence of 128 bits, encoded as a lower-case hex string.
If you are talking about a Facebook API key, there is no deeper meaning to decode from the bits. The keys are created at random by Facebook and assigned to a particular application to identify it. Each application gets a different set of random bits for its API key.
This appears the be the...
hexadecimal representation for...
- ... a 16 bytes encryption block or..
- ... some 128 bits hash code or even
- ... just for some plain random / identifying number.
(Hexadecimal? : note how there are only 0 thru 9 digits and a thru f letters.)
While the MD5 Hash guess suggested by others is quite plausible, it could be just about anything...
If it is a hash or a identifying / randomly assigned number, its meaning is external to the code itself.
For example it could be a key to be used to locate records in a database, or a value to be compared with the result of the hash function applied to the user supplied password etc.
If it is an encrypted value, its meaning (decrypted value) is directly found within the code, but it could be just about anything. Also, assuming it is produced with modern encryption algorithm, it could take a phenomenal amount of effort to crack the code (if at all possible).