What is a hash, exactly?

I am learning about MD5. I found the term 'hash' in most descriptions of MD5. I googled 'hash', but I could not find an exact definition of 'hash' in computer programming.
Why do we use 'hash' in computer programming? What is the origin of the word?

I would say any answer must be guesswork, so I will make this a community wiki.
Hash, or hash browns, is a breakfast food made by cutting potatoes into long thin strips (smaller than french fries, and shorter, but proportionally similar), then frying the mass of strips in animal or vegetable fat until browned, stuck together, and cooked. By analogy, 'hashing' a number meant turning it into another, usually smaller, number using a method that still depends deterministically on the input number.
I believe the term "hash" was first used in the context of "hash table", which was commonly used in the 1960's on mainframe-type machines. In these cases, usually an integer value with a large range is converted to a "hash table index" which is a small integer. It is important for an efficient hash table that the "hash function" be evenly distributed, or "flat."
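As a rough illustration of the "hash table index" idea above, here is a minimal Python sketch. The table size of 13 and the plain modulo method are assumptions chosen for the example, not something taken from a particular mainframe implementation.

```python
NUM_BUCKETS = 13  # hypothetical table size; a prime is a common choice


def hash_index(key: int) -> int:
    """Map an integer with a large range to a small table index."""
    return key % NUM_BUCKETS


# Place a few large keys into the table by their index.
buckets = [[] for _ in range(NUM_BUCKETS)]
for key in (1_000_003, 52_718_281, 314_159_265, 999_999_937):
    buckets[hash_index(key)].append(key)
```

A "flat" hash function would spread many such keys roughly evenly over the 13 buckets, so that each lookup only has to scan a short list.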
I don't have a citation; that is how I have understood the analogy since I heard it in the 80's. Someone must have been there when the term was first applied, though.

A hash value (or simply hash), also
called a message digest, is a number
generated from a string of text. The
hash is substantially smaller than the
text itself, and is generated by a
formula in such a way that it is
extremely unlikely that some other
text will produce the same hash value.

You're referring to the "hash function". It is used to generate a value that is, for practical purposes, unique for a given set of parameters.
One great use of a hash is password security. Instead of saving a password in a database, you save a hash of the password.
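A minimal sketch of storing a password hash instead of the password, using only Python's standard library. PBKDF2 with SHA-256, the 16-byte salt, and the iteration count are assumptions made for this example; the answer above does not name a specific algorithm.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

ITERATIONS = 200_000  # assumed work factor for this sketch


def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Return (salt, digest); a fresh random salt is created if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the attempt with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)
```

The database would then hold the salt and the digest, never the password itself.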

A hash is supposed to be a unique combination of values from 00 to FF (hexadecimal) that represents a certain piece of data, be it a file or a string of bytes. It is used primarily for password storage and verification, and to test whether a file is the same as another file (i.e. you hash two files; if the hashes match, the files are almost certainly identical).
Generally, the SHA-2 algorithms (such as SHA-256) are preferred over MD5, because practical collision attacks against MD5 are known. See this Wikipedia article.
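The file comparison described above could look roughly like this in Python. The function names and the chunk size are illustrative choices, and SHA-256 is used here as one member of the SHA family.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """Treat two files as identical when their digests match."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```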

According to the Wikipedia article on hash functions, Donald Knuth in the Art of Computer Programming was able to trace the concept of hash functions back to an internal IBM memo by Hans Peter Luhn in 1953.
And just for fun, here's a scrap of overheard conversation quoted in Two Women in the Klondike: the Story of a Journey to the Gold Fields of Alaska (1899):
They'll have to keep the hash table going all day long to feed us. 'T will be a short order affair.

A hash function maps its input to a value. For passwords it is combined with a salt value, and everybody says the salt must be stored together with the hash so that the same password still matches later and new passwords still work. A mathematically related concept is the bijection.

Adding to gabriel1836's answer, one of the important properties of a hash function is that it is a one-way function, which means you cannot feasibly recover the original string from its hash value.

Related

How do I store Argon2 passwords in my database?

I'm trying to store user passwords in my DB using Argon2 algorithm.
This is what I obtain by using it:
$ echo -n "password" | argon2 "smallsalt" -id -t 4 -m 18 -p 4
Type: Argon2id
Iterations: 4
Memory: 262144 KiB
Parallelism: 4
Hash: cb4447d91dd62b085a555e13ebcc6f04f4c666388606b2c401ddf803055f54ac
Encoded: $argon2id$v=19$m=262144,t=4,p=4$c21hbGxzYWx0$y0RH2R3WKwhaVV4T68xvBPTGZjiGBrLEAd34AwVfVKw
1.486 seconds
Verification ok
In this case, what should I store in the DB?
The "encoded" value as shown above?
The "hash" value as shown above?
Neither, but another solution?
Please, could you help me? I'm a newbie with this and I'm a little bit lost.
I'm a bit late to the party, but I disagree with the previous answers.
You should store the field: Encoded
The $argon2id$.... value.
(At least if you are using normal Argon2 libraries having the verify() function.
It does not look like the man-page for argon2 command does this, however.
Only if you are stuck with the command line, you should consider storing each field individually.)
The $argon2id$ encoded hash
The argon2 encoded hash follows the same syntax as its older cousin, bcrypt.
The encoded hash includes all you ever need to verify the hash when the user logs in.
It is most likely more future-proof. When a newer and better argon2 comes along, you can upgrade your one-column hashed passwords, just like you could detect bcrypt's $2a$-hashes and re-hash them as $argon2id$-hashes the next time the user logs in (if you were moving from bcrypt to argon2).
TL;DR
Store the $-encoded value encoded_hash in your database.
Use argon2.verify(password, encoded_hash) to verify that the password is correct.
Don't bother about all the values inside the hash. Let the library do that for you. :)
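To see why the single encoded value is enough, here is a small Python sketch that takes apart the encoded string from the question using only the standard library. The helper names are made up for this example. In real code a library's verify function would do this for you; the sketch only shows that the algorithm, the cost parameters, the salt, and the hash are all recoverable from that one column.

```python
import base64

ENCODED = (
    "$argon2id$v=19$m=262144,t=4,p=4$c21hbGxzYWx0$"
    "y0RH2R3WKwhaVV4T68xvBPTGZjiGBrLEAd34AwVfVKw"
)


def b64decode_unpadded(text: str) -> bytes:
    """The encoded form omits '=' padding, so restore it before decoding."""
    return base64.b64decode(text + "=" * (-len(text) % 4))


def parse_encoded(encoded: str) -> dict:
    """Split the $-separated fields into a dictionary (hypothetical helper)."""
    _, variant, version, params, salt_b64, hash_b64 = encoded.split("$")
    costs = dict(item.split("=") for item in params.split(","))
    return {
        "variant": variant,
        "version": int(version.split("=")[1]),
        "memory_kib": int(costs["m"]),
        "iterations": int(costs["t"]),
        "parallelism": int(costs["p"]),
        "salt": b64decode_unpadded(salt_b64),
        "hash": b64decode_unpadded(hash_b64),
    }


fields = parse_encoded(ENCODED)
print(fields["variant"], fields["salt"], fields["memory_kib"])
```

Note that the memory value 262144 is 2**18, which corresponds to the `-m 18` flag on the command line in the question.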
Neither. Save following as a single value:
algorithm ID (e.g. argon2id)
salt
number of iterations (4)
memory usage factor (18)
parallelism (4)
The output of the field "encoded" is misleading because you cannot feed it back to the command line as is for a password check (i.e. for hash generation): e.g. it shows m=262144 (KiB), whereas the command line expects the original factor -m 18 (2^18 = 262144).
Are you going to launch an OS process each time you check password? I would discourage you from doing this. I'd suggest you use a library (C++, Java, ...). They produce a string that contains all these data concatenated and separated with "$".
I'd put the type, iterations, memory, parallelism, hash, salt and corresponding user id into separate columns and leave the encoded bit out, because it's just all the attributes joined together. If they're in separate columns then you can reference the attributes more easily than having to split and index the encoded string.
The other option is to just store the encoded string in one column, but as I said, it's more tedious to look at certain attributes, as you'd have to split the encoded string and then index it.
I had the same question and read this post while gathering some information. Now after some days and thoughts about all this, I'll personally take a different route than the accepted answer and therefore slightly disagree with it. I thought I would share my perspective so that it might help others as well.
I suppose it will depend on everyone's context. I don't think there is a one size fits all answer here. I'm sure there are situations where it is perfectly valid and even better/simpler to store the encoded string ($argon2...).
However, I would argue that depending on the context, storing the encoded string doesn't seem to be the right approach.
First of all, it makes the hashing method very obvious. It is probably not that important, but for some reason it makes me a bit more comfortable not having it ^^. More importantly, it means that implementation details are stored in your persistence layer (db or else). At the time of writing, argon2id is the hashing mechanism recommended by OWASP, but these things can change (and eventually do change...). Some day it might be considered insecure, or another function will be considered more secure.
As a result, I would suggest this more function "agnostic" starting point:
The hash (for argon2 -> the hex string)
The salt
The last_modified date
A string with hashing parameters (for argon2, you could put the parameters here in the form of your choosing)
The last_modified date lets you know whether the hash needs updating, and the parameters let you support the verification and update of "old" hashes.
Of course this means that you have to work a bit more in the code and can't simply use every library's shortcuts straight away. But I would say that this increased complexity offers more flexibility in other circumstances (like moving away from a given hashing function). As always, there is no free lunch.
That's why I suppose it depends on your context and why personally I wouldn't go with the accepted answer in my situation.
PS: I'm no cryptography expert nor some devsecop guru. So feel free to contradict, enrich, agree or disagree. I just like to keep my implementation details contained ;)

Difference between preimage resistance and second-preimage resistance

Wikipedia says:
preimage resistance: for essentially all pre-specified outputs, it is computationally infeasible to find any input which hashes to that output, i.e., it is difficult to find any preimage x given a "y" such that h(x) = y.
second-preimage resistance: it is computationally infeasible to find any second input which has the same output as a specified input, i.e., given x, it is difficult to find a second preimage x' ≠ x such that h(x) = h(x′).
Yet, I don't understand it. Doesn't h(x′) (where x' is input) generate that y (the output), which is then compared to the same h(x)?
Say, I have a string "example". It generates the MD5 "1a79a4d60de6718e8e5b326e338ae533". Why is it different to just use the MD5 compared to doing the MD5(example)?
Ideal hashing is like taking the fingerprint of a person: it is unique, it is non-reversible (you can't get the whole person back just from the fingerprint), and it can serve as a short and simple identifier for the given person.
If we bring some of the terminology you introduced into our analogy, we see that preimage resistance refers to the hash function's ability to be non-reversible. Imagine if you could generate the likeness of a whole person from their fingerprint, aside from being really cool, this could also be very dangerous. For the same reason, hash functions must be made so that an attacker cannot find the original message that generated the hash. In that sense, hash functions are one-way in that the message generates the hash and not the other way round.
Second preimage resistance refers to a given hash function's ability to be unique. Forensic fingerprinting would be a gross waste of time if any number of individuals could share the same fingerprint (let's exclude identical twins for now. Edit: See Det's comment below). If a given hash was used for verification against data corruption, it would be quite pointless if there were a good chance that corrupt data could generate the same hash.
To have both preimage resistance and second preimage resistance, hash functions adopt several traits to help them. One trait very common for hash functions is that the given input has no visible correspondence to the output. A single-bit change typically flips about half of the output bits, so the new hash looks completely unrelated to the hash of the original input. For this reason, a good hash function is commonly used in message authentication.
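A quick way to observe this trait is to flip one input bit and count how many output bits change. The sketch below uses SHA-256 from Python's standard library; the input string and the helper name are arbitrary choices for the example.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the output bits that differ between the SHA-256 hashes of a and b."""
    hash_a = int.from_bytes(hashlib.sha256(a).digest(), "big")
    hash_b = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(hash_a ^ hash_b).count("1")


original = b"example"
# Flip the lowest bit of the first byte, leaving everything else unchanged.
flipped = bytes([original[0] ^ 0x01]) + original[1:]

changed_bits = bit_difference(original, flipped)
print(f"{changed_bits} of 256 output bits differ")
```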
Whilst you are right comparing the original message directly would be functionally equivalent to comparing the hashes, it is simply not feasible in the majority of cases. For example:
If party A wanted to reliably send a message to party B, party A/B would need to agree upon a scheme to detect data corruption during transfer. Note: party B does not have the original message until party A sends it.
A possible scheme of transfer could be to transfer the message twice such that party B can verify if the second message equals the first. The problem with this is that there is a chance that corruption can occur twice in the same place (as well as the significantly higher bandwidth). This can only be reduced by sending the messages even more times, incurring severe bandwidth costs.
As an alternative, party A can pass his/her long message into a hash function and generate a short hash, which he/she sends to party B, followed by the original message. Party B can then take the received message, pass it into the hash function, and compare the hashes. If either the message or the hash got corrupted, even by a single bit, during transfer, the resultant hashes will almost certainly not match, thanks to second preimage resistance (no two plaintexts should have the same hash).
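The scheme in the paragraph above can be sketched in a few lines of Python. The function names are hypothetical and SHA-256 is an assumed choice of hash function; the transfer itself is left out.

```python
import hashlib
from typing import Tuple


def make_packet(message: bytes) -> Tuple[bytes, bytes]:
    """Party A: compute the short hash that is sent along with the message."""
    return hashlib.sha256(message).digest(), message


def is_intact(received_hash: bytes, received_message: bytes) -> bool:
    """Party B: re-hash the received message and compare with the received hash."""
    return hashlib.sha256(received_message).digest() == received_hash


sent_hash, sent_message = make_packet(b"a long message from party A")

# Simulate a single flipped bit in transit.
corrupted = bytes([sent_message[0] ^ 0x01]) + sent_message[1:]
print(is_intact(sent_hash, sent_message), is_intact(sent_hash, corrupted))
```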
Preimage resistance in this case would be useful if the message is encrypted during transfer but the hash was taken prior to encryption (whether this is appropriate is another discussion). If the hash was reversible, an eavesdropper could intercept the hash and reverse it to find the original message.
Not all hash functions are equal; that's why it's important to consider their preimage resistance and second preimage resistance when choosing which ones to use, which ones are secure, and which ones should be deprecated and replaced.
You understood preimage and second preimage resistance? It says the output of a hash function is unique, at least in theory. And obtaining the original string from a hash is "computationally" infeasible. It is possible (by brute force), but it takes up a lot of time and resources.
Now, the output of a hash function and the string itself are different. For example, consider a website with a dashboard. You provide your user_id and password at the time of signing up. If the website stores your password as such in their database, it is accessible to a hacker, who can then access your account. But if a hash of your password is stored, even if he manages to break into the server, that hash is of no use to him, because he cannot access your account without your password, and it is computationally infeasible to obtain your password from the hash (preimage resistance). Comparing md5(yourpassword) with the hash stored in the db is different. Each time you enter your password, it is hashed with the same hash function and compared to the existing hash. According to second-preimage resistance, if you entered an incorrect password, the hashes won't match.
Another example of hashing is in version control or source control mechanisms. To track changes in a file, hashing is used. They hash the entire file and keep the hash. If a file is modified, its hash changes accordingly.
These are all examples explaining what you asked.
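To make the two definitions from the question concrete, here is a toy Python sketch with a deliberately weak 16-bit hash (the first two bytes of SHA-256). The truncation is an assumption made only so that brute force finishes quickly; with a full-size hash the same loops would be computationally infeasible, which is exactly what the two resistance properties demand.

```python
import hashlib
from itertools import count


def tiny_hash(data: bytes) -> int:
    """Deliberately weak 16-bit hash: the first two bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")


def find_preimage(target: int) -> bytes:
    """Preimage attack: given only an output y, find any x with h(x) == y."""
    for i in count():
        candidate = str(i).encode()
        if tiny_hash(candidate) == target:
            return candidate


def find_second_preimage(x: bytes) -> bytes:
    """Second-preimage attack: given an input x, find x2 != x with h(x2) == h(x)."""
    target = tiny_hash(x)
    for i in count():
        candidate = str(i).encode()
        if candidate != x and tiny_hash(candidate) == target:
            return candidate
```

The difference is in what the attacker starts from: an output value in the first case, a known input in the second.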

What kind of encrypted data is this?

A friend of mine asked this, and I thought I would ask it here too.
"What kind of data is this, and how is it encrypted or decrypted?"
My friend told me he got this from facebook.
d9ca6435295fcd89e85bd56c2fd51ccc
It looks like it could be an md5 hash.
Basically a hash is a one-way function. The idea is that you take some input data and run it through the algorithm to create a value (such as the string above) that has a low probability of collisions (i.e., two input values hashing to the same string).
You cannot decrypt a hash because there is not enough information in the resultant string to go back. However, it may be possible for someone to figure out your input values if you use a 'weak' hashing algorithm and do not apply proper techniques such as salting the hash.
I don't know how Facebook uses hashes, but a common use for a hash might be to uniquely identify a page. For example, if you had a private image on a page, you might ask to generate a link to the image that you can email to friends. That link might use a hash as part of the URL since the value can be computed quickly, is reasonably unique, and has a low probability of a third party figuring it out.
This is actually a large topic that I am by no means doing justice to. I suggest googling around for hash, md5, etc. to learn more, if you are so inclined.
It is a sequence of 128 bits, encoded as a lower-case hex string.
If you are talking about a Facebook API key, there is no deeper meaning to decode from the bits. The keys are created at random by Facebook and assigned to a particular application to identify it. Each application gets a different set of random bits for its API key.
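The "128 bits as a hex string" observation can be checked mechanically. This is a small Python sketch with a hypothetical helper name; it only inspects the shape of the string and cannot tell a hash from a random key.

```python
import string


def describe_hex_token(token: str) -> dict:
    """Report whether a string is hexadecimal and how many bits it encodes."""
    is_hex = len(token) > 0 and all(ch in string.hexdigits for ch in token)
    return {
        "is_hex": is_hex,
        "bits": len(token) * 4 if is_hex else None,  # each hex digit is 4 bits
        "md5_sized": is_hex and len(token) == 32,    # 32 hex digits = 128 bits
    }


info = describe_hex_token("d9ca6435295fcd89e85bd56c2fd51ccc")
print(info)
```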
This appears to be the...
hexadecimal representation of...
- ... a 16-byte encryption block or...
- ... some 128-bit hash code or even
- ... just some plain random / identifying number.
(Hexadecimal? : note how there are only 0 thru 9 digits and a thru f letters.)
While the MD5 Hash guess suggested by others is quite plausible, it could be just about anything...
If it is a hash or an identifying / randomly assigned number, its meaning is external to the code itself.
For example it could be a key to be used to locate records in a database, or a value to be compared with the result of the hash function applied to the user supplied password etc.
If it is an encrypted value, its meaning (decrypted value) is directly found within the code, but it could be just about anything. Also, assuming it is produced with modern encryption algorithm, it could take a phenomenal amount of effort to crack the code (if at all possible).

How can I get the original value from a hash value?

My original Text : "sanjay"
SHA-1 Text : "25ecbcb559d14a98e4665d6830ac5c99991d7c25"
Now how can I get the original value - "sanjay" - from this hash value?
Is there any code, algorithm, or method?
No. That's usually the point -- the process of hashing is normally one-way.
This is especially important for hashes designed for passwords or cryptography, which differ from hashes designed for, say, hash-maps. Also, with an unbounded input length, there are infinitely many values which result in the same hash.
One method that can be used is to hash a bunch of values (e.g. brute-force from aaaaaaaa to zzzzzzzz) and see which value has the same hash. If you have found this, you have found "the value" (the time is not cheap). "Rainbow tables" work on this idea (but use space instead of time), but are defeated by a random per-password salt.
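A minimal Python sketch of that brute-force idea, limited to lowercase strings of up to three letters so that it finishes instantly. The target "cab" is a made-up stand-in; searching for a six-letter value such as the one in the question works the same way but takes far longer.

```python
import hashlib
from itertools import product
from string import ascii_lowercase
from typing import Optional


def brute_force_sha1(target_hex: str, max_length: int = 3) -> Optional[str]:
    """Try every lowercase string up to max_length and compare SHA-1 hashes."""
    for length in range(1, max_length + 1):
        for letters in product(ascii_lowercase, repeat=length):
            candidate = "".join(letters)
            if hashlib.sha1(candidate.encode()).hexdigest() == target_hex:
                return candidate
    return None  # nothing in the searched space matches


target = hashlib.sha1(b"cab").hexdigest()
print(brute_force_sha1(target))  # prints "cab"
```

Each extra character multiplies the search space by 26, which is why this approach stops being practical for long or salted inputs.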
Even if you were the one that turned your value into a hash value, and you have full access to the hash function, that does not let you run it backwards. Whether or not you know what hash function was used, you can't really reverse it without doing what was said above (going over every possibility).

How can I generate a unique, small, random, and user-friendly key?

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question; they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to less than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32bit id from the database. We then inserted it into the center bits of a 64bit RANDOM integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9 skipping easily confused characters such as L,l,1,O,0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
I was never satisfied with this particular implementation. What would you guys have done?
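For readers who want to experiment, the scheme described in the question could be sketched in Python roughly as follows. The exact 54-character alphabet, the 16/32/16 bit split, and the function names are assumptions; the question does not give them, and this sketch does not try to reproduce the 9-character length of the original codes.

```python
import secrets

# Assumed alphabet: letters and digits minus look-alikes (I, L, O, i, l, o, 0, 1).
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"
BASE = len(ALPHABET)  # 54


def make_code(sequential_id: int) -> str:
    """Embed a 32-bit id in the center of a 64-bit random value, then base-54 encode."""
    if not 0 <= sequential_id < 2 ** 32:
        raise ValueError("sequential_id must fit in 32 bits")
    high = secrets.randbits(16)  # random high bits
    low = secrets.randbits(16)   # random low bits
    value = (high << 48) | (sequential_id << 16) | low
    digits = []
    while True:
        value, remainder = divmod(value, BASE)
        digits.append(ALPHABET[remainder])
        if value == 0:
            break
    return "".join(reversed(digits))


def extract_id(code: str) -> int:
    """Decode the base-54 string and pull the sequential id back out."""
    value = 0
    for ch in code:
        value = value * BASE + ALPHABET.index(ch)
    return (value >> 16) & 0xFFFFFFFF
```

Because the id can be extracted again, such a code looks random but is only obfuscated, which matches the question's own assessment.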
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so obscure words are removed and that words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list, concatenate them, and add a random 3-digit number.
I can also randomize word selection pattern between verb/nouns like
eatCake778
pickBasket524
rideFlyer113
etc..
The case needn't be camel case; you can randomize that as well. You can also randomize the placement of the number and the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to make sure that my algorithm rarely collides. If the collision rate was high, then I'd play with the parameters (number of nouns used, number of verbs used, length of random number, total number of words, different kinds of casings etc.)
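A small Python sketch of the verb + noun + number idea from this answer. The two word lists are tiny hypothetical stand-ins for a filtered dictionary, and the function names are made up for the example.

```python
import secrets

# Hypothetical word lists; a real list would come from a filtered dictionary.
VERBS = ["eat", "pick", "ride", "paint", "carry"]
NOUNS = ["Cake", "Basket", "Flyer", "River", "Lantern"]


def make_word_key() -> str:
    """Concatenate a random verb, a random noun, and a zero-padded 3-digit number."""
    number = secrets.randbelow(1000)
    return f"{secrets.choice(VERBS)}{secrets.choice(NOUNS)}{number:03d}"


def keyspace_size() -> int:
    """Number of distinct keys this scheme can produce."""
    return len(VERBS) * len(NOUNS) * 1000


print(make_word_key(), keyspace_size())
```

With only five words per list the key space is 25,000 keys, so collisions would show up quickly; the size of the word lists is the main parameter to tune.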
In .NET you can use the RNGCryptoServiceProvider method GetBytes() which will "fill an array of bytes with a cryptographically strong sequence of random values" (from ms documentation).
using System.Security.Cryptography;
// Fill a 4-byte buffer with cryptographically strong random values.
byte[] randomBytes = new byte[4];
using (var rng = new RNGCryptoServiceProvider())
    rng.GetBytes(randomBytes);
You can increase the length of the byte array and pluck out the character values you want to allow.
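The "pluck out the characters you want" step could look like this in Python, shown here instead of C# purely for illustration. The 31-character alphabet is an assumption, and bytes that would introduce modulo bias are discarded rather than reused.

```python
import secrets

# Assumed set of allowed characters (no look-alikes such as I, L, O, 0, 1).
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"


def random_code(length: int = 5) -> str:
    """Turn cryptographically strong random bytes into characters from ALPHABET."""
    # Largest multiple of len(ALPHABET) that fits in a byte; bytes at or above
    # it are skipped so every character stays equally likely.
    limit = 256 - (256 % len(ALPHABET))
    chars = []
    while len(chars) < length:
        for byte in secrets.token_bytes(length):
            if byte < limit and len(chars) < length:
                chars.append(ALPHABET[byte % len(ALPHABET)])
    return "".join(chars)


print(random_code())
```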
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
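If the application can check identifiers for uniqueness, one simple approach is to generate a random key and retry on the rare collision. The sketch below uses an in-memory Python set as a stand-in for that check; in a real application it would typically be a database lookup or a unique constraint. The alphabet, key length, and function name are assumptions for the example.

```python
import secrets

# Assumed set of allowed characters (no look-alikes such as I, L, O, 0, 1).
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZ23456789"


def new_unique_key(existing: set, length: int = 5, max_attempts: int = 100) -> str:
    """Generate a random key, retrying until it is not already in `existing`."""
    for _ in range(max_attempts):
        key = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if key not in existing:
            existing.add(key)
            return key
    raise RuntimeError("key space looks exhausted; use longer keys")


issued = set()
print(new_unique_key(issued))
```

Short keys stay hard to guess only while the key space is much larger than the number of keys issued, so the length has to grow with the number of resources.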
If by user friendly you mean that a user could type the answer in, then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error-prone string.
If, though, you're looking for a way to encode a random code in the URL string, which is an issue I've dealt with for a while, then what I have done is use Base64-encoded GUIDs.
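Assuming the encoded GUIDs mentioned above are Base64-encoded, a Python sketch of the idea might look like this. It uses the URL-safe Base64 variant and strips the padding, which turns a 36-character GUID into a 22-character token; the function names are made up for the example.

```python
import base64
import uuid


def guid_to_token(guid: uuid.UUID) -> str:
    """Encode the 16 raw GUID bytes as URL-safe Base64 without '=' padding."""
    return base64.urlsafe_b64encode(guid.bytes).rstrip(b"=").decode("ascii")


def token_to_guid(token: str) -> uuid.UUID:
    """Restore the padding and decode the token back into a GUID."""
    padded = token + "=" * (-len(token) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


guid = uuid.uuid4()
token = guid_to_token(guid)
print(guid, token)
```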
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.