I have sometimes seen, and been advised, to store strings and associative array keys as MD5 hash values. I have learnt about hashing from MIT OCW 6.046J, and it seems to be a scheme for storing data in a format that allows fast searching and prevents people from getting back the original.
But don't languages supporting associative arrays / dictionaries do this internally? What additional advantage is the MD5 hash giving?
Most languages support this internally; for example, see Java's hashCode(), which is used when storing keys in a HashMap:
Returns a hash code value for the object. This method is supported for the benefit of hash tables such as those provided by HashMap.
But there are scenarios where you want to do it yourself.
Scenario 1 - using as key in a database:
Let's suppose you have a big no-sql-ish database of letters and metadata of these letters. You want to be able to find a letter's metadata quickly without searching. What would your index be?
One option is using a running index that's unrelated to the letter's content, but then you have to search the database before being able to find a document's metadata. Another option is to create a signature for the document composed of its prefix (it's just one example out of many), but some documents may share this property ("Dear John,").
So how about taking into account the entire document? That's where you can use md5 as the row-key for your documents.
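In this scenario you're relying on having no collisions, and the argument in favour of this assumption usually mentions that your chances of running into a demented gorilla are (usually) greater. That argument covers accidental collisions only; MD5 collisions can be constructed deliberately, so it does not hold for adversarial input. The Secure Hash Algorithm family produces even fewer collisions.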
I mention this since databases normally do not do this out of the box (frameworks may...).
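A minimal sketch of the row-key idea in Python. The dictionary `metadata_store` is only a hypothetical stand-in for the database, and the letters and metadata are made up for illustration:

```python
import hashlib


def row_key(document: str) -> str:
    """Return the MD5 hex digest of the whole document, used as its row key."""
    return hashlib.md5(document.encode("utf-8")).hexdigest()


# Hypothetical in-memory stand-in for the metadata table.
metadata_store = {}

letter_a = "Dear John,\nI hope this finds you well."
letter_b = "Dear John,\nPlease return my lawnmower."

# Both letters share a prefix, but the full-content hash tells them apart.
metadata_store[row_key(letter_a)] = {"author": "Alice"}
metadata_store[row_key(letter_b)] = {"author": "Bob"}

# Looking up metadata needs only the document itself, not a search.
found = metadata_store[row_key(letter_a)]
```

The key is derived from the content alone, so anyone holding the document can recompute it.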
Scenario 2 - One-way hash for password storage:
note: This may no longer apply for md5, but it does for the SHA-family variants.
In this scenario, you want to store passwords in your database, but storing plain-text passwords has drawbacks if the database is compromised (users often share passwords across sites, which may lead to accounts on other sites being compromised as well). The usage of hashing here is storing the hashed password, and when a user attempts to log in you only compare the hash and not the password itself. This way you don't need the password stored locally and it is a lot harder to crack it.
I have a user table with a password column that uses md5 hash. Over time, some of its hashes were changed to plain text (users asked for immediate password change, without using the method that would apply hash).
I have a modest amount of data, so I will do it by hand, but I want to know: is there something along the lines of
select * from TableName where Column is not hashed
or
update from TableName
set Column = md5(current value)
where Column is not hashed
or something like that?
As noted previously, the use of MD5 for hashing private credentials, or for any other purpose within authentication and authorization, is highly discouraged.
However, your best chance to detect whether a field stores an MD5 value or a non-hashed value and convert it on-the-fly is something like the following:
UPDATE TableName
SET Column = md5(Column)
WHERE Column !~ '^[a-f0-9]{32}$'
There might be some remaining ones in case a really clever user generated an MD5 hash of something and used that directly as a password. That will not be detectable, but authentication must fail in such a case, as the stored value would not match the MD5 hash of the password entered at login.
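If you would rather check the values in application code before running the update, the same pattern can be sketched in Python. The function names are hypothetical, and like the SQL above this only matches lowercase hex:

```python
import hashlib
import re

# Same pattern as the WHERE clause: exactly 32 lowercase hex characters.
MD5_HEX = re.compile(r"[a-f0-9]{32}")


def looks_like_md5(value: str) -> bool:
    """True if the value has the shape of an MD5 hex digest."""
    return MD5_HEX.fullmatch(value) is not None


def normalize(value: str) -> str:
    """Hash values that are still plain text; leave digests untouched."""
    if looks_like_md5(value):
        return value
    return hashlib.md5(value.encode("utf-8")).hexdigest()
```

This shares the blind spot described above: a plain-text password that happens to be 32 hex characters is treated as already hashed.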
You should also not think about transferring plain-text passwords down to the database for hashing and comparison as the attack surface really is pretty high already. Even if you use decent TLS for your database connections who guarantees you that an administrator or attacker hasn't enabled logging slow statements with parameters directly at the database?
Instead, the application should use a library to generate a salt and the salted password hash itself, and transfer only the salt and hash to the storage. The format specified by crypt is industry accepted (and therefore highly recommended), and solid libraries are available for any programming language. Once a certain algorithm becomes deprecated, you can incrementally change it without a coordinated one-shot upgrade.
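As a rough sketch of hashing in the application, here is a Python version using PBKDF2 from the standard library. The `scheme$iterations$salt$digest` layout is a hypothetical crypt-like format chosen for illustration, not the real crypt format, and the iteration count is an assumed work factor; in practice a maintained password-hashing library is the better choice.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 600_000  # assumed work factor; tune for your hardware


def hash_password(password: str, iterations: int = ITERATIONS) -> str:
    """Return a self-describing string: scheme$iterations$salt$digest."""
    salt = secrets.token_bytes(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return f"pbkdf2_sha256${iterations}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the digest using the parameters recorded in `stored`."""
    scheme, iterations, salt_hex, digest_hex = stored.split("$")
    if scheme != "pbkdf2_sha256":
        return False  # a real system would dispatch on the scheme here
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
    )
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(candidate.hex(), digest_hex)
```

Because the scheme and parameters travel with each stored value, old entries can be re-hashed one at a time as users log in, which is the incremental upgrade path mentioned above.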
Hash-functions always create an output with a fixed length, even though the input can be infinitely large.
So how is it possible, that no information is lost here? Shouldn't some inputs result in the same output then?
Yes. Two inputs can result in the same output, resulting in a hash collision.
Hashes are designed so that hashing text is very easy, but reversing the process is difficult. The point of hashing isn't to store information. Instead, hashes are commonly used in security (and also data structures).
For instance, websites will hash a user's password and store the hash instead of the actual password. This way, if the website's security is breached, the attacker can only obtain the hashes, which still doesn't let the attacker log in, as it is very difficult to recover the password from its hash.
The hash set is another application of hashing. By hashing an object and using the hash to decide where the object is stored, you can check whether an object is present in the set in roughly constant time. You only have to search through the objects in the hash set that have the same hash as the object that you are checking. As the size of the hash set grows, so does the chance of a hash collision.
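A toy version of such a hash set, to make the bucket idea concrete. `TinyHashSet` is a made-up name and this is only a sketch; real implementations (including Python's built-in `set`) also resize as they fill up:

```python
class TinyHashSet:
    """A fixed-size hash set that resolves collisions with per-bucket lists."""

    def __init__(self, bucket_count: int = 8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, item):
        # The hash picks the bucket; only that bucket is ever searched.
        return self.buckets[hash(item) % len(self.buckets)]

    def add(self, item) -> None:
        bucket = self._bucket(item)
        if item not in bucket:
            bucket.append(item)

    def __contains__(self, item) -> bool:
        return item in self._bucket(item)
```

With 8 buckets, the integers 1 and 9 land in the same bucket (both are 1 modulo 8), so a lookup for either compares against both; that is the collision cost described above.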
So how is it possible, that no information is lost here?
It's not possible, and lots of information is lost.
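The loss is easy to see with a deliberately tiny hash. The sketch below keeps only the first byte of SHA-256 (the `tiny_hash` helper is made up for this example), so there are 256 possible outputs and any 257 distinct inputs must contain a collision:

```python
import hashlib


def tiny_hash(data: bytes) -> int:
    """First byte of SHA-256: only 256 possible outputs."""
    return hashlib.sha256(data).digest()[0]


def find_collision():
    """Return two distinct inputs with the same tiny hash, plus that hash."""
    seen = {}
    for i in range(257):  # 257 inputs, 256 outputs: a repeat is guaranteed
        data = str(i).encode("ascii")
        h = tiny_hash(data)
        if h in seen:
            return seen[h], data, h
        seen[h] = data
    raise AssertionError("unreachable by the pigeonhole principle")


first, second, shared = find_collision()
```

A full-size hash has the same property; the output space is just so large that stumbling on a repeat by accident is not a practical concern.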
In the case of a perfect hash there are no collisions, and we could even argue that information isn't really lost (it's just not contained in the system alone), because we know all possible inputs and know that the hashes produced do not collide. The hashes can still be used as an index in a way that isn't possible, or isn't as good, with the input data, so they are useful.
In the case of a hash-based collection we use a hash code to (hopefully) have few collisions so we get close to O(1) lookup, but have some means to handle it if a collision does happen.
In the case of a cryptographic hash we could have collisions, but it's extremely hard to produce one deliberately, for similar (roughly speaking) reasons as to why it's hard to break modern cryptography. So while you could have two passwords with the same hash, you couldn't find them easily (especially if you aren't going to e.g. have a password of several thousand pages of text).
In the case of a checksum hash we could have collisions, but that they're unlikely means that if we have corruption we probably won't have the matching hash.
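For the checksum case, a short sketch with CRC-32 from Python's zlib module. The message and the `is_intact` helper are made up for illustration:

```python
import zlib

original = b"The quick brown fox jumps over the lazy dog"
checksum = zlib.crc32(original)

# Simulate corruption by flipping a single bit in one byte.
damaged = bytearray(original)
damaged[4] ^= 0x01
corrupted = bytes(damaged)


def is_intact(data: bytes, expected: int) -> bool:
    """True if the data still produces the recorded checksum."""
    return zlib.crc32(data) == expected
```

Other inputs with the same CRC-32 certainly exist, but random corruption is unlikely to produce one of them, which is all a checksum needs.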
Is it a good idea to store two hashes for each password in a database (e.g. SHA-1 and MD5) and check both of the hashes in a login script to prevent collisions? On the other side, wouldn't it then be easier to calculate the password from the two hashes (for example if a hacker gets access to the database)?
This would probably not be useful.
Any hash function you'd use will be safe against accidental collisions -- they're almost impossibly unlikely. So the only collisions you're concerned with are when hackers have already compromised your database, and have your hashes, trying to figure out a password that generates the target hash.
This is called a preimage attack (or a "second preimage attack" when the attacker also knows an input that produces the hash), and it's incredibly hard. There are no known practical preimage attacks for any relatively recent algorithm, even going back to MD4. This shouldn't be a serious concern.
However, if you're using a generic hash function, then people brute-forcing your hacked hashes is a realistic concern. You shouldn't use a generic hash function like SHA-2, even with salts. You should use a password hash function, like bcrypt, which is resistant to brute-forcing. If you are using normal hash functions then, as you note, storing two means they only need to brute force the weaker one -- it's one more thing that can go wrong.
Don't bother. Use a password hash function instead. It will be safer and simpler.
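To illustrate why a second fast hash adds nothing, here is a small sketch of what an attacker with the leaked row would do. The row, the password and the word list are all made up; the point is only that the SHA-1 column is never needed:

```python
import hashlib


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def sha1_hex(text: str) -> str:
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# Hypothetical leaked row holding both digests of the same password.
leaked_row = {"md5": md5_hex("letmein"), "sha1": sha1_hex("letmein")}

# Hypothetical attacker word list.
wordlist = ["123456", "password", "letmein", "qwerty"]


def crack_with_one_hash(row, candidates):
    """Brute-force using only the MD5 column; the SHA-1 column is ignored."""
    for candidate in candidates:
        if md5_hex(candidate) == row["md5"]:
            return candidate
    return None


recovered = crack_with_one_hash(leaked_row, wordlist)
```

Whatever matches the cheaper digest also matches the other one, since both were computed from the same password.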
I really don't see how there can be any benefit from storing 2 hashes for the password and then checking both. All you're doing here is giving your app more work to do and in my opinion not providing any extra level of security as they are still entering the same password.
I know there are many questions on SALT and hashing passwords, but I have yet to find a tutorial to walk me through this in VS using the MVC pattern.
I currently have a DB created with a user table containing three columns:
userID(PK, int, not null)
password(varchar(45), not null)
loginID(varchar(8), null)
The password is saved as a visible string in the DB. After researching the issue, I assume password is easiest as binary instead of varchar. Does anyone know of a good tutorial to implement hashing and SALT into my program? One that clearly defines this in terms of the MVC pattern is preferred.
MVC doesn't have anything to do with salting your passwords, although someone might point to the proper libraries that might be used with your tech stack.
Salting involves using a specific sequence, and appending that to the end of user passwords, and then hashing that data.
The reason this is done is that a hash algorithm applied to a well known string is easily reversed by lookup. A person could, for example, run well known hash algorithms against a whole dictionary, and compare the results to the stored user password hashes to determine what each was hashed from. While a good hash function is a one way function (aka you can't find the input based on the output), if you had a dictionary to map from you could easily do it for well known strings/string combinations.
For example, the password password has a well known hash. When you attach a random sequence to the end (or start) and then hash that, it's a significantly less common hash as a result, and then it's significantly harder to reverse.
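A small sketch of that effect, assuming a tiny made-up word list and MD5 purely because it is the classic target of such tables (not a recommendation):

```python
import hashlib
import secrets


def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Precomputed table an attacker might build once and reuse everywhere.
common_passwords = ["123456", "password", "qwerty", "letmein"]
lookup_table = {md5_hex(word): word for word in common_passwords}

# Unsalted: the stored hash is found in the table immediately.
unsalted_hash = md5_hex("password")
cracked = lookup_table.get(unsalted_hash)

# Salted: a random per-user suffix makes the stored hash miss the table.
salt = secrets.token_hex(8)
salted_hash = md5_hex("password" + salt)
cracked_salted = lookup_table.get(salted_hash)
```

The salt is stored next to the hash, so an attacker can still try the dictionary against one user at a time; what is lost is the ability to reuse one table for everyone.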
Sorry for not having the specific technologies related, but I wanted to communicate the general higher level concept of it since the over-focus on the technologies loses the bigger picture.
I was wondering - is there any disadvantages in using the hash of something as a salt of itself?
E.g. hashAlgorithm(data + hashAlgorithm(data))
This prevents the usage of lookup tables, and does not require the storage of a salt in the database. If the attacker does not have access to the source code, he would not be able to obtain the algorithm, which would make brute-forcing significantly harder.
Thoughts? (I have a gut feeling that this is bad - but I wanted to check if it really is, and if so, why.)
If the attacker does not have access to the source code
This is called "security through obscurity", which is always considered bad. An inherently safe method is always better, even if the only difference lies in the fact that you no longer feel safe merely "because they don't know how". Someone can and will always find the algorithm -- through careful analysis, trial-and-error, or because they found the source by SSH-ing to your shared hosting service, or any of a hundred other methods.
Using a hash of the data as salt for the data is not secure.
The purpose of salt is to produce unpredictable results from inputs that are otherwise the same. For example, even if many users select the same input (as a password, for example), after applying a good salt, you (or an attacker) won't be able to tell.
When the salt is a function of the data, an attacker can pre-compute a lookup table, because the salt for every password is predictable.
The best salts are chosen from a cryptographic pseudo-random number generator initialized with a random seed. If you really cannot store an extra salt, consider using something that varies per user (like a user name), together with something application specific (like a domain name). This isn't as good as a random salt, but it isn't fatally flawed.
Remember, a salt doesn't need to be secret, but it cannot be a function of the data being salted.
This offers no improvement over just hashing. Use a randomly generated salt.
The point of salting is to make it so two chronologically distinct values' hashes differ, and by so doing breaks pre-calculated lookup tables.
Consider:
data = "test"
hash = hash("test"+hash("test"))
Hash will be constant whenever data = "test". Thus, if the attacker has the algorithm (and the attacker always has the algorithm) they can pre-calculate hash values for a dictionary of data entries.
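The same point as a runnable sketch, with SHA-256 standing in for hashAlgorithm and a made-up two-word dictionary:

```python
import hashlib
import secrets


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def self_salted(data: str) -> str:
    """hashAlgorithm(data + hashAlgorithm(data)) from the question."""
    return sha256_hex(data + sha256_hex(data))


def random_salted(data: str, salt: str = None):
    """Hash with a random salt; return (salt, digest) so both can be stored."""
    if salt is None:
        salt = secrets.token_hex(16)
    return salt, sha256_hex(data + salt)


# An attacker who knows the scheme can precompute it like any other hash.
precomputed = {self_salted(word): word for word in ["test", "password"]}
recovered = precomputed.get(self_salted("test"))
```

Two users who both pick "test" get identical self-salted values, while two calls to `random_salted("test")` give different digests that can each still be verified using the stored salt.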
This is not salt - you have just modified the hash function. Instead of using lookup table for the original hashAlgorithm, attacker can just get the table for your modified one; this does not prevent the usage of lookup tables.
It is always better to use true random data as salt. Imagine an implementation where the username is taken as the salt value. This would lead to reduced security for common names like "root" or "admin".
If you don't want to create and manage a salt value for each hash, you could use a strong application-wide salt. In most cases this would be absolutely sufficient, and many other things would be more vulnerable than the hashes.