Is it possible to estimate how unique the result will be if I MD5-hash e-mail addresses and truncate the hash to 20 characters?
Does it matter whether I truncate from the beginning or the end?
The result will be used as the sAMAccountName in AD.
The number of possible unique results will be reduced from 2^128 (32 hex characters × 4 bits each) to 2^80 (20 characters × 4 bits each).
Where you truncate (beginning, middle, or end) doesn't matter: MD5 distributes its entropy uniformly across the output, so any 20 hex characters carry the same 80 bits. Truncation only increases the possibility of a collision, and at 2^80 possibilities you shouldn't have any issue.
What would happen in the event of a collision? You would probably be safe to assume that adversaries would be able to create two e-mail addresses that hash to the same value under MD5.
If this is a concern, you may want to consider an alternative scheme (like an HMAC), which would be much harder to create a collision for, especially without knowledge of the HMAC key.
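A minimal sketch of such an HMAC-based scheme in Python (the key value and function name here are placeholders, not from the question):
import hmac
import hashlib

# Hypothetical secret key -- in practice this would come from secure storage.
HMAC_KEY = b"replace-with-a-real-secret"

def account_name_from_email(email: str) -> str:
    # HMAC-SHA256 instead of plain MD5: without the key, an adversary
    # cannot craft two e-mail addresses that map to the same identifier.
    mac = hmac.new(HMAC_KEY, email.strip().lower().encode("utf-8"), hashlib.sha256)
    # Truncate the hex output to 20 characters (80 bits), the same budget
    # discussed above for sAMAccountName.
    return mac.hexdigest()[:20]

print(account_name_from_email("alice@example.com"))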
Related
I am comparing personal info of individuals, specifically their name, birthdate, gender, and race by hashing a string containing all of this info, and comparing the hash objects' hexdigests. This produces a 32 digit hexadecimal number, which I am using as a primary key in a database. For example, using my identifying string would work like this:
>>> import hashlib
>>> id_string = "BrianPeterson08041993MW"
>>> byte_string = id_string.encode('utf-8')
>>> hash_id = hashlib.md5(byte_string).hexdigest()
>>> print(hash_id)
3b807ad8a8b3a3569f098a575091bc79
At this point, I am trying to ascertain collision risk. My understanding is that MD5 doesn't have significant collision risk, at least for strings that are relatively small, which mine are (about 20-40 characters in length). However, I am not using the 128-bit digest object, but the 32 digit hexdigest.
Now, I believe the hexdigest is a compression of the digest (that is, it's stored in fewer characters), so isn't there an increased risk of collision when comparing hexdigests? Or am I off-base?
[...]
I guess my question is: don't different representations have different chances to be non-unique based on how many units of information they use to do the representation vs. how many units of information the original message takes to encode? And if so, what is the best representation to use? Um, let me preface your next answer with: talk to me like I'm 10
Old question, but yes, you were a bit off base, so to speak.
It's the number of random bits that matters, not the length of the representation.
The digest is just a number, an integer, which can be converted to a string using different numbers of distinct digits (different bases). For example, the same 128-bit number shown in a few different radices:
"340106575100070649932820283680426757569" (base 10)
"ffde24cb47ecbff8d6e461a67c930dc1" (base 16, hexadecimal)
"7vroicmhvcnvsddp31kpu963e1" (base 32)
Shorter is nicer and more convenient (in auth tokens, etc.), but each representation carries exactly the same information and the same chance of collision. Shorter representations are shorter for the same reason that "55" is shorter than "110111", while still encoding the same number.
This answer might also clarify things, as well as toying with code like:
new BigInteger("340106575100070649932820283680426757569").toString(2)
...or something equivalent in other languages (Java/Scala above).
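A Python equivalent of the same experiment (the input string below is arbitrary); all three printed strings encode the same 128-bit number:
import hashlib

digest = hashlib.md5(b"example input").digest()  # 16 raw bytes
n = int.from_bytes(digest, "big")                # the same value as one integer

print(n)               # base 10: up to 39 digits
print(format(n, "x"))  # base 16: the familiar 32-character hexdigest
print(format(n, "b"))  # base 2: up to 128 digits
# Different lengths, identical information, identical collision probability.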
On a more practical level,
[...] which I am using as a primary key in a database
I don't see why you wouldn't do away with any chance of collision by using a normal auto-incremented id column (BIGINT AUTO_INCREMENT in MySQL, BIGSERIAL in PostgreSQL).
An abbreviated 32-bit hexdigest (8 hex characters) would not be long enough to effectively guarantee a collision-free database of users.
The formula for the birthday collision probability is here:
What is the probability of md5 collision if I pass in 2^32 sets of string?
Using a 32-bit key would mean that your software would start to break at around 10,000 users. The collision probability would be about 1%. It gets a lot worse very fast after that. At 100,000 users, the collision probability is 69%.
A 64-bit key starts to break at around 1 billion users, with a collision probability of about 2.7%.
For 100 billion users (a generous upper bound on the earth's population for the foreseeable future), a 96-bit key is a little risky in my opinion: the collision chance is roughly one in 16 million. Really, you need a 128-bit key, which gives you a collision rate of about 1×10^-17.
128-bit keys are 128/4 = 32 hex characters long. If you wanted a shorter key for aesthetic purposes, you would need 22 alphanumeric characters (about 5.95 bits each) to exceed 128 bits. Or, if you use printable characters (ASCII 32-126), you could get away with 20 characters.
So when you're talking about users, you need at least 128 bits for a collision-free random key, or a 20-32 character long string, or a 128/8 = 16 byte binary representation.
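For reference, here is the birthday-bound arithmetic behind those numbers as a small Python sketch (the function name is illustrative):
import math

def collision_probability(n_items: int, bits: int) -> float:
    # Birthday bound: p ~ 1 - exp(-n^2 / (2 * 2^bits)), computed with
    # expm1 so very small probabilities are not rounded away.
    return -math.expm1(-(n_items ** 2) / (2.0 * 2.0 ** bits))

print(collision_probability(10_000, 32))             # ~0.012  (about 1%)
print(collision_probability(100_000, 32))             # ~0.69   (69%)
print(collision_probability(1_000_000_000, 64))       # ~0.027  (2.7%)
print(collision_probability(100_000_000_000, 96))     # ~6e-8   (one in ~16 million)
print(collision_probability(100_000_000_000, 128))    # ~1.5e-17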
I would like to store hashes for approximately 2 billion strings. For that purpose I would like to use as little storage as possible.
Consider an ideal hashing algorithm which returns hash as series of hexadecimal digits (like an md5 hash).
As far as I understand, this means the hash needs to be no less and no more than 8 symbols long, because such a hash would be capable of distinguishing 4+ billion (16^8) distinct strings.
So I'd like to know whether it is safe to cut the hash to a certain length to save space?
(hashes, of course, should not collide)
Yes/No/Maybe - I would appreciate answers with explanations or links to related studies.
P.S. I know I could test whether an 8-character hash is enough to store 2 billion strings, but that would mean comparing 2 billion hashes with their 2 billion truncated versions. It doesn't seem trivial to me, so I'd rather ask before doing that.
The hash is a number, not a string of hexadecimal characters. In the case of MD5, it is 128 bits, or 16 bytes when stored in its most compact form. If your problem still applies, you can certainly truncate the number (by coercing it into a smaller word, or by bit-shifting first). Good hash algorithms distribute entropy evenly across all bits.
Addendum:
Generally, whenever you deal with hashes you want to check whether the strings really match. That takes care of colliding hashes. The more you cut the hash, the more collisions you will get, so it's good to plan for that at this stage.
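A minimal Python sketch of that approach, assuming the original strings are stored alongside their truncated keys (the names here are illustrative):
import hashlib

def short_key(s: str) -> str:
    # First 8 hex characters (32 bits) of the MD5 hexdigest.
    return hashlib.md5(s.encode("utf-8")).hexdigest()[:8]

# Truncated key -> list of original strings that share it.
table = {}

def insert(s: str) -> None:
    table.setdefault(short_key(s), []).append(s)

def contains(s: str) -> bool:
    # Compare the actual strings, not just the truncated hashes, so
    # colliding keys never produce false positives.
    return s in table.get(short_key(s), [])

insert("some string")
print(contains("some string"))     # True
print(contains("another string"))  # False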
Whether or not it's safe to store x values in a hash domain capable of representing only 2x distinct hash values depends entirely on whether you can tolerate collisions.
Hash functions are effectively random number generators, so your 2 billion calculated hash values will be distributed evenly across the 4 billion possible results. This means that you are subject to the Birthday Problem.
In your case, if you calculate 2^31 (2 billion) hashes with only 2^32 (4 billion) possible hash values, the chance of at least two having the same hash (a collision) is very, very nearly 100%. (And the chance of three being the same is also very, very nearly 100%. And so on.) In fact, the expected number of colliding pairs is roughly n^2 / (2N) = (2^31)^2 / (2 × 2^32) = 2^29, i.e. about half a billion collisions.
If hash collisions are not a disaster in your case (as in Java's HashMap implementation, which deals with collisions by turning each bucket into a list of objects sharing the same hash key, at the cost of reduced performance), then maybe you can live with the certainty of a high number of collisions. But if you need uniqueness, then you need either a far, far larger hash domain, or a guaranteed-unique serial ID number assigned to each record, depending on your purposes.
Finally, note that Keccak is capable of generating any desired output length, so it makes little sense to spend CPU resources generating a long hash output only to trim it down afterwards. You should be able to tell your Keccak function to give only the number of bits you require. (Also note that a change in Keccak output length does not affect the initial output bits, so the result will be exactly the same as if you did a manual bitwise trim afterwards.)
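For instance, the standardized extendable-output functions of the Keccak family (SHAKE) are available in Python's hashlib and can be asked directly for the length you need; a short sketch:
import hashlib

data = b"some input"

# Ask SHAKE-128 directly for the number of bytes you need.
print(hashlib.shake_128(data).hexdigest(4))   # 4 bytes  = 8 hex characters
print(hashlib.shake_128(data).hexdigest(16))  # 16 bytes = 32 hex characters
# The shorter output is a prefix of the longer one: requesting fewer
# bytes does not change the initial output bits.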
Suppose I have only the first 16 characters of an MD5 hash. If I use a brute force attack, rainbow tables, or any other method to retrieve the original password, how many compatible candidates should I expect? 1? (I don't think so.) 10, 100, 1000, 10^12? Even a rough answer is welcome (for the number, but please be coherent with hash theory and methodology).
The output of MD5 is 16 bytes (128 bits). I suppose that you are talking about a hexadecimal representation, hence 32 characters. Thus, "16 characters" means "64 bits". You are considering MD5 with its output truncated to 64 bits.
MD5 accepts inputs up to 2^64 bits in length; assuming that MD5 behaves as a random function, this means that the 2^18446744073709551616 possible input strings will map more or less uniformly among the 2^64 outputs, hence the average number of candidates for a given output is about 2^18446744073709551552, which is close to 10^5553023288523357112.95.
However, if you consider that you can find at least one candidate, then this means that the space of possible passwords that you consider is much reduced. A rainbow table is a special kind of precomputed table which accepts a compact representation (at the expense of a relatively expensive lookup procedure), but if it covers N passwords, then this means that, at some point, someone could apply the hash function N times. In practice, this severely limits the size N. Assuming N = 2^60 (which means that the table builder had about one hundred NVidia GTX 580 GPUs and could run them for six months; also, the table will use quite a lot of hard disks), then, on average, only 1/16th of 64-bit outputs have a matching password in the table. For those passwords which are in the table, there is a 93.75% probability that there is no other password in the table which leads to the same output; if you prefer, if you find a matching password, then you will find, on average, 0.0625 other candidates (i.e. most of the time, no other candidate).
In brief, the answer to your question depends on the size N of the space of possible passwords that you consider (those which were covered during rainbow table construction); but, in practice with Earth-based technology, if you can find one matching password for a 64-bit output, chances are that you will not be able to find another (although there really are many others).
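For reference, the arithmetic behind those figures as a small Python sketch, using the assumed table size N = 2^60 from above:
import math

N = 2 ** 60        # passwords covered by the hypothetical rainbow table
outputs = 2 ** 64  # distinct 64-bit truncated hash values

# Fraction of 64-bit outputs expected to have at least one matching password:
print(1 - math.exp(-N / outputs))   # ~0.061, roughly 1/16

# For a password that IS in the table, the expected number of OTHER table
# passwords mapping to the same truncated output:
print((N - 1) / outputs)            # ~0.0625
# ...so the chance of no other candidate is exp(-0.0625) ~ 94%
# (the 93.75% quoted above is the linear approximation 1 - 1/16).
print(math.exp(-(N - 1) / outputs))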
You should never ever be able to get a password from a partial hash.
I have heard that when creating a hash, it's possible that if small files or amounts of data are used, the resulting hash is more likely to suffer from a collision. If that is true, is there a minimum "safe" amount of data that should be used to ensure this doesn't happen?
I guess the question could also be phrased as:
What is the smallest amount of data that can be safely and securely hashed?
A hash function accepts inputs of arbitrary (or at least very high) length, and produces a fixed-length output. There are more possible inputs than possible outputs, so collisions must exist. The whole point of a secure hash function is that it is "collision resistant", which means that while collisions must mathematically exist, it is very very hard to actually compute one. Thus, there is no known collision for SHA-256 and SHA-512, and the best known methods for computing one (by doing it on purpose) are so ludicrously expensive that they will not be applied soon (the whole US federal budget for a century would buy only a ridiculously small part of the task).
So, if it cannot be realistically done on purpose, you can expect not to hit a collision out of (bad) luck.
Moreover, if you limit yourself to very short inputs, there is a chance that there is no collision at all. E.g., if you consider 12-byte inputs: there are 2^96 possible sequences of 12 bytes. That's huge (more than can be enumerated with today's technology). Yet, SHA-256 will map each input to a 256-bit value, i.e. values in a much wider space (of size 2^256). We cannot prove it formally, but chances are that all those 2^96 hash values are distinct from each other. Note that this has no practical consequence: there is no measurable difference between not finding a collision because there is none, and not finding a collision because it is extremely improbable to hit one.
Just to illustrate how low the risks of collision are with SHA-256: consider your risk of being mauled by a gorilla escaped from a local zoo or private owner. Unlikely? Yes, but it still may conceivably happen: it seems that a gorilla escaped from the Dallas zoo in 2004 and injured four persons; another gorilla escaped from the same zoo in 2010. Assuming that there is only one rampaging gorilla every 6 years on the whole Earth (not only in the Dallas area) and you happen to be the unlucky chap who is in its path, out of a human population of 6.5 billion, then the risk of grievous-bodily-harm-by-gorilla can be estimated at about 1 in 2^43.7 per day. Now, take 10 thousand PCs and have them work on finding a collision for SHA-256. The chances of hitting a collision are close to 1 in 2^75 per day -- more than a billion times less probable than the angry ape thing. The conclusion is that if you fear SHA-256 collisions but do not keep a loaded shotgun with you at all times, then you are getting your priorities wrong. Also, do not mess with Texas.
There is no minimum input size. The SHA-256 algorithm is effectively a random mapping, and the collision probability doesn't depend on input length. Even a 1-bit input is 'safe'.
Note that the input is padded to a multiple of 512 bits (64 bytes) for SHA-256 (a multiple of 1024 bits for SHA-512). Taking a 12-byte input (as Thomas used in his example), after SHA-256 padding there are still only 2^96 possible 64-byte padded messages.
As an example, a 12-byte input Hello There! (0x48656c6c6f20546865726521) will be padded with a one bit, followed by 351 zero bits, followed by the 64-bit representation of the length of the input in bits (0x0000000000000060), to form a 512-bit padded message. This 512-bit message is used as the input for computing the hash.
More details can be found in RFC 4634, "US Secure Hash Algorithms (SHA and HMAC-SHA)", http://www.ietf.org/rfc/rfc4634.txt
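As a rough illustration, here is a small Python sketch that reproduces the padding step described above for the same 12-byte input (it only builds the padded message; the compression function itself is left to hashlib):
import hashlib

msg = b"Hello There!"          # 12 bytes = 96 bits
bit_len = len(msg) * 8

# One 1-bit plus seven 0-bits (the byte 0x80), then zero bytes, then the
# 64-bit big-endian length, bringing the total to a multiple of 64 bytes.
padding = b"\x80"
padding += b"\x00" * ((56 - (len(msg) + 1) % 64) % 64)
padding += bit_len.to_bytes(8, "big")   # 0x0000000000000060 for 96 bits

padded = msg + padding
print(len(padded))                      # 64 bytes = 512 bits
print(hashlib.sha256(msg).hexdigest())  # hash of the original 12-byte input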
No, message length does not affect the likelihood of a collision.
If that were the case, the algorithm would be broken.
You can try for yourself by running SHA against all one-byte inputs, then against all two-byte inputs and so on, and see if you get a collision. Probably not, because no one has ever found a collision for SHA-256 or SHA-512 (or at least they kept it a secret from Wikipedia)
The hash is 256 bits long; for any input space larger than 256 bits, collisions must exist.
You cannot compress something into a smaller thing without having collisions; that would defy mathematics.
Yes, because of the algorithm and the 2^256 output space there are a lot of different hashes, but they are not collision-free; that is impossible.
Depends very much on your application: if you were simply hashing "YES" and "NO" strings to send across a network to indicate whether you should give me a $100,000 loan, it would be a pretty big failure -- the domain of answers can't be that large, so someone could easily check observed hashes on the wire against a database of 'small input' hash outputs.
If you were to include the date, time, my name, my tax ID, and the amount requested, the amount of data being hashed still wouldn't amount to much, but the chances of that data being in precomputed hash tables would be pretty slim.
But I know of no research to point you to beyond my instincts. Sorry.
I need a hash algorithm that outputs an alphanumeric string that is at most 20 characters long. By "alphanumeric" I mean [a-zA-Z0-9].
Inputs are UUIDs in canonical form (example 550e8400-e29b-41d4-a716-446655440000)
Alternatively, is there a way to convert a SHA-1 or MD5 hash to a string with these limitations?
Thanks.
EDIT
Doesn't need to be cryptographically secure. Collisions make data inaccurate, but if they happen sporadically I can live with it.
EDIT 2
I don't know if truncating MD5 or SHA-1 would make collisions happen too often. Now I'm wondering whether it's better to truncate an MD5 value or a SHA-1 value to 20 characters.
Just clip the characters you don't need from the hash of the GUID. With a good hash function, the unpredictability of any part of the hash is proportional to the part's size. If you want, you can encode it in base 32 instead of the standard base-16 hex. Bear in mind that this will not significantly improve entropy per character (only by 25%, from 4 to 5 bits per character).
For non-cryptographic uses, it does not matter whether you truncate MD5, SHA-1, or SHA-2. None of them has any glaring deficiency in entropy.
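A minimal sketch along those lines in Python (the function name is illustrative): hash the UUID, base32-encode the digest (A-Z and 2-7, all within [a-zA-Z0-9]), and clip to 20 characters, which keeps 100 bits of the hash:
import base64
import hashlib
import uuid

def short_id(u: uuid.UUID, length: int = 20) -> str:
    # SHA-1 of the canonical string form, base32-encoded, clipped to length.
    digest = hashlib.sha1(str(u).encode("ascii")).digest()
    return base64.b32encode(digest).decode("ascii")[:length]

print(short_id(uuid.UUID("550e8400-e29b-41d4-a716-446655440000")))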