How to reduce hash value's length? - hash

I'd like to squeeze or compress the hash value produced by MD5 or SHA1 in a server-side application so that the client can decompress or de-squeeze it. Is this possible? It's a usability issue for my application.

No, hash values cannot be compressed. By design their bits are highly random and have maximum entropy, so there is no redundancy to compress.
If you want to make the hash values easier to read for users you can use different tricks, such as:
Displaying fewer digits. Instead of 32 digits just show 16.
Using a different base. For instance, if you used base 62, with all the uppercase and lowercase letters plus the digits 0-9 as symbols, you could show a 128-bit hash in 22 letters and digits instead of 32 hex digits (see the sketch after this list), since
log_62(2^128) ≈ 21.5
Adding whitespace or punctuation. You'll commonly see CD keys printed with dashes like AX7T4-BZ41O-JK3FF-QOZ96. It's easier for users to read this than 20 digits all jammed together.
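As a rough Python illustration of the base-62 and dash-grouping tricks (the alphabet, helper name and input below are just examples, not anything prescribed above):

import hashlib
import string

ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase  # 62 symbols

def to_base62(n):
    # Repeatedly divide by 62 and collect the remainders as digits.
    digits = []
    while n:
        n, r = divmod(n, 62)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits)) or ALPHABET[0]

digest = hashlib.md5(b"example input").digest()              # 16 bytes = 128 bits
short = to_base62(int.from_bytes(digest, "big"))             # at most 22 characters
grouped = "-".join(short[i:i + 5] for i in range(0, len(short), 5))
print(short)     # the 128-bit value in base 62
print(grouped)   # the same string broken into dash-separated groups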

Hash values are already quite short, and their bits are essentially random and highly varied, so attempting to compress them is difficult and inefficient. If you want to save space, truncating the value can help, but keep in mind that doing so shrinks the output space and therefore increases the chance of collisions.

Related

Does halving every 2 bytes of a SHA224 hash to 1 byte, to halve the hash length, introduce a higher collision risk?

Let's say I have strings that need not be reversible and let's say I use SHA224 to hash them.
The hash of hello world is 2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b and its length is 56 bytes.
What if I convert every two chars to their numerical representations and make a single byte out of them?
In Python I'd do something like this:
shalist = list("2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b")
result = ""
for first_byte, next_byte in zip(shalist[0::2], shalist[1::2]):
    # Add the character codes of each pair of hex digits and keep a single character.
    result += chr(ord(first_byte) + ord(next_byte))
The result will be \x98ek\x9d\x95\x96\x96\xc7\xcb\x9ckhf\x9a\xc7\xc9\xc8\x97\x97\x99\x97\xc9gd\x96im\x94. 28 bytes. Effectively halved the input.
Now, is there a higher hash collision risk by doing so?
The simple answer is pretty obvious: yes, it increases the chance of collision by as many powers of 2 as there are bits missing. For 56 bytes halved to 28 bytes, the chance of collision is increased by a factor of 2^(28*8). That still leaves the chance of collision at 1:2^(28*8).
Your use of that truncation can still be perfectly legit, depending on what it is for. Git, for example, shows only the first few bytes of a commit hash, and for most practical purposes the short form works fine.
A "perfect" hash should retain a proportional amount of "effective" bits if you truncate it. For example 32 bits of SHA256 result should have the same "strength" as a 32-bit CRC, although there may be some special properties of CRC that make it more suitable for some purposes while the truncated SHA may be better for others.
If you're doing any kind of security with this it will be difficult to prove your system; you're probably better off using a shorter but complete hash.
Let's shrink the sizes to make sense of it and use a 2-byte hash instead of 56. The original hash will have 65536 possible values, so if you hash more than that many strings you will surely get a collision. Halve that to 1 byte and you will get a collision after at most 256 strings hashed, regardless of whether you take the first or the second byte. So your chance of collision is 256 times greater (2^(1 byte * 8 bits)) and is 1:256.
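To see that effect in practice, here is a small throwaway Python sketch (my own illustration, not a benchmark) that truncates SHA-224 digests and counts how many random inputs it takes to hit a collision:

import hashlib
import os

def inputs_until_collision(nbytes):
    # Hash random inputs, keep only the first nbytes of each SHA-224 digest,
    # and count how many inputs were hashed before two truncated digests matched.
    seen = set()
    count = 0
    while True:
        count += 1
        truncated = hashlib.sha224(os.urandom(16)).digest()[:nbytes]
        if truncated in seen:
            return count
        seen.add(truncated)

print(inputs_until_collision(1))  # 256 possible values: typically collides within a few dozen inputs
print(inputs_until_collision(2))  # 65536 possible values: typically collides within a few hundred inputs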
Long hashes are used to make it truly impractical to brute-force them, even after long years of cryptanalysis. When MD5 was introduced in 1991 it was considered secure enough to use for certificate signing, in 2008 it was considered "broken" and not suitable for security-related use. Various cryptanalysis techniques can be developed to reduce the "effective" strength of hash and encryption algorithms, so the more spare bits there are (in an otherwise strong algorithm) the more effective bits should remain to keep the hash secure for all practical purposes.

Is it safe to cut the hash?

I would like to store hashes for approximately 2 billion strings. For that purpose I would like to use as little storage as possible.
Consider an ideal hashing algorithm which returns the hash as a series of hexadecimal digits (like an MD5 hash).
As far as I understand the idea, this means that I need the hash to be no less and no more than 8 symbols in length, because such a hash would be capable of representing 4+ billion (16 * 16 * 16 * 16 * 16 * 16 * 16 * 16) distinct strings.
So I'd like to know whether it is safe to cut the hash to a certain length to save space?
(hashes, of course, should not collide)
Yes/No/Maybe - I would appreciate answers with explanations or links to related studies.
P.S. - I know I can test whether an 8-character hash would be OK for storing 2 billion strings, but that means comparing 2 billion hashes with their 2 billion cut versions. It doesn't seem trivial to me, so I'd rather ask before I do that.
The hash is a number, not a string of hexadecimal digits (characters). In the case of MD5, it is 128 bits or 16 bytes when stored in its efficient binary form. If your problem still applies, you can certainly consider truncating the number (by either coercing it into a smaller word or bit-shifting it first). Good hash algorithms distribute evenly across all bits.
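A minimal sketch of truncating the number rather than the hex string, assuming Python 3 (the input string is just an example):

import hashlib

digest = hashlib.md5(b"some string").digest()   # 16 bytes = 128 bits
n = int.from_bytes(digest, "big")
top32 = n >> (128 - 32)                         # keep the top 32 bits by shifting first
low32 = n & 0xFFFFFFFF                          # or keep the low 32 bits with a mask
print(top32, low32)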
Addendum:
Generally, whenever you deal with hashes you want to check whether the strings really match; this takes care of the possibility of colliding hashes. The more you cut the hash, the more collisions you're going to get, so it's good to plan for that at this stage.
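One way to plan for it, sketched below under the assumption that the original strings are kept around, is to store each string under its truncated hash and compare the strings themselves on lookup:

import hashlib

table = {}  # truncated hash -> list of original strings that share it

def truncated_hash(s, nbytes=4):
    # 4 bytes corresponds to the 8-hex-digit truncation discussed in the question.
    return hashlib.md5(s.encode("utf-8")).digest()[:nbytes]

def add(s):
    table.setdefault(truncated_hash(s), []).append(s)

def contains(s):
    # A matching truncated hash is not enough; compare the strings themselves.
    return s in table.get(truncated_hash(s), [])

add("hello world")
print(contains("hello world"), contains("something else"))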
Whether or not it's safe to store x values in a hash domain only capable of representing 2x distinct hash values depends entirely on whether you can tolerate collisions.
Hash functions are effectively random number generators, so your 2 billion calculated hash values will be distributed evenly about the 4 billion possible results. This means that you are subject to the Birthday Problem.
In your case, if you calculate 2^31 (2 billion) hashes with only 2^32 (4 billion) possible hash values, the chance of at least two having the same hash (a collision) is very, very nearly 100%. (And the chance of three being the same is also very, very nearly 100%. And so on.) I can't find the formula for calculating the probable number of collisions based on these numbers, but I suspect it is a huge number.
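For a rough sense of scale, a standard birthday-problem approximation puts the expected number of colliding pairs when hashing n items into d possible values at about n*(n-1)/(2*d); plugging in the numbers above:

n = 2**31                        # ~2 billion strings
d = 2**32                        # ~4 billion possible 8-hex-digit values
print(n * (n - 1) // (2 * d))    # roughly 536 million expected colliding pairs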
If in your case hash collisions are not a disaster (as in Java's HashMap implementation, which deals with collisions by turning each bucket into a list of objects sharing the same hash key, albeit at the cost of reduced performance), then maybe you can live with the certainty of a high number of collisions. But if you need uniqueness, then you need either a far, far larger hash domain, or you need to assign each record a guaranteed-unique serial ID number, depending on your purposes.
Finally, note that Keccak is capable of generating any desired output length, so it makes little sense to spend CPU resources generating a long hash output only to trim it down afterwards. You should be able to tell your Keccak function to give only the number of bits you require. (Also note that a change in Keccak output length does not affect the initial output bits, so the result will be exactly the same as if you did a manual bitwise trim afterwards.)
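Keccak's extendable-output variants are exposed in Python's hashlib as shake_128 and shake_256 (assuming Python 3.6 or later); a minimal sketch:

import hashlib

h = hashlib.shake_256(b"some input")
print(h.hexdigest(4))   # ask for exactly 4 bytes -> 8 hex characters
print(h.hexdigest(8))   # ask for 8 bytes; the first 4 bytes match the line above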

is it possible to retrieve a password from a (partial) MD5 hash?

Suppose I have only the first 16 characters of an MD5 hash. If I use a brute-force attack, rainbow tables, or any other method to retrieve the original password, how many compatible candidates should I expect? 1 (I don't think so)? 10, 100, 1000, 10^12? Even a rough answer is welcome (for the number, but please stay coherent with hash theory and methodology).
The output of MD5 is 16 bytes (128 bits). I suppose that you are talking about a hexadecimal representation, hence 32 characters. Thus, "16 characters" means "64 bits". You are considering MD5 with its output truncated to 64 bits.
MD5 accepts inputs up to 2^64 bits in length; assuming that MD5 behaves as a random function, this means that the 2^18446744073709551616 possible input strings will map more or less uniformly among the 2^64 outputs, hence the average number of candidates for a given output is about 2^18446744073709551552, which is close to 10^5553023288523357112.95.
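As a quick sanity check on that last conversion (just the arithmetic, not part of the argument):

from math import log10

print((2**64 - 64) * log10(2))   # about 5.553023e18, i.e. roughly 10^5553023288523357113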
However, if you consider that you can find at least one candidate, then this means that the space of possible passwords that you consider is much reduced. A rainbow table is a special kind of precomputed table which admits a compact representation (at the expense of a relatively expensive lookup procedure), but if it covers N passwords, then this means that, at some point, someone had to apply the hash function N times. In practice, this severely limits the size N. Assuming N = 2^60 (which means that the table builder had about one hundred NVidia GTX 580 GPUs and could run them for six months; also, the table will use quite a lot of hard disks), then, on average, only 1/16th of 64-bit outputs have a matching password in the table. For those passwords which are in the table, there is a 93.75% probability that there is no other password in the table which leads to the same output; if you prefer, if you find a matching password, then you will find, on average, 0.0625 other candidates (i.e. most of the time, no other candidate).
In brief, the answer to your question depends on the size N of the space of possible passwords that you consider (those which were covered during rainbow table construction); but, in practice with Earth-based technology, if you can find one matching password for a 64-bit output, chances are that you will not be able to find another (although there are really many others).
You should never ever be able to get a password from a partial hash.

Hash algorithm with alphanumeric output of 20 characters max

I need a hash algorithm that outputs an alphanumeric string that is at most 20 characters long. By "alphanumeric" I mean [a-zA-Z0-9].
Inputs are UUIDs in canonical form (example 550e8400-e29b-41d4-a716-446655440000)
Alternatively, is there a way to convert a SHA1 or MD5 hash to a string with these limitations?
Thanks.
EDIT
Doesn't need to be cryptographically secure. Collisions make data inaccurate, but if they happen sporadically I can live with it.
EDIT 2
I don't know if truncating MD5 or SHA1 would make collisions happen too often. Now I'm wondering whether it's better to truncate an MD5 value or a SHA1 value to 20 chars.
Just clip the characters you don't need from the hash of the GUID. With a good hash function, the unpredictability of any part of the hash is proportional to the part's size. If you want, you can encode it base 32 instead of the standard hex base 16. Bear in mind that this will not significantly improve entropy per character (only by 25%).
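A minimal sketch of that approach, assuming Python 3 (SHA1 and base32 are chosen purely for illustration):

import base64
import hashlib
import uuid

u = uuid.UUID("550e8400-e29b-41d4-a716-446655440000")
digest = hashlib.sha1(u.bytes).digest()          # 20 bytes
# Base32 uses only [A-Z2-7], which satisfies the [a-zA-Z0-9] requirement.
encoded = base64.b32encode(digest).decode("ascii")
print(encoded[:20])                              # clip to 20 characters (~100 bits of the hash)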
For non-cryptographic uses, it does not matter whether you truncate MD5, SHA1 or SHA2. None of them has any glaring deficiency in entropy.

Hash length reduction?

I know that, given say an md5/sha1 of a value, reducing it from X bits (e.g. 128) to Y bits (e.g. 64) increases the possibility of birthday attacks since information has been lost. Is there any easy-to-use tool/formula/table that will say what the probability of a "correct" guess will be when that length reduction occurs (compared to its original guess probability)?
Crypto is hard. I would recommend against trying to do this sort of thing. It's like cooking pufferfish: Best left to experts.
So just use the full length hash. And since MD5 is broken and SHA-1 is starting to show cracks, you shouldn't use either in new applications. SHA-2 is probably your best bet right now.
I would definitely recommend against reducing the bit count of hash. There are too many issues at stake here. Firstly, how would you decide which bits to drop?
Secondly, it would be hard to predict how the dropping of those bits would affect the distribution of outputs in the new "shortened" hash function. A (well-designed) hash function is meant to distribute inputs evenly across the whole of the output space, not a subset of it.
By dropping half the bits you are effectively taking a subset of the original hash function, which might not have nearly the desirable properties of a properly-designed hash function, and may lead to further weaknesses.
Well, since every extra bit in the hash doubles the number of possible hashes, every time you shorten the hash by a bit there are only half as many possible hashes, and thus the chance of guessing that random number doubles.
128 bits = 2^128 possibilities
64 bits = 2^64 possibilities
So by cutting it in half you keep only 2^64 / 2^128 = 1 / 2^64 of the possibilities, and a correct guess becomes 2^64 times more likely.
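The same relationship in a couple of lines of Python, just to make the scale concrete:

for bits in (128, 96, 64):
    # Chance of guessing a uniformly random value of this many bits in one try.
    print(bits, "bits: 1 in 2^%d =" % bits, 1 / 2**bits)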