How many string characters should I read to get a good hash?

Here is a little conundrum for you: If you use a hash algorithm like CRC-64, then how many bytes of a string do you need to read to calculate a good hash? Let's say all your strings are at least 2 KB long; then it seems a waste of resources to use the whole string to calculate the hash, but just how many characters do you think is enough? Would just 8 ASCII characters be enough, since that equals 64 bits? Won't using more than 8 ASCII characters just be pointless? I want to know your thoughts on this.
Update:
By a 'good hash' I mean the point where the likelihood of hash collisions cannot be reduced any further by using even more bytes to calculate it.

If you use CRC-64 over 8 bytes or less then there is no point in using CRC-64: just use the 8 bytes "as is". A CRC does not have any added value unless the input is longer than the intended output.
As a general rule, if your hash function has an output of n bits then collisions begin to appear once you have accumulated about 2^(n/2) strings. In shorter words, if you use 64 bits, then collisions only become likely once you approach 2^32 (about 4 billion) strings. If you get a 160-bit or more output, then collisions are virtually unfeasible (you will encounter far fewer collisions than hardware failures such as the CPU catching fire). This assumes that the hash function is "perfect". If your hash function begins by selecting a few data bytes, then, necessarily, the bytes that you do not select cannot have any influence on the hash output, so you'd better use the "good" bytes -- which utterly depends on the kind of strings that you are hashing. There is no general rule here.
My advice would be to first try using a generic hash function over the whole string; I usually recommend MD4. MD4 is a cryptographic hash function, which has been utterly broken, but for a problem with no security involved, it is still very good at mixing data elements (cryptographically speaking, a CRC is so much more broken than MD4). MD4 has been reported to actually be faster than CRC-32 on some platforms, so you could give it a shot. On a basic PC (my 2.4 GHz Core2), an MD4 implementation works at about 700 MBytes/s, so we are talking about roughly 350,000 hashed 2 kB strings per second, which is not bad.
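The 2^(n/2) rule of thumb can be checked numerically with the usual birthday approximation. The following is only a sketch that assumes an ideal, uniformly distributed hash; the function name `collision_probability` is made up for illustration.

```python
import math

def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among num_items
    values drawn uniformly from a space of 2**hash_bits outputs.

    Uses the birthday approximation p ~= 1 - exp(-k*(k-1) / (2*N)).
    """
    space = 2 ** hash_bits
    exponent = -num_items * (num_items - 1) / (2 * space)
    # expm1 keeps precision when the probability is tiny
    return -math.expm1(exponent)

if __name__ == "__main__":
    for bits in (32, 64, 160):
        p = collision_probability(2 ** 32, bits)
        print(f"{bits}-bit hash, 2^32 strings: p ~= {p:.3g}")
```

For a real hash function the numbers can only be worse than this ideal case, so treat the result as a lower bound on the risk.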

What are the chances that the first 8 letters of two different strings are the same? Depending on what these strings are, it could be very high, in which case you'll definitely get hash collisions.
Hash the whole thing. A few kilobytes is nothing. Unless you actually have a need to save nanoseconds in your program, not hashing the full strings would be premature optimization.
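To make the prefix problem concrete, here is a small sketch comparing a hash over the first 8 bytes with a hash over the whole string. Python's standard library has no CRC-64, so BLAKE2b with an 8-byte digest is used as a stand-in 64-bit hash; the helper names and the sample strings are invented for this example.

```python
import hashlib

def hash64(data: bytes) -> int:
    """64-bit hash of data (BLAKE2b with an 8-byte digest, a stand-in for CRC-64)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def prefix_hash64(data: bytes, prefix_len: int = 8) -> int:
    """Hash only the first prefix_len bytes; later bytes cannot affect the result."""
    return hash64(data[:prefix_len])

# Two hypothetical 2 KB strings that share a common header
a = b"<html><head>" + b"a" * 2036
b = b"<html><head>" + b"b" * 2036

print(prefix_hash64(a) == prefix_hash64(b))  # True: identical first 8 bytes
print(hash64(a) == hash64(b))                # a match here would be a real 64-bit collision
```

Any two strings that agree on the sampled bytes collide with certainty, no matter how good the underlying hash is.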

Related

Hash that generates Decimal output for Swift

I want to hash a String into an object that has a numerical value (NSNumber/Int) as output instead of an alphanumeric value.
The problem is that after digging through Swift and some 3rd party libraries, I'm not able to find any library that meets our need.
I'm working on a Chat SDK and it takes an NSNumber/Int as a unique identifier to correlate a Chat Message with a Conversation Message.
My company's requirement is not to store any additional field in the database or change the schema that we have, which complicates things.
A neat solution my team came up with was some sort of hash function that generates a number.
func userIdToConversationNumber(id:String) -> NSNumber
We can use that function to convert a String to an NSNumber/Int. This Int should be produced by that function, and the probability of a collision should be negligible. Any suggestions on an approach?

The key calculation you need to perform is the birthday bound. My favorite table is the one in Wikipedia, and I reference it regularly when I'm designing systems like this one.
The table expresses how many items you can hash for a given hash size before you have a certain expectation of a collision. This is based on a perfectly uniform hash, which a cryptographic hash is a close approximation of.
So for a 64-bit integer, after hashing 6M elements, there is a 1-in-a-million chance that there was a single collision anywhere in that list. After hashing about 190M elements, there is a 1-in-a-thousand chance that there was a single collision. And after 5 billion elements, you should bet on a collision (50% chance).
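If the table is not at hand, its entries can be approximated by inverting the birthday approximation. This is a rough sketch for an ideal uniform hash; `items_for_probability` is a name chosen here, not something from a library.

```python
import math

def items_for_probability(hash_bits: int, probability: float) -> float:
    """Approximate number of hashed items at which the chance of at least
    one collision reaches `probability`, for an ideal hash of hash_bits bits.

    Inverts p ~= 1 - exp(-k^2 / (2N)) to k ~= sqrt(2N * ln(1 / (1 - p))).
    """
    space = 2.0 ** hash_bits
    return math.sqrt(2.0 * space * -math.log1p(-probability))

for p in (1e-6, 1e-3, 0.5):
    print(f"p = {p:g}: about {items_for_probability(64, p):.2e} items")
```

Changing the first argument shows how quickly the safe item count shrinks when bits are taken away, for example for flags.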
So it all comes down to how many elements you plan to hash and how bad it is if there is a collision (would it create a security problem? can you detect it? can you do anything about it like change the input data?), and of course how much risk you're willing to take for the given problem.
Personally, I'm a 1-in-a-million type of person for these things, though I've been convinced to go down to 1-in-a-thousand at times. (Again, this is not 1:1000 chance of any given element colliding; that would be horrible. This is 1:1000 chance of there being a collision at all after hashing some number of elements.) I would not accept 1-in-a-million in situations where an attacker can craft arbitrary things (of arbitrary size) for you to hash. But I'm very comfortable with it for structured data (email addresses, URLs) of constrained length.
If these numbers work for you, then what you want is a hash that is highly uniform in all its bits. And that's a SHA hash. I'd use a SHA-2 (like SHA-256) because you should always use SHA-2 unless you have a good reason not to. Since SHA-2's bits are all independent of each other (or at least that's its intent), you can select any number of its bits to create a shorter hash. So you compute a SHA-256, and take the top (or bottom) 64-bits as an integer, and that's your hash.
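The truncation step is short enough to sketch. The example is in Python rather than Swift, and the function name simply mirrors the `userIdToConversationNumber` signature from the question; the choice of UTF-8 encoding and big-endian byte order is an assumption, and any fixed choice works as long as every client uses the same one.

```python
import hashlib

def user_id_to_conversation_number(user_id: str) -> int:
    """Map a string ID to an unsigned 64-bit integer.

    Takes the first 8 bytes of the SHA-256 digest of the UTF-8 encoded ID
    and reads them as a big-endian unsigned integer.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=False)

number = user_id_to_conversation_number("alice@example.com")
print(number)
```

The result always lies in the range 0 to 2^64 - 1, so it needs an unsigned 64-bit type on the receiving side.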
As a rule, for modest sized things, you can get away with this in 64 bits. You cannot get away with this in 32 bits. So when you say "NSNumber/Int", I want you to mean explicitly "64-bit integer." For example, on a 32-bit platform, Swift's Int is only 32 bits, so I would use UInt64 or uint64_t, not Int or NSInteger. I recommend unsigned integers here because these are really unique bit patterns, not "numbers" (i.e. it is not meaningful to add or multiply them) and having negative values tends to be confusing in identifiers unless there is some semantic meaning to it.
Note that everything said about hashes here is also true of random numbers, if they're generated by a cryptographic random number generator. In fact, I generally use random numbers for these kinds of problems. For example, if I want clients to generate their own random unique IDs for messages, how many bits do I need to safely avoid collisions? (In many of my systems, you may not be able to use all the bits in your value; some may be used as flags.)
That's my general solution, but there's an even better solution if your input space is constrained. If your input space is smaller than 2^64, then you don't need hashing at all. Obviously, any Latin-1 string up to 8 characters can be stored in a 64-bit value. But if your input is even more constrained, then you can compress the data and get slightly longer strings. It only takes 5 bits to encode 26 symbols, so you can store a 12 letter string (of a single Latin case) in a UInt64 if you're willing to do the math. It's pretty rare that you get lucky enough to use this, but it's worth keeping in the back of your mind when space is at a premium.
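"Doing the math" for the 12-letter case could look like the sketch below. It assumes lowercase ASCII letters only and stores each letter as 1 to 26 so that shorter strings remain distinguishable; the names `pack_letters` and `unpack_letters` are illustrative.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
BITS_PER_SYMBOL = 5   # 2**5 = 32 codes, enough for 26 letters plus "empty"
MAX_LENGTH = 12       # 12 * 5 = 60 bits, which fits in a 64-bit integer

def pack_letters(text: str) -> int:
    """Pack up to 12 lowercase ASCII letters into one integer below 2**60.

    Each letter is stored as 1..26 so that code 0 can mean "no more letters".
    """
    if len(text) > MAX_LENGTH:
        raise ValueError("at most 12 letters fit in 60 bits")
    value = 0
    for ch in text:
        code = ALPHABET.index(ch) + 1  # raises ValueError for other characters
        value = (value << BITS_PER_SYMBOL) | code
    return value

def unpack_letters(value: int) -> str:
    """Inverse of pack_letters (only valid for values it produced)."""
    letters = []
    while value:
        code = value & ((1 << BITS_PER_SYMBOL) - 1)
        letters.append(ALPHABET[code - 1])
        value >>= BITS_PER_SYMBOL
    return "".join(reversed(letters))

packed = pack_letters("helloworldab")
print(packed, unpack_letters(packed))
```

Unlike a hash, this mapping is reversible and cannot collide, which is the whole point when the input space is small enough.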
I've built a lot of these kinds of systems, and I will say that eventually, we almost always wind up just making a longer identifier. You can make it work on a small identifier, but it's always a little complicated, and there is nothing as effective as just having more bits.... Best of luck till you get there.

Yes, you can create hashes that are collision resistant using a cryptographic hash function. The output of such a hash function is in bits if you follow the algorithm's specification. However, implementations will generally only return bytes or an encoding of the byte values. A hash does not return a number, as others have indicated in the comments.
It is relatively easy to convert such a hash into a number of 32 bits such as an Int or Int32. You just take the leftmost bytes of the hash and interpret those as an unsigned integer.
However, a cryptographic hash has a relatively large output size precisely to make sure that the chance of collisions is small. Collisions are subject to the birthday problem, which means that you only have to try about 2^(hLen/2) inputs, where hLen is the output length in bits, to create a collision within the generated set. E.g. you'd need 2^80 tries to create a collision of RIPEMD-160 hashes.
Now for most cryptographic hashes, certainly the common ones, the same rule applies. That means that for a 32-bit hash you'd only need about 2^16 hashes before a collision becomes likely. That's not good; 65536 tries are very easy to accomplish. And somebody may get lucky, e.g. after about 4,000 tries you'd already have roughly a 1 in 500 chance of a collision. That's no good.
So calculating a hash value to use as an ID is fine, but you'd need the full output of a hash function, e.g. 256 bits of SHA-2, to be very sure you don't have a collision. Otherwise you may need to use something like a serial number instead.
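How easy such a collision is to produce can be tried directly. This sketch truncates SHA-256 to 32 bits and hashes a simple counter until two inputs collide; the "id-N" input format and the function names are arbitrary choices for the demonstration.

```python
import hashlib

def truncated_hash32(text: str) -> int:
    """First 4 bytes of SHA-256, read as an unsigned 32-bit integer."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def find_collision() -> tuple[str, str, int]:
    """Hash "id-0", "id-1", ... until two inputs share a 32-bit value."""
    seen: dict[int, str] = {}
    counter = 0
    while True:
        candidate = f"id-{counter}"
        value = truncated_hash32(candidate)
        if value in seen:
            return seen[value], candidate, counter + 1
        seen[value] = candidate
        counter += 1

first, second, tries = find_collision()
print(f"{first!r} and {second!r} collide after {tries} hashes")
```

The birthday bound suggests the search should finish after a number of hashes on the order of 2^16, which takes well under a second on ordinary hardware.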

Does halving every SHA224 2 bytes to 1 byte to halve the hash length introduce a higher collision risk?

Let's say I have strings that need not be reversible and let's say I use SHA224 to hash it.
The hash of hello world is 2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b and its length is 56 characters in hexadecimal form.
What if I convert every two chars to its numerical representation and make a single byte out of them?
In Python I'd do something like this:
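shalist = list("2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b")
halved = []
for first_byte, next_byte in zip(shalist[0::2], shalist[1::2]):
    # Add the code points of the two hex characters to get one byte value
    halved.append(ord(first_byte) + ord(next_byte))
result = bytes(halved)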
The result will be \x98ek\x9d\x95\x96\x96\xc7\xcb\x9ckhf\x9a\xc7\xc9\xc8\x97\x97\x99\x97\xc9gd\x96im\x94. 28 bytes. Effectively halved the input.
Now, is there a higher hash collision risk by doing so?

The simple answer is pretty obvious: yes, it increases the chance of collision by as many powers of 2 as there are bits missing. For 56 bytes halved to 28 bytes, the chance of collision increases by a factor of 2^(28*8). That still leaves the chance of collision at 1:2^(28*8).
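One detail is worth checking in code: the 56-character value is the hex encoding of a 224-bit (28-byte) digest, so decoding the hex already yields 28 bytes without losing anything, whereas adding character codes throws information away. A small sketch, where `sum_pairs` is just a name for the scheme described in the question:

```python
sha_hex = "2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b"

# Lossless: two hex characters encode exactly one byte of the digest.
raw_digest = bytes.fromhex(sha_hex)
print(len(sha_hex), len(raw_digest))  # 56 characters, 28 bytes

def sum_pairs(hex_text: str) -> bytes:
    """The scheme from the question: add the code points of each character pair."""
    return bytes(ord(a) + ord(b) for a, b in zip(hex_text[0::2], hex_text[1::2]))

# Lossy: addition is commutative, so "2f" and "f2" map to the same byte.
print(sum_pairs("2f") == sum_pairs("f2"))  # True
print(raw_digest.hex() == sha_hex)         # True: the original is recoverable
```

Both results are 28 bytes long, so the summing scheme saves no space compared with the raw digest while making collisions more likely.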
Your use of that truncation can be still perfectly legit, depending what it is. Git for example shows only the first few bytes from a commit hash and for most practical purposes the short one works fine.
A "perfect" hash should retain a proportional amount of "effective" bits if you truncate it. For example 32 bits of SHA256 result should have the same "strength" as a 32-bit CRC, although there may be some special properties of CRC that make it more suitable for some purposes while the truncated SHA may be better for others.
If you're doing any kind of security with this, it will be difficult to prove your system secure; you're probably better off using a shorter but complete hash.
Let's shrink the size to make sense of it and use a 2-byte hash instead of 56. The original hash will have 65536 possible values, so if you hash more than that many strings you will surely get a collision. Halve that to 1 byte and you will get a collision after at most 256 strings hashed, regardless of whether you take the first or the second byte. So your chance of collision is 256 times greater (2^(1 byte * 8 bits)) and is 1:256.
Long hashes are used to make it truly impractical to brute-force them, even after long years of cryptanalysis. When MD5 was introduced in 1991 it was considered secure enough to use for certificate signing, in 2008 it was considered "broken" and not suitable for security-related use. Various cryptanalysis techniques can be developed to reduce the "effective" strength of hash and encryption algorithms, so the more spare bits there are (in an otherwise strong algorithm) the more effective bits should remain to keep the hash secure for all practical purposes.

Is it safe to cut the hash?

I would like to store hashes for approximately 2 billion strings. For that purpose I would like to use as little storage as possible.
Consider an ideal hashing algorithm which returns hash as series of hexadecimal digits (like an md5 hash).
As far as I understand the idea, this means that I need the hash to be exactly 8 symbols in length, because such a hash would be capable of representing 4+ billion (16 * 16 * 16 * 16 * 16 * 16 * 16 * 16) distinct strings.
So I'd like to know whether it is safe to cut the hash to a certain length to save space?
(hashes, of course, should not collide)
Yes/No/Maybe - I would appreciate answers with explanations or links to related studies.
P.S. - I know I can test whether an 8-character hash would be OK to store 2 billion strings. But I would need to compare 2 billion hashes with their 2 billion truncated versions. It doesn't seem trivial to me, so I'd better ask before I do that.

The hash is a number, not a string of hexadecimal digits (characters). In the case of MD5, it is 128 bits, or 16 bytes when stored in its efficient binary form. If your problem still applies, you can certainly consider truncating the number (either by coercing it into a word or by bit-shifting it first). Good hash algorithms distribute evenly over all bits.
Addendum:
Generally, whenever you deal with hashes, you want to check whether the strings really match. This takes care of the possibility of colliding hashes. The more you cut the hash, the more collisions you're going to get. But it's good to plan for that happening at this phase.

Whether or not it's safe to store x values in a hash domain only capable of representing 2x distinct hash values depends entirely on whether you can tolerate collisions.
Hash functions are effectively random number generators, so your 2 billion calculated hash values will be distributed evenly about the 4 billion possible results. This means that you are subject to the Birthday Problem.
In your case, if you calculate 2^31 (2 billion) hashes with only 2^32 (4 billion) possible hash values, the chance of at least two having the same hash (a collision) is very, very nearly 100%. (And the chance of three being the same is also very, very nearly 100%. And so on.) I can't find the formula for calculating the probable number of collisions based on these numbers, but I suspect it is a huge number.
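For an ideal uniform hash, the expected number of collisions can be estimated from the expected number of distinct outputs. The following is a sketch of that standard occupancy estimate; the function name is made up, and the float arithmetic loses precision when the item count is tiny compared with the hash space.

```python
import math

def expected_collisions(num_items: int, hash_bits: int) -> float:
    """Expected number of items that land on an already-used hash value.

    With N = 2**hash_bits equally likely outputs, the expected number of
    distinct outputs after k items is N * (1 - (1 - 1/N)**k), which is
    approximately N * (1 - exp(-k / N)). Everything else is a collision.
    """
    space = 2.0 ** hash_bits
    expected_distinct = space * -math.expm1(-num_items / space)
    return num_items - expected_distinct

print(f"{expected_collisions(2**31, 32):.3e}")
```

For 2^31 items in a 32-bit space this comes out in the hundreds of millions, which confirms the suspicion that the number is huge.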
If in your case hash collisions are not a disaster (such as in Java's HashMap implementation which deals with collisions by turning the hash target into a list of objects which share the same hash key, albeit at the cost of reduced performance) then maybe you can live with the certainty of a high number of collisions. But if you need uniqueness then you need either a far, far larger hash domain, or you need to assign each record a guaranteed-unique serial ID number, depending on your purposes.
Finally, note that Keccak is capable of generating any desired output length, so it makes little sense to spend CPU resources generating a long hash output only to trim it down afterwards. You should be able to tell your Keccak function to give only the number of bits you require. (Also note that a change in Keccak output length does not affect the initial output bits, so the result will be exactly the same as if you did a manual bitwise trim afterwards.)
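Python's hashlib exposes this through the SHAKE functions, the extendable-output functions that the SHA-3 standard builds on Keccak. The sketch below uses SHAKE-128 as a stand-in for "Keccak" in the paragraph above; the prefix property it demonstrates belongs to the extendable-output mode and is not claimed here for the fixed-length `sha3_*` functions.

```python
import hashlib

data = b"example string"

# SHAKE-128 lets the caller choose the output length in bytes.
short = hashlib.shake_128(data).digest(8)    # 64 bits
longer = hashlib.shake_128(data).digest(32)  # 256 bits

# The shorter output is a prefix of the longer one.
print(longer.startswith(short))
```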

If I use a composite hashing strategy for strings can I virtually eliminate collisions?

Ok so here's the use case. I have lots of somewhat lengthy (200-500 character) strings that I'd like to have a smaller deterministic hash for. Since I can store the full 160-bit SHA1 value in a mere 20 bytes, this yields an order of magnitude space improvement per string.
But of course one has to worry about collisions when hashing strings, even with a crypto hash with decent avalanche effects. I know the chances are infinitesimally small, but I'd like to be more conservative. If I do something like this:
hash(input) = CONCAT(HF1(input),HF2(input))
where HF1 is some suitable robust hashing f() and HF2 is another distinct but robust hashing f(). Does this effectively make the chance of a collision near impossible (At the cost of 40 bytes now instead of 20)?
NOTE: I am not concerned with the security/crypto implications of SHA-1 for my use case.
CLARIFICATION: the original question was posed about hashing the concatenated hash value, not about concatenating hashes; hashing the concatenation DOES NOT change the collision probabilities of the outer hash function.

Assuming "reasonable" hash functions, then by concatenating, all you're doing is creating a hash function with a larger output space. So yes, this reduces the probability of collision.
But either way, it's probably not worth worrying about. 2^320 is far larger than common estimates of the number of atoms in the observable universe (around 10^80, or roughly 2^266). So you only need to worry if you're expecting attackers.
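A concatenated hash of this kind might look like the sketch below. The question does not name HF2, so BLAKE2b with a 20-byte digest is picked here purely as an example of a second, unrelated function; `composite_hash` is likewise an illustrative name.

```python
import hashlib

def composite_hash(text: str) -> bytes:
    """Concatenate two different 160-bit digests into one 40-byte value."""
    data = text.encode("utf-8")
    hf1 = hashlib.sha1(data).digest()                     # 20 bytes
    hf2 = hashlib.blake2b(data, digest_size=20).digest()  # 20 bytes
    return hf1 + hf2

value = composite_hash("some 200-500 character string")
print(len(value), value.hex())
```

Two inputs only collide under this scheme if they collide under both functions at once, which is what enlarges the effective output space.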

I asked the wrong question initially. This was probably the question I was looking for:
Probability of SHA1 collisions
This was also illuminating
Understanding sha-1 collision weakness
I guess it's fair to ask the same thing about two hash functions whose concatenated size is smaller than 20 bytes, say two distinct 32-bit hashing functions: does concatenating those produce a collision probability that is small enough to ignore in practice, given that two (or even three) of them concatenated would still be smaller than SHA-1?

Hash length reduction?

I know that, given an md5/sha1 hash of a value, reducing it from X bits (e.g. 128) to Y bits (e.g. 64) increases the possibility of birthday attacks since information has been lost. Is there any easy-to-use tool/formula/table that will say what the probability of a "correct" guess will be when that length reduction occurs (compared to its original guess probability)?

Crypto is hard. I would recommend against trying to do this sort of thing. It's like cooking pufferfish: Best left to experts.
So just use the full length hash. And since MD5 is broken and SHA-1 is starting to show cracks, you shouldn't use either in new applications. SHA-2 is probably your best bet right now.

I would definitely recommend against reducing the bit count of hash. There are too many issues at stake here. Firstly, how would you decide which bits to drop?
Secondly, it would be hard to predict how the dropping of those bits would affect the distribution of outputs in the new "shortened" hash function. A (well-designed) hash function is meant to distribute inputs evenly across the whole of the output space, not a subset of it.
By dropping half the bits you are effectively taking a subset of the original hash function, which might not have nearly the desirable properties of a properly designed hash function, and may lead to further weaknesses.

Well, since every extra bit in the hash doubles the number of possible hashes, every time you shorten the hash by a bit, there are only half as many possible hashes and thus the chance of guessing that random number is doubled.
128 bits = 2^128 possibilities
thus
64 bits = 2^64 possibilities
so by cutting it in half, you are left with
2^64 / 2^128 = 2^-64
of the original possibilities (a fraction, not a percentage).
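The same arithmetic can be written down exactly with rational numbers. This sketch only covers the chance of a single random guess matching one specific value, assuming an ideal hash; both function names are invented here.

```python
from fractions import Fraction

def guess_probability(hash_bits: int) -> Fraction:
    """Chance that one random guess matches a specific hash_bits-bit value."""
    return Fraction(1, 2 ** hash_bits)

def truncation_factor(original_bits: int, truncated_bits: int) -> int:
    """How many times more likely a single guess becomes after truncation."""
    return 2 ** (original_bits - truncated_bits)

full = guess_probability(128)
half = guess_probability(64)
print(half / full == truncation_factor(128, 64))  # True: 2**64 times more likely
```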