Are there any 63-bit hashing algorithms?

Context
I'm implementing a new programming language which has two primitive types: hashes and ints. Hashes will be used to index symbolic data (think hash-consing) while the ints will be standard arithmetic ints. I'd like to have a representation where a value is always represented in a 64 bit machine word and use one bit to distinguish between whether it's a hash or an int. The language will therefore use 63 bit arithmetic for ints and I'd like to use a 63 bit hashing algorithm for the hashes.
I could clearly just mask off one bit after hashing values. But hashes will be combined and hashed again as part of modifying the symbolic data, and I'm afraid that masking would weaken the hashing algorithm and increase the risk of collisions. I don't need cryptographic-quality hashes, but I also don't want to weaken them unnecessarily.
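For concreteness, the masking approach could look roughly like the sketch below. It is only an illustration: the function names and the choice of the low bit as the tag are assumptions, and BLAKE2b from Python's standard library stands in for whatever 64-bit hash the language would really use.

```python
import hashlib

MASK63 = (1 << 63) - 1
MASK64 = (1 << 64) - 1

def hash63(data: bytes) -> int:
    """Hash bytes to 63 bits by masking a 64-bit BLAKE2b digest (stand-in hash)."""
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") & MASK63

def combine63(left: int, right: int) -> int:
    """Hash two 63-bit hashes into a new 63-bit hash, as when hash-consing a pair."""
    return hash63(left.to_bytes(8, "big") + right.to_bytes(8, "big"))

def tag_hash(h: int) -> int:
    """Pack a 63-bit hash into a 64-bit word; tag bit (lowest bit) is 1."""
    return (h << 1) | 1

def tag_int(i: int) -> int:
    """Pack a 63-bit two's-complement int into a 64-bit word; tag bit is 0."""
    return (i << 1) & MASK64

def is_hash(word: int) -> bool:
    """Return True if the 64-bit word carries a hash rather than an int."""
    return word & 1 == 1
```

Every combined hash is produced by re-hashing the full 16 bytes of its two children and masking only the final output, so the masking never feeds back into the hash function's internal state.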

Related

Non-cryptographic hash functions that are homomorphic with respect to concatenation

Adler32 and CRC have the property that f(a || b) can be computed inexpensively from f(a), f(b), and len(b). Are there any other common non-cryptographic hash functions with this property?
Context (to avoid XY problem) is that I am deduplicating strings by splitting them into chunks, which are indexed by their hash. An input string can then be represented as a sequence of chunks, concatenated. I'd like to use a hash function such that all representations of a string have the same hash, which can be computed directly from the chunk hashes without needing the underlying data, as it is being streamed in unspecified order and thus may not be available in the same place at any one time.
My design calls for roughly 2^32 chunks. Collisions are very expensive, but would not harm correctness. Based on that, I think that CRC64 would work, but I'm curious what my alternatives are. I wouldn't mind a 128 bit hash for future proofing (as in: dataset size may grow).
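To make the property concrete, here is a minimal sketch of combining two Adler-32 values without access to the underlying data. The function name `adler32_combine` is an assumption for illustration (Python's `zlib` module does not expose such a function); only `zlib.adler32` is a real library call.

```python
import zlib

MOD = 65521  # Adler-32 modulus: the largest prime below 2**16

def adler32_combine(adler_a: int, adler_b: int, len_b: int) -> int:
    """Return adler32(a + b) given adler32(a), adler32(b) and len(b)."""
    # Each checksum packs two 16-bit sums: low half A, high half B.
    a1, b1 = adler_a & 0xFFFF, adler_a >> 16
    a2, b2 = adler_b & 0xFFFF, adler_b >> 16
    # A starts at 1 for every stream, so one of the two initial 1s is removed.
    a = (a1 + a2 - 1) % MOD
    # Every byte of b sees a running A that is offset by (a1 - 1).
    b = (b1 + b2 + len_b * (a1 - 1)) % MOD
    return (b << 16) | a

x, y = b"hello, ", b"world"
combined = adler32_combine(zlib.adler32(x), zlib.adler32(y), len(y))
print(combined == zlib.adler32(x + y))
```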
The probability of at least one collision among all pairs of your 2^32 64-bit CRCs is about 0.4 (the expected number of colliding pairs is about 1/2). If that's too high for you, you can use a 128-bit CRC. That drops the probability of a collision to about 3x10^-20.
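The estimate behind these figures can be sketched in a few lines. This assumes an ideally uniform hash and uses the usual birthday approximation p ≈ 1 - exp(-n(n-1)/2 / 2^bits); the function name is made up for illustration.

```python
import math

def collision_probability(n_items: int, bits: int) -> float:
    """Approximate chance of at least one collision among n_items uniform hashes."""
    pairs = n_items * (n_items - 1) / 2
    # expm1 keeps precision when the probability is tiny (e.g. 128-bit hashes).
    return -math.expm1(-pairs / 2**bits)

for bits in (64, 128):
    print(bits, collision_probability(2**32, bits))
```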

Hash that generates Decimal output for Swift

I want to hash a String into a value that is numeric (NSNumber/Int) rather than alphanumeric.
The problem is that after digging through Swift and some third-party libraries, I haven't been able to find one that meets our needs.
I'm working on a Chat SDK, and it takes an NSNumber/Int as a unique identifier to correlate a Chat Message with a Conversation Message.
My company's requirement is not to store any additional field in the database or change our existing schema, which complicates things.
A neat solution my team came up with was some sort of hash function that generates a number.
func userIdToConversationNumber(id:String) -> NSNumber
We can use that function to convert a String to an NSNumber/Int. The Int should be produced by that function, and the probability of a collision should be negligible. Any suggestions on an approach?
The key calculation you need to perform is the birthday bound. My favorite table is the one in Wikipedia, and I reference it regularly when I'm designing systems like this one.
The table expresses how many items you can hash for a given hash size before you have a certain expectation of a collision. This is based on a perfectly uniform hash, which a cryptographic hash is a close approximation of.
So for a 64-bit integer, after hashing 6M elements, there is a 1-in-a-million chance that there was a single collision anywhere in that list. After hashing about 190M elements, there is a 1-in-a-thousand chance that there was a single collision. And after 5 billion elements, you should bet on a collision (50% chance).
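If you don't have the table at hand, its entries can be approximated by inverting the birthday estimate. This is a rough sketch that assumes a perfectly uniform hash; the function name is hypothetical.

```python
import math

def items_for_probability(bits: int, p: float) -> float:
    """Approximate number of uniform hashes at which the chance of any collision reaches p."""
    # Inverts p = 1 - exp(-n^2 / (2 * 2^bits)) for n.
    return math.sqrt(2 * 2**bits * -math.log1p(-p))

for p in (1e-6, 1e-3, 0.5):
    print(p, round(items_for_probability(64, p)))
```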
So it all comes down to how many elements you plan to hash and how bad it is if there is a collision (would it create a security problem? can you detect it? can you do anything about it like change the input data?), and of course how much risk you're willing to take for the given problem.
Personally, I'm a 1-in-a-million type of person for these things, though I've been convinced to go down to 1-in-a-thousand at times. (Again, this is not 1:1000 chance of any given element colliding; that would be horrible. This is 1:1000 chance of there being a collision at all after hashing some number of elements.) I would not accept 1-in-a-million in situations where an attacker can craft arbitrary things (of arbitrary size) for you to hash. But I'm very comfortable with it for structured data (email addresses, URLs) of constrained length.
If these numbers work for you, then what you want is a hash that is highly uniform in all its bits. And that's a SHA hash. I'd use a SHA-2 (like SHA-256) because you should always use SHA-2 unless you have a good reason not to. Since SHA-2's bits are all independent of each other (or at least that's its intent), you can select any number of its bits to create a shorter hash. So you compute a SHA-256, and take the top (or bottom) 64-bits as an integer, and that's your hash.
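As an illustration of the truncation step, here it is in Python rather than Swift (the same idea carries over to any SHA-256 implementation). The function name mirrors the one in the question and is otherwise an assumption, as is the choice of UTF-8 and of the top 8 bytes.

```python
import hashlib

def user_id_to_conversation_number(user_id: str) -> int:
    """Map a string ID to an unsigned 64-bit integer via truncated SHA-256."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Take the top 8 bytes (64 bits) of the 32-byte digest as a big-endian integer.
    return int.from_bytes(digest[:8], "big")

print(user_id_to_conversation_number("alice@example.com"))
```

The result always fits in a UInt64, but it may exceed the range of a signed 64-bit Int, which is one more reason to store it as unsigned.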
As a rule, for modest sized things, you can get away with this in 64 bits. You cannot get away with this in 32 bits. So when you say "NSNumber/Int", I want you to mean explicitly "64-bit integer." For example, on a 32-bit platform, Swift's Int is only 32 bits, so I would use UInt64 or uint64_t, not Int or NSInteger. I recommend unsigned integers here because these are really unique bit patterns, not "numbers" (i.e. it is not meaningful to add or multiply them) and having negative values tends to be confusing in identifiers unless there is some semantic meaning to it.
Note that everything said about hashes here is also true of random numbers, if they're generated by a cryptographic random number generator. In fact, I generally use random numbers for these kinds of problems. For example, if I want clients to generate their own random unique IDs for messages, how many bits do I need to safely avoid collisions? (In many of my systems, you may not be able to use all the bits in your value; some may be used as flags.)
That's my general solution, but there's an even better solution if your input space is constrained. If your input space is smaller than 2^64, then you don't need hashing at all. Obviously, any Latin-1 string up to 8 characters can be stored in a 64-bit value. But if your input is even more constrained, then you can compress the data and get slightly longer strings. It only takes 5 bits to encode 26 symbols, so you can store a 12 letter string (of a single Latin case) in a UInt64 if you're willing to do the math. It's pretty rare that you get lucky enough to use this, but it's worth keeping in the back of your mind when space is at a premium.
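"Doing the math" for the 12-letter case might look like the following sketch (in Python for brevity). It assumes lowercase a-z only and reserves the 5-bit code 0 to mean "no character", so shorter strings decode unambiguously; the function names are made up.

```python
def pack_letters(s: str) -> int:
    """Pack up to 12 lowercase a-z letters into 60 bits, 5 bits per letter."""
    if len(s) > 12 or not all("a" <= c <= "z" for c in s):
        raise ValueError("expected at most 12 lowercase a-z letters")
    value = 0
    for c in s:
        # Codes run from 1 to 26; 0 is reserved for "no character".
        value = (value << 5) | (ord(c) - ord("a") + 1)
    return value

def unpack_letters(value: int) -> str:
    """Reverse pack_letters."""
    chars = []
    while value:
        chars.append(chr((value & 31) - 1 + ord("a")))
        value >>= 5
    return "".join(reversed(chars))

packed = pack_letters("conversation")
print(packed, unpack_letters(packed))
```

Because this mapping is reversible, it cannot collide at all, unlike a hash.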
I've built a lot of these kinds of systems, and I will say that eventually, we almost always wind up just making a longer identifier. You can make it work on a small identifier, but it's always a little complicated, and there is nothing as effective as just having more bits.... Best of luck till you get there.
Yes, you can create hashes that are collision resistant using a cryptographic hash function. The output of such a hash function is specified in bits if you follow the algorithm's specification. However, implementations will generally only return bytes or an encoding of the byte values. A hash does not return a number, as others have indicated in the comments.
It is relatively easy to convert such a hash into a 32-bit number such as an Int or Int32. You just take the leftmost bytes of the hash and interpret them as an unsigned integer.
However, a cryptographic hash has a relatively large output size precisely to make sure that the chance of collisions is small. Collisions are subject to the birthday problem, which means that you only have to try about 2^(hLen/2) inputs, where hLen is the output size in bits, to create a collision within the generated set. E.g. you'd need 2^80 tries to create a collision of RIPEMD-160 hashes.
Now for most cryptographic hashes, certainly the common ones, the same rule applies. That means that for a 32-bit hash you'd only need about 2^16 hashes to have a good chance of a collision. That's not good; 65536 tries are very easy to accomplish. And somebody may get lucky, e.g. after only about 3,000 tries there is already roughly a 1 in 1,000 chance of a collision. That's no good.
So calculating a hash value to use as an ID is fine, but you'd need the full output of a hash function, e.g. 256 bits of SHA-2, to be very sure you don't have a collision. Otherwise you may need to use something like a serial number instead.

How do I truncate a 64-bit hash into a 32-bit hash? [duplicate]

We're trying to settle an internal debate on our dev team:
We're looking for a 64-bit PHP hash function. We found a PHP implementation of MurmurHash3, but MurmurHash3 is either 32-bit or 128-bit, not 64-bit.
Co-worker #1 believes that to produce a 64-bit hash from MurmurHash3, we can simply slice the first (or last, or any) 64 bits of the 128-bit hash and that it will be as collision-proof as a native 64-bit hash function.
Co-worker #2 believes that we must find a native 64-bit hash function to reduce collisions and that 64-bit slices of a 128-bit hash will not be as collision proof as a native 64-bit hash.
Who's correct?
Does the answer change if we take the first (or last, or any) 64-bits of a cryptographic hash like SHA1 instead of Murmur3?
If you had real random, uniformly distributed values, then "slicing" would yield exactly the same results as if you had started with the smaller value right from the start. To see why, consider this very simple example: Let's say your random generator outputs 3 random bits, but you only need one random bit to work with. Let's assume the output is
b1 b2 b3
The possible values are
000, 001, 010, 011, 100, 101, 110, 111
and all are to occur with equal probability of 1/8. Now whatever bit you slice from those three for your purpose - the first, second or third - the probability of having a '1' is always going to be 1/2, regardless of the position - and the same is true for a '0'.
You can easily scale this experiment to the 64 out of 128 bit case: regardless of which bits you slice, the probability of ending up with a one or a zero in a certain position is going to be one half. What this means is that if you had a sample taken from a uniformly distributed random variable, then slicing wouldn't make the probability for collisions more or less likely.
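The 3-bit example can be checked exhaustively with a few lines of code. This is just a sketch of the counting argument (the function name is made up); it enumerates every value of a given width and counts how often each bit position is 1.

```python
def ones_per_position(width: int) -> list[int]:
    """Count, over all values of the given bit width, how often each bit is 1."""
    counts = [0] * width
    for value in range(2**width):
        for pos in range(width):
            counts[pos] += (value >> pos) & 1
    return counts

print(ones_per_position(3))  # [4, 4, 4]: each bit is 1 in 4 of the 8 values
```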
Now a good question is whether random functions are really the best we can do to prevent collisions. But as it turns out, it can be shown that the probability of finding collisions increases whenever a function deviates from random.
Cryptographic hash functions: co-worker #1 wins
The problem in real life is that hash functions are not random at all, on the contrary, they are boringly deterministic. But a design goal of cryptographic hash functions is as follows: if we didn't know their initial state, then their output would be computationally indistinguishable from a real random function, that is there's no computationally efficient way to tell the difference between the hash output and real random values. This is why you'd consider a hash already as kind of broken if you can find a "distinguisher", a method to tell the hash from real random values with a probability higher than one half. Unfortunately, we can't really prove these properties for existing cryptographic hashes, but unless somebody breaks them, we may assume these properties hold with some confidence. Here is an example of a paper about a distinguisher for one of the SHA-3 submissions that illustrates the process.
To summarize, unless a distinguisher is found for a given cryptographic hash, slicing is perfectly fine and does not increase the probability of a collision.
Non-cryptographic hash functions: co-worker #2 might win
Non-cryptographic hashes do not have to satisfy the same set of requirements as cryptographic hashes do. They are usually defined to be very fast and satisfy certain properties "under sane/benevolent conditions", but they might easily fall short if somebody tries to maliciously manipulate them. A good example for what this means in practice is the computational complexity attack on hash table implementations (hashDoS) presented earlier this year. Under normal conditions, non-crypto hashes work perfectly fine, but their collision resistance may be severely undermined by some clever inputs. This can't happen with cryptographic hash functions, because their very definition requires them to be immune to all sorts of clever inputs.
Because it is possible, sometimes even quite easy, to find a distinguisher like above for the output of non-cryptographic hashes, we can immediately say that they do not qualify as cryptographic hash functions. Being able to tell the difference means that somewhere there is a pattern or bias in the output.
And this fact alone implies that they deviate more or less from a random function, and thus (after what we said above) collisions are probably more likely than they would be for random functions. Finally, since collisions already occur with higher probability for the full 128 bits, this will not get better with shorter outputs; collisions will probably be even more likely in that case.
tl;dr You're safe with a cryptographic hash function when truncating it. But you're better off with a "native" 64 bit cryptographic hash function compared to truncating a non-cryptographic hash with a larger output to 64 bits.
Due to the avalanche effect, a strong hash is one where a single bit of change in the source results in half the bits of the hash flipping on average. For a good hash, then, the "hashness" is evenly distributed, and so each section or slice is affected by an equal and evenly distributed amount of source bits, and therefore is just as strong as any other slice of the same bit length could be.
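The avalanche effect is easy to measure empirically. The sketch below uses SHA-256 purely as a convenient example of a strong hash (an assumption, since the question is about MurmurHash3): it flips each input bit in turn and counts how many of the 256 output bits change, which should average close to 128.

```python
import hashlib

def avalanche_bits(data: bytes, bit_index: int) -> int:
    """Count SHA-256 output bits that change when one input bit is flipped."""
    flipped = bytearray(data)
    flipped[bit_index // 8] ^= 1 << (bit_index % 8)
    h1 = int.from_bytes(hashlib.sha256(data).digest(), "big")
    h2 = int.from_bytes(hashlib.sha256(bytes(flipped)).digest(), "big")
    return bin(h1 ^ h2).count("1")

def mean_avalanche(data: bytes) -> float:
    """Average number of changed output bits over every single-bit flip of data."""
    n_bits = len(data) * 8
    return sum(avalanche_bits(data, i) for i in range(n_bits)) / n_bits

print(mean_avalanche(b"hash-consing"))
```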
I would agree with co-worker 1 as long as the hash has good properties and even distribution.
This question seems incomplete without this being mentioned:
Some hashes are provably perfect hashes for a specific class of inputs (eg., for input of length n for some reasonable value of n). If you truncate that hash then you are likely to destroy that property, in which case you are, by definition, increasing the rate of collisions from zero to non-zero and you have weakened the hash in that use case.
It's not the general case, but it's an example of a legitimate concern when truncating hashes.

If I use a composite hashing strategy for strings can I virtually eliminate collisions?

Ok so here's the use case. I have lots of somewhat lengthy (200-500 character) strings that I'd like to have a smaller deterministic hash for. Since I can store the full 160-bit SHA1 value in a mere 20 bytes, this yields an order of magnitude space improvement per string.
But of course one has to worry about collisions when hashing strings, even with a crypto hash with decent avalanche effects. I know the chances are infinitesimally small, but I'd like to be more conservative. If I do something like this:
hash(input) = CONCAT(HF1(input),HF2(input))
where HF1 is some suitable robust hashing f() and HF2 is another distinct but robust hashing f(). Does this effectively make the chance of a collision near impossible (At the cost of 40 bytes now instead of 20)?
NOTE: I am not concerned with the security/crypto implications of SHA-1 for my use case.
CLARIFICATION: the original question was posed about hashing the concatenated hash value, not concatenating hashes; hashing the concatenation DOES NOT change the collision probabilities of the outer hash function.
Assuming "reasonable" hash functions, then by concatenating, all you're doing is creating a hash function with a larger output space. So yes, this reduces the probability of collision.
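A minimal sketch of the concatenation, assuming SHA-1 for HF1 and a 160-bit BLAKE2b for HF2 (the second choice is purely illustrative; any distinct, well-behaved hash would do):

```python
import hashlib

def composite_hash(data: bytes) -> bytes:
    """Return CONCAT(HF1(data), HF2(data)) as 40 bytes."""
    hf1 = hashlib.sha1(data).digest()                      # 20 bytes
    hf2 = hashlib.blake2b(data, digest_size=20).digest()   # 20 bytes
    return hf1 + hf2

print(len(composite_hash(b"some 200-500 character string")))  # 40
```

Two inputs collide under `composite_hash` only if they collide under both functions at once.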
But either way, it's probably not worth worrying about. 2^320 is far more than the estimated number of particles in the observable universe (around 10^80, or roughly 2^266). So you only need to worry if you're expecting attackers.
I asked the wrong question initially. This was probably the question I was looking for:
Probability of SHA1 collisions
This was also illuminating
Understanding sha-1 collision weakness
I guess it's fair to ask what happens if I had two hash functions whose concatenated size was smaller than 20 bytes, say two distinct 32-bit hash functions. Would concatenating those produce a collision probability small enough to ignore in practice, given that two (or even three) of them concatenated would still be smaller than SHA-1?

512 bit hash vs 4 128bit hash

Interestingly I haven't found enough information regarding any test or experiment of collision chances of single 512bit hash like whirlpool versus concatenation of 4 128bit hashes like md5, sha1 etc.
It seems to me that four 128-bit hashes all colliding at once is less likely than a single 512-bit hash colliding when the data being hashed is fairly small, about 100 characters on average.
But it's just a guess with no basis, because I haven't performed any tests. What do you think?
Edit: it's like
512bit hash vs 128bit hash . 128bit hash . 128bit hash . 128bit hash (4 128bit hash concatenated)
Edit2
I want to use the hash for this index on url, with RAM usage in mind. The purpose is to minimize the possibility of collision, because I want to set the hash column as unique instead of the url column.
Edit3
Please note that purpose of this question is to find the way to minimize the possibility of collision. Having said that, Why I need to focus more on minimizing the possibility of collision? Here comes my Edit2 description which leads to finding the solution to use less RAM. So, interests are both in minimizing the collision and lower RAM usage. But prime focus of this question is lowering the possibility of collision.
It sounds like you want to compare the collision behaviour of:
hash512(x)
with the collision behaviour of:
hash128_a(x) . hash128_b(x) . hash128_c(x) . hash128_d(x)
where "." denotes concatenation, and hash128_a, hash128_b, etc. are four different 128-bit hash algorithms.
The answer is: it depends entirely on the properties of the individual hashes involved.
Consider, for instance that the 128-bit hash functions could be implemented as:
uint128_t hash128_a(T x) { return hash512(x)[ 0:127]; }
uint128_t hash128_b(T x) { return hash512(x)[128:255]; }
uint128_t hash128_c(T x) { return hash512(x)[256:383]; }
uint128_t hash128_d(T x) { return hash512(x)[384:511]; }
In which case, the performance would be identical.
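The same thought experiment can be made runnable. The sketch below takes SHA-512 as the `hash512` of the pseudocode (an assumption; any 512-bit hash works) and shows that concatenating the four slices simply reproduces the original digest.

```python
import hashlib

def hash512(x: bytes) -> bytes:
    """Stand-in 512-bit hash: SHA-512, returned as 64 bytes."""
    return hashlib.sha512(x).digest()

# Four "different" 128-bit hashes, each a 16-byte slice of hash512.
def hash128_a(x: bytes) -> bytes: return hash512(x)[0:16]
def hash128_b(x: bytes) -> bytes: return hash512(x)[16:32]
def hash128_c(x: bytes) -> bytes: return hash512(x)[32:48]
def hash128_d(x: bytes) -> bytes: return hash512(x)[48:64]

def concatenated(x: bytes) -> bytes:
    """hash128_a(x) . hash128_b(x) . hash128_c(x) . hash128_d(x)"""
    return hash128_a(x) + hash128_b(x) + hash128_c(x) + hash128_d(x)

url = b"http://example.com/some/page"
print(concatenated(url) == hash512(url))  # True
```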
The classical article to read on that question is due to Hoch and Shamir. It builds on previous discoveries, especially by Joux. The bottom line is the following: if you take four hash functions with a 128-bit output, and the four hash functions use the Merkle-Damgård construction, then finding a collision for the whole 512-bit output is not much more difficult than finding a collision for one of the hash functions. MD5, SHA-1... use the MD construction.
On the other hand, if some of your hash functions use a distinct structure, in particular with a wider running state, the concatenation could yield a stronger function. See the example from @Oli: if all four functions are SHA-512 with some surgery on the output, then the concatenated hash function could be plain SHA-512.
The only sure thing about the concatenation of four hash functions is that the result will be no less collision-resistant than the strongest of the four hash functions. This has been used within SSL/TLS, which, up to version 1.1, internally uses concurrently both MD5 and SHA-1 in an attempt to resist breaks on either.
512 bits is 512 bits. The only difference is in the difference in imperfections in the hashes. The best overall hash would be a 512 using the best algorithm available.
Edit to add clarification, because it's too long for a comment:
An ideal hash maps content uniformly onto x bits. If you have 4 (completely independent) x-bit hashes, that maps the file uniformly onto 4x bits; a 4x-bit hash still maps the same file uniformly onto 4x bits. 4x bits is 4x bits; as long as it's perfectly uniform, it doesn't matter whether it comes from one (4x) hash function or 4 (x). However, no hash can be completely ideal, so you want the most uniform obtainable distribution, and if you use 4 different functions, only 1 can be the closest to optimal so you have x optimal bits and 3x suboptimal, whereas a single algorithm can cover the entire 4x space with the most optimal distribution.
I suppose it is possible that enough larger algorithms could have subsets of bits that are more uniformly distributed than a single 512, and could be combined to get more uniformity, but that seems like it would be a great deal extra research and implementation for little potential benefit.
If you are comparing the concatenation of four different 'ideal' 128-bit hashing algorithms with one ideal 512-bit hashing algorithm, then yes, both methods will give you the same probability of a collision. Using md5 would make it easier to crack a hash though. If an attacker knew, for example, that you were doing md5 + md5 w/ salt + md5 with another salt, then that would be much easier to crack using an md5 collision attack. Look here for more information about hash functions that have known attacks.