Is it done in O(1) or O(n) or somewhere in between? Is there any disadvantage to computing the hash of a very large object vs a small one? If it matters, I'm using Python.
Generally speaking, computing a hash will be O(1) for "small" items and O(N) for "large" items (where "N" denotes the size of an item's key). The precise dividing line between small and large varies, but is typically somewhere in the general vicinity of the size of a register (e.g., 32 bits on a 32-bit machine, 64 bits on a 64-bit machine). This can also depend on the input type--for example, integer types up on the register size all hashing with constant complexity, but strings taking time proportional to the size in bytes, right down to a single character (i.e., a two-character string taking roughly twice the time of a single character string).
Once you've computed the hash, accessing the hash table has expected constant complexity, but can be as bad as O(N) in the worst case (but this is a different "N"--the number of items inserted in the table, not the size of an individual key).
The real answer is it depends. You didn't specify what hash function you are interested in. When we are talking about cryptographic hash like SHA256, then complexity is O(n). When we are talking about hash function that take last two digits of phone number, then it will be O(1). Hash functions that are used in hash tables tend to be optimized for speed and thus are closer to O(1).
For further reference on hash tables see this page from python wiki on Time Complexity.
Most of the time your hash is going to compute in access at O(1). However, if it is a really bad hash where every value has the same hash, it will be O(n) worst case.
The more objects associated to the hash is equivalent to more collisions.
Related
Adler32 and CRC have the property that f(a || b) can be computed inexpensively from f(a), f(b), and len(b). Are there any other common non-cryptographic hash functions with this property?
Context (to avoid XY problem) is that I am deduplicating strings by splitting them into chunks, which are indexed by their hash. An input string can then be represented as a sequence of chunks, concatenated. I'd like to use a hash function such that all representations of a string have the same hash, which can be computed directly from the chunk hashes without needing the underlying data, as it is being streamed in unspecified order and thus may not be available in the same place at any one time.
My design calls for roughly 2^32 chunks. Collisions are very expensive, but would not harm correctness. Based on that, I think that CRC64 would work, but I'm curious what my alternatives are. I wouldn't mind a 128 bit hash for future proofing (as in: dataset size may grow).
The probability of one collision among all pairs of your 232 64-bit CRCs is about 1/2. If that's too high for you, you can use a 128-bit CRC. That drops the probability of one collision to 3x10-20.
I'm trying to understand hash tables, and from what I've seen the modulo operator is used to select which bucket a key will be placed in. I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation. Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
I'm sure I'm misunderstanding this. Can someone explain?
Don't you get a random number after doing modulo on a hashed number?
It depends on the hash function.
Say you have an identify hash for numbers - h(n) = n - then if the keys being hashed are generally incrementing numbers (perhaps with an occasional ommision), then after hashing they'll still generally hit successive buckets (wrapping at some point from the last bucket back to the first), with low collision rates overall. Not very random, but works out well enough. If the keys are random, it still works out pretty well - see the discussion of random-but-repeatable hashing below. The problem is when the keys are neither roughly-incrementing nor close-to-random - then an identity hash can provide terrible collision rates. (You might think "this is a crazy bad example hash function, nobody would do this; actually, most C++ Standard Library implementations' hash functions for integers are identity hashes).
On the other hand, if you have a hash function that say takes the address of the object being hashed, and they're all 8 byte aligned, then if you take the mod and the bucket count is also a multiple of 8, you'll only ever hash to every 8th bucket, having 8 times more collisions than you might expect. Not very random, and doesn't work out well. But, if the number of buckets is a prime, then the addresses will tend to scatter much more randomly over the buckets, and things will work out much better. This is the reason the GNU C++ Standard Library tends to use prime numbers of buckets (Visual C++ uses power-of-two sized buckets so it can utilise a bitwise AND for mapping hash values to buckets, as AND takes one CPU cycle and MOD can take e.g. 30-40 cycles - depending on your exact CPU - see here).
When all the inputs are known at compile time, and there's not too many of them, then it's generally possible to create a perfect hash function (GNU gperf software is designed specifically for this), which means it will work out a number of buckets you'll need and a hash function that avoids any collisions, but the hash function may take longer to run than a general purpose function.
People often have a fanciful notion - also seen in the question - that a "perfect hash function" - or at least one that has very few collisions - in some large numerical hashed-to range will provide minimal collisions in actual usage in a hash table, as indeed this stackoverflow question is about coming to grips with the falsehood of this notion. It's just not true if there are still patterns and probabilities in the way the keys map into that large hashed-to range.
The gold standard for a general purpose high-quality hash function for runtime inputs is to have a quality that you might call "random but repeatable", even before the modulo operation, as that quality will apply to the bucket selection as well (even using the dumber and less forgiving AND bit-masking approach to bucket selection).
As you've noticed, this does mean you'll see collisions in the table. If you can exploit patterns in the keys to get less collisions that this random-but-repeatable quality would give you, then by all means make the most of that. If not, the beauty of hashing is that with random-but-repeatable hashing your collisions are statistically related to your load factor (the number of stored elements divided by the number of buckets).
As an example, for separate chaining - when your load factor is 1.0, 1/eć(~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) have one element, 1/(2e) or ~18.4% two elements, 1/(3!e) about 6.1% three elements, 1/(4!e) or ~1.5% four elements, 1/(5!e) ~.3% have five etc.. - the average chain length from non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant time operations.
I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation.
This is still true post-modulo. Minimising the same result means each post-modulo value has (about) the same number of keys mapping to it. We're particularly concerned about in-use keys stored in the table, if there's a non-uniform statistical distribution to the use of keys. With a hash function that exhibits the random-but-repeatable quality, there will be random variation in post-modulo mapping, but overall they'll be close enough to evenly balanced for most practical purposes.
Just to recap, let me address this directly:
Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
So:
random is good: if you get something like the random-but-repeatable hash quality, then your average hash collisions will statistically be capped at low levels, and in practice you're unlikely to ever see a particularly horrible collision chain, provided you keep the load factor reasonable (e.g. <= 1.0)
that said, your "near-perfect hash function...between 0 and 100,000" may or may not be high quality, depending on whether the distribution of values has patterns in it that would produce collisions. When in doubt about such patterns, use a hash function with the random-but-repeatable quality.
What would happen if you took a random number instead of using a hash function? Then doing the modulo on it? If you call rand() twice you can get the same number - a proper hash function doesn't do that I guess, or does it? Even hash functions can output the same value for different input.
This comment shows you grappling with the desirability of randomness - hopefully with earlier parts of my answer you're now clear on this, but anyway the point is that randomness is good, but it has to be repeatable: the same key has to produce the same pre-modulo hash so the post-modulo value tells you the bucket it should be in.
As an example of random-but-repeatable, imagine you used rand() to populate a uint32_t a[256][8] array, you could then hash any 8 byte key (e.g. including e.g. a double) by XORing the random numbers:
auto h(double d) {
uint8_t i[8];
memcpy(i, &d, 8);
return a[i[0]] ^ a[i[1]] ^ a[i[2]] ^ ... ^ a[i[7]];
}
This would produce a near-ideal (rand() isn't a great quality pseudo-random number generator) random-but-repeatable hash, but having a hash function that needs to consult largish chunks of memory can easily be slowed down by cache misses.
Following on from what [Mureinik] said, assuming you have a perfect hash function, say your array/buckets are 75% full, then doing modulo on the hashed function will probably result in a 75% collision probability. If that's true, I thought they were much better. Though I'm only learning about how they work now.
The 75%/75% thing is correct for a high quality hash function, assuming:
closed hashing / open addressing, where collisions are handled by finding an alternative bucket, or
separate chaining when 75% of buckets have one or more elements linked therefrom (which is very likely to mean the load factor (which many people may think of when you talk about how "full" the table is) is already significantly more than 75%)
Regarding "I thought they were much better." - that's actually quite ok, as evidenced by the percentages of colliding chain lengths mentioned earlier in my answer.
I think you have the right understanding of the situation.
Both the hash function and the number of buckets affect the chance of collisions. Consider, for example, the worst possible hash function - one that returns a constant value. No matter how many buckets you have, all the entries will be lumped to the same bucket, and you'd have a 100% chance of collision.
On the other hand, if you have a (near) perfect hash function, the number of buckets would be the main factor for the chance of collision. If your hash table has only 20 buckets, the minimal chance of collision will indeed be 1 in 20 (over time). If the hash values weren't uniformly spread, you'd have a much higher chance of collision in at least one of the buckets. The more buckets you have, the less chance of collision. On the other hand, having too many buckets will take up more memory (even if they are empty), and ultimately reduce performance, even if there are less collisions.
Regular hash functions, in which collisions are probable, run in constant time: O(1). But what is the time complexity of a perfect hash function? Is it 1?
If the hash function is intended to be used to access a hash table, then there is no difference in terms of complexity between perfect and regular hash functions, since both of them may also create collisions in the table. The reason is that the index associated to an element in a hash table is the remainder of the division of the hash by the length of the table (usually a prime number). This is why two elements which hash to different values will collide if their remainder modulo the (said) prime is the same for both of them. This means that the time complexity of accessing the table is O(1) in both cases.
Note also that the computation of the hash usually depends on the size of the input. For instance, if the elements to be hashed are strings, good hashes take all their characters into account. Therefore, for the complexity to remain O(1), one has to limit the size (or length) of the inputs. Again, this applies to both perfect and regular hashes.
If I have some data I hash with SHA256 like this :- hash=SHA256(data)
And then copy only the first 8 bytes of the hash instead of the whole 32 bytes, how easy is it to find a hash collision with different data? Is it 2^64 or 2^32 ?
If I need to reduce a hash of some data to a smaller size (n bits) is there any way to ensure the search space 2^n ?
I think you're actually interested in three things.
The first you need to understand is the entropy distribution of the hash. If the output of a hash function is n-bits long, then the maximum entropy is n bits. Note that I say maximum; you are never guaranteed to have n bits of entropy. Similarly, if you truncate the hash output to n/4 bits, you are not guaranteed to have a 2n/4 bits of entropy in the result. SHA-256 is fairly uniformly distributed, which means in part that you are unlikely to have more entropy in the high bits than the low bits (or vice versa).
However, information on this is sparse because the hash function is intended to be used with its whole hash output. If you only need an 8-byte hash output, then you might not even need a cryptographic hash function and could consider other algorithms. (The point is, if you need a cryptographic hash function, then you need as many bits as it can give you, as shortening the output weakens the security of the function.)
The second is the search space: it is not dependent on the hash function at all. Searching for an input that creates a given output on a hash function is more commonly known as a Brute-Force attack. The number of inputs that will have to be searched does not depend on the hash function itself; how could it? Every hash function output is the same: every SHA-256 output is 256 bits. If you just need a collision, you could find one specific input that generated each possible output of 256 bits. Unfortunately, this would take up a minimum storage space of 256 * 2256 ā 3 * 1079 for just the hash values themselves (i.e. not counting the inputs needed to generate them), which vastly eclipses the entire hard drive capacity of the entire world.
Therefore, the search space depends on the complexity and length of the input to the hash function. If your data is 8-character long ASCII strings, then you're pretty well guaranteed to never have a collision, BUT the search space for those hash values is only 27*8 ā 7.2 * 1016, which could be searched by your computer in a few minutes, probably. After all, you don't need to find a collision if you can find the original input itself. This is why salts are important in cryptography.
Third, you're interested in knowing the collision resistance. As GregS' linked article points out, the collision resistance of a space is much more limited than the input search space due to the pigeonhole principle.
Every hash function with more inputs than outputs will necessarily have collisions. Consider a hash function such as SHA-256 that produces 256 bits of output from an arbitrarily large input. Since it must generate one of 2256 outputs for each member of a much larger set of inputs, the pigeonhole principle guarantees that some inputs will hash to the same output. Collision resistance doesn't mean that no collisions exist; simply that they are hard to find.
The "birthday paradox" places an upper bound on collision resistance: if a hash function produces N bits of output, an attacker who computes "only" 2N/2 (or sqrt(2N)) hash operations on random input is likely to find two matching outputs. If there is an easier method than this brute force attack, it is typically considered a flaw in the hash function.
So consider what happens when you examine and store only the first 8 bytes (one fourth) of your output. Your collision resistance has dropped from 2256/2 = 2128 to 264/2 = 232. How much smaller is 232 than 2128? It's a whole lot smaller, as it turns out, approximately 0.0000000000000000000000000001% of the size at best.
Say a known SHA1 hash was calculated by concatenating several chunks of data and that the order in which the chunks were concatenated is unknown. The straight forward way to find the order of the chunks that gives the known hash would be to calculate an SHA1 hash for each possible ordering until the known hash is found.
Is it possible to speed this up by calculating an SHA1 hash separately for each chunk and then find the order of the chunks by only manipulating the hashes?
In short, No.
If you are using SHA-1, due to Avalanche Effect ,any tiny change in the plaintext (in your case, your chunks) would alter its corresponding SHA-1 significantly.
Say if you have 4 chunks : A B C and D,
the SHA1 hash of A+B+C+D (concated) is supposed to be uncorrelated with the SHA1 hash for A, B, C and D computed as separately.
Since they are unrelated, you cannot draw any relationship between the concated chunk (A+B+C+D, B+C+A+D etc) and each individual chunk (A,B,C or D).
If you could identify any relationship in-between, the SHA1 hashing algorithm would be in trouble.
Practical answer: no. If the hash function you use is any good, then it is supposed to look like a Random Oracle, the output of which on an exact given input being totally unknown until that input is tried. So you cannot infer anything from the hashes you compute until you hit the exact input ordering that you are looking for. (Strictly speaking, there could exist a hash function which has the usual properties of a hash function, namely collision and preimage resistances, without being a random oracle, but departing from the RO model is still considered as a hash function weakness.)(Still strictly speaking, it is slightly improper to talk about a random oracle for a single, unkeyed function.)
Theoretical answer: it depends. Assuming, for simplicity, that you have N chunks of 512 bits, then you can arrange for the cost not to exceed N*2160 elementary evaluations of SHA-1, which is lower than N! when N >= 42. The idea is that the running state of SHA-1, between two successive blocks, is limited to 160 bits. Of course, that cost is ridiculously infeasible anyway. More generally, your problem is about finding a preimage to SHA-1 with inputs in a custom set S (the N! sequences of your N chunks) so the cost has a lower bound of the size of S and the preimage resistance of SHA-1, whichever is lower. The size of S is N!, which grows very fast when N is increased. SHA-1 has no known weakness with regards to preimages, so its resistance is still assumed to be about 2160 (since it has a 160-bit output).
Edit: this kind of question would be appropriate on the proposed "cryptography" stack exchange, when (if) it is instantiated. Please commit to help create it !
Depending on your hashing library, something like this may work: Say you have blocks A, B, C, and D. You can process the hash for block A, and then clone that state and calculate A+B, A+C, and A+D without having to recalculate A each time. And then you can clone each of those to calculate A+B+C and A+B+D from A+B, A+C+B and A+C+D from A+C, and so on.
Nope. Calculating the complete SHA1 hash requires that the chunks be put in in order. The calculation of the next hash chunk requires the output of the current one. If that wasn't true then it would be much easier to manipulate documents so that you could reorder the chunks at will, which would greatly decrease the usefulness of the algorithm.