Regular hash functions, in which collisions are probable, run in constant time: O(1). But what is the time complexity of a perfect hash function? Is it O(1) as well?
If the hash function is intended to be used to access a hash table, then there is no difference in terms of complexity between perfect and regular hash functions, since both of them may still create collisions in the table. The reason is that the index associated with an element in a hash table is the remainder of the division of the hash by the length of the table (usually a prime number). Thus two elements that hash to different values will still collide if their remainders modulo the table length happen to be the same. This means that the time complexity of accessing the table is O(1) in both cases.
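As a toy illustration of that point (the table length and the keys below are arbitrary choices, not anything from the question), pigeonholing a handful of strings into a small table shows distinct hash values landing in the same bucket:

```python
table_size = 11   # toy table length (a prime), chosen only for illustration

def bucket(key):
    # The table index is the remainder of the hash divided by the table length.
    return hash(key) % table_size

words = ["apple", "banana", "cherry", "date", "elderberry", "fig",
         "grape", "honeydew", "kiwi", "lemon", "mango", "nectarine"]
seen = {}
for w in words:
    b = bucket(w)
    if b in seen:
        print(f"{seen[b]!r} and {w!r} hash to different values but share bucket {b}")
        break
    seen[b] = w
```

With twelve keys and only eleven buckets, some pair is guaranteed to share a bucket even though their underlying hash values almost certainly differ.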
Note also that the cost of computing the hash usually depends on the size of the input. For instance, if the elements to be hashed are strings, good hashes take all of their characters into account. Therefore, for the complexity to remain O(1), one has to bound the size (or length) of the inputs. Again, this applies to both perfect and regular hashes.
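For a concrete sense of why the cost tracks the input length, here is a toy polynomial string hash (the base and modulus are arbitrary illustration choices) that has to visit every character:

```python
def poly_hash(s, base=131, mod=2**61 - 1):
    # Toy polynomial string hash: it visits every character, so the cost
    # grows linearly with len(s). Base and modulus are arbitrary choices.
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h

print(poly_hash("short"))
print(poly_hash("a much longer string costs proportionally more to hash"))
```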
Related
Adler32 and CRC have the property that f(a || b) can be computed inexpensively from f(a), f(b), and len(b). Are there any other common non-cryptographic hash functions with this property?
Context (to avoid XY problem) is that I am deduplicating strings by splitting them into chunks, which are indexed by their hash. An input string can then be represented as a sequence of chunks, concatenated. I'd like to use a hash function such that all representations of a string have the same hash, which can be computed directly from the chunk hashes without needing the underlying data, as it is being streamed in unspecified order and thus may not be available in the same place at any one time.
My design calls for roughly 2^32 chunks. Collisions are very expensive, but would not harm correctness. Based on that, I think that CRC64 would work, but I'm curious what my alternatives are. I wouldn't mind a 128 bit hash for future proofing (as in: dataset size may grow).
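To illustrate the property I mean, here's a quick sketch for Adler-32; the combine formula just follows from Adler-32 being a pair of running sums, and zlib.adler32 is only used to check the result:

```python
import zlib

MOD = 65521  # Adler-32 arithmetic is done modulo this prime

def adler32_combine(ad_a, ad_b, len_b):
    # Compute adler32(a + b) from adler32(a), adler32(b) and len(b) alone.
    # Adler-32 is a pair of running sums (A, B) packed as (B << 16) | A.
    a1, b1 = ad_a & 0xFFFF, (ad_a >> 16) & 0xFFFF
    a2, b2 = ad_b & 0xFFFF, (ad_b >> 16) & 0xFFFF
    a = (a1 + a2 - 1) % MOD
    b = (b1 + b2 + (len_b % MOD) * (a1 - 1)) % MOD
    return (b << 16) | a

x, y = b"hello, ", b"world"
combined = adler32_combine(zlib.adler32(x), zlib.adler32(y), len(y))
assert combined == zlib.adler32(x + y)
print(hex(combined))
```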
The probability of one collision among all pairs of your 2^32 64-bit CRCs is about 1/2. If that's too high for you, you can use a 128-bit CRC. That drops the probability of one collision to about 3×10^-20.
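Those figures follow from the usual birthday bound; a quick back-of-the-envelope check (note the n(n-1)/2m approximation slightly overstates probabilities near 1):

```python
# Birthday bound: hashing n items into m = 2**bits possible values gives
# P(at least one collision) ~= n * (n - 1) / (2 * m).
def collision_probability(n, bits):
    return n * (n - 1) / (2 * 2**bits)

print(collision_probability(2**32, 64))    # ~0.5     with 64-bit hashes
print(collision_probability(2**32, 128))   # ~2.7e-20 with 128-bit hashes
```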
I often see or hear of modulus being used as a last step of hashing or after hashing. e.g. h(input)%N where h is the hash function and % is the modulus operator. If I am designing a hash table, and want to map a large set of keys to a smaller space of indices for the hash table, doesn't the modulus operator achieve that? Furthermore, if I wanted to randomize the distribution across those locations within the hash table, is the remainder generated by modulus not sufficient? What does the hashing function h provide on top of the modulus operator?
I often see or hear of modulus being used as a last step of hashing or after hashing. e.g. h( input ) % N where h is the hash function and % is the modulus operator.
Indeed.
If I am designing a hash table, and want to map a large set of keys to a smaller space of indices for the hash table, doesn't the modulus operator achieve that?
That's precisely the purpose of the modulo operator: to restrict the range of array indexes, so yes.
But you cannot simply use the modulo operator by itself: it requires an integer operand, and there is no such thing as the "modulo of a string over N" or the "modulo of an object-graph over N"[1].
Furthermore, if I wanted to randomize the distribution across those locations within the hash table, is the remainder generated by modulus not sufficient?
No, it is not sufficient, because the modulo operator doesn't give you pseudorandom output, nor does it have any kind of avalanche effect. That means similar input values produce similar outputs, which results in clustering in your hashtable bins and subpar performance due to the greatly increased likelihood of hash collisions (and so requires slower techniques like linear probing, which defeat the purpose of a hashtable because you lose O(1) lookup times).
What does the hashing function h provide on top of the modulus operator?
The domain of h can be anything, including non-integer values.
[1] Technically speaking, this is possible if you use the value of the memory address of an object (i.e. an object pointer), but that doesn't work if you have hashtable keys that don't use object identity, such as a stack-allocated object or custom struct.
First, the hash function's primary purpose is to turn something that's not a number into a number. Even if you just use modulus after that to get a number in your range, getting the number is still the first step and is the responsibility of the hash function. If you're hashing integers and you just use the integers as their own hashes, it isn't that there's no hash function, it's that you've chosen the identity function as your hash function. If you don't write out the function, that means you inlined it.
Second, the hash function can provide a more unpredictable distribution to reduce the likelihood of unintentional collisions. The data people work with often contains patterns, and if you just use the identity function followed by a modulus, those patterns may make collisions much more likely. The hash function is an opportunity to break this up, so that the modulus is unlikely to expose patterns in the original data sequence.
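A small experiment makes the clustering point visible. The keys below are deliberately patterned (all multiples of 8), and the mixing function is just one illustrative choice (a splitmix64-style finalizer), not something either answer prescribes:

```python
from collections import Counter

N = 8
keys = [i * 8 for i in range(1000)]   # deliberately patterned: multiples of 8

def mix(k):
    # splitmix64-style finalizer: multiplies and xor-shifts to scramble bits.
    k = ((k ^ (k >> 30)) * 0xBF58476D1CE4E5B9) & 0xFFFFFFFFFFFFFFFF
    k = ((k ^ (k >> 27)) * 0x94D049BB133111EB) & 0xFFFFFFFFFFFFFFFF
    return k ^ (k >> 31)

print(Counter(k % N for k in keys))        # identity "hash": all 1000 in bucket 0
print(Counter(mix(k) % N for k in keys))   # mixed first: roughly 125 per bucket
```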
Is computing an object's hash done in O(1), O(n), or somewhere in between? Is there any disadvantage to computing the hash of a very large object vs. a small one? If it matters, I'm using Python.
Generally speaking, computing a hash will be O(1) for "small" items and O(N) for "large" items (where "N" denotes the size of an item's key). The precise dividing line between small and large varies, but is typically somewhere in the vicinity of the size of a register (e.g., 32 bits on a 32-bit machine, 64 bits on a 64-bit machine). This can also depend on the input type: for example, integer types up to the register size hash with constant complexity, but strings take time proportional to their size in bytes, right down to a single character (i.e., a two-character string takes roughly twice as long as a single-character string).
Once you've computed the hash, accessing the hash table has expected constant complexity, but can be as bad as O(N) in the worst case (but this is a different "N"--the number of items inserted in the table, not the size of an individual key).
The real answer is: it depends. You didn't specify which hash function you are interested in. When we are talking about a cryptographic hash like SHA-256, the complexity is O(n). When we are talking about a hash function that takes the last two digits of a phone number, it will be O(1). Hash functions that are used in hash tables tend to be optimized for speed and are thus closer to O(1).
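If you want to see the O(n) behaviour of a cryptographic hash for yourself, here is a rough timing sketch (the sizes and repeat count are arbitrary):

```python
import hashlib
import timeit

# Rough timing of SHA-256 over growing inputs: the cost scales with input size.
for size in (1_000, 100_000, 10_000_000):
    data = b"x" * size
    secs = timeit.timeit(lambda: hashlib.sha256(data).digest(), number=20)
    print(f"{size:>10} bytes: {secs:.4f} s for 20 hashes")
```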
For further reference on hash tables, see the Python wiki's page on Time Complexity.
Most of the time, computing the hash and accessing the table is O(1). However, with a really bad hash where every value ends up with the same hash, the worst case is O(n): the more objects that map to the same hash, the more collisions there are.
A HashMap (or HashTable) is an example of a keyed array. Here, the indices are user-defined keys rather than the usual index numbers. For example, arr["first"] = 99 is an example of a hashmap where the key is "first" and the value is 99.
Since keys are used, a hashing function is required to convert the key to an array index before data can be inserted into or searched for in the array. This process assumes that there are no collisions.
Now, given a key to be searched for in the array, the data must be fetched if it is present. So, every time, the key must be converted to an index of the array before the search. How, then, can that take O(1) time? It cannot in general: the time complexity depends on the hashing function as well, so the overall cost is O(the hashing function's time).
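As a sketch of that flow (key, then hash, then index), here is a toy keyed array with no collision handling at all; every insert and search pays the cost of hashing the key first:

```python
class TinyHashMap:
    # Toy "keyed array" with no collision handling: every insert and search
    # first pays the cost of hashing the key, then uses the remainder as index.
    def __init__(self, size=8):
        self.slots = [None] * size

    def _index(self, key):
        return hash(key) % len(self.slots)   # O(time to hash the key)

    def __setitem__(self, key, value):
        self.slots[self._index(key)] = (key, value)

    def __getitem__(self, key):
        slot = self.slots[self._index(key)]
        if slot is not None and slot[0] == key:
            return slot[1]
        raise KeyError(key)

arr = TinyHashMap()
arr["first"] = 99     # mirrors the arr["first"] = 99 example above
print(arr["first"])   # 99, found after hashing "first" again
```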
When talking about hashing, we usually measure the performance of a hash table by talking about the expected number of probes that we need to make when searching for an element in the table. In most hashing setups, we can prove that the expected number of probes is O(1). Usually, we then jump from there to "so the expected runtime of a hash table lookup is O(1)."
This isn't necessarily the case, though. As you've pointed out, the cost of computing the hash function on a particular input might not always take time O(1). Similarly, the cost of comparing two elements in the hash table might also not take time O(1). Think about hashing strings or lists, for example.
That said, what is usually true is the following. If we let the total number of elements in the table be n, we can say that the expected cost of performing a lookup in the hash table is independent of the number n. That is, it doesn't matter whether there are 1,000,000 elements in the hash table or 10^100 - the number of spots you need to probe is, on average, the same. Therefore, we can say that the expected cost of performing a lookup in a hash table, as a function of the hash table size, is O(1), because the cost of performing a lookup doesn't depend on the table size.
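A quick, informal timing check of that independence claim, using Python sets with integer keys (the sizes are arbitrary):

```python
import timeit

# Lookup time in a Python set barely changes as the set grows: the cost is
# dominated by hashing and comparing the key, not by the number of elements.
for n in (10**3, 10**5, 10**6):
    table = set(range(n))
    probe = n // 2                       # an element that is present
    secs = timeit.timeit(lambda: probe in table, number=1_000_000)
    print(f"n = {n:>9}: {secs:.3f} s for a million lookups")
```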
Perhaps the best way to account for the cost of a lookup in a hash table would be to say that it's O(T_hash + T_eq), where T_hash is the time required to hash an element and T_eq is the time required to compare two elements in the table. For strings, for example, you could say that the expected cost of a lookup is O(L + L_max), where L is the length of the string you're hashing and L_max is the length of the longest string stored in the hash table.
Hope this helps!
So hash tables are really cool for constant-time lookups of data in sets, but as I understand it, they are limited by possible hashing collisions, which can add small amounts of extra time to lookups.
It seems to me like any hashing function that supports a non-finite range of inputs is really just a heuristic for reducing collisions. Are there any absolute limitations to creating a perfect hash table for any range of inputs, or is it just something that no one has figured out yet?
I think this depends on what you mean by "any range of inputs."
If your goal is to create a hash function that can take in anything and never produce a collision, then there's no way to do what you're asking. This is a consequence of the pigeonhole principle - if you have n objects that can be hashed, you need at least n distinct outputs for your hash function or you're forced to get at least one hash collision. If there are infinitely many possible input objects, then no finite hash table could be built that will always avoid collisions.
On the other hand, if your goal is to build a hash table where lookups are worst-case O(1) (that is, you only have to look at a fixed number of locations to find any element), then there are many different options available. You could use a dynamic perfect hash table or a cuckoo hash table, both of which support worst-case O(1) lookups and expected O(1) insertions and deletions. These hash tables work by using a family of different hash functions rather than any one fixed hash function, which helps circumvent the above restriction.
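For flavour, here is a minimal and deliberately simplified cuckoo hash table sketch: lookups probe at most two slots, insertions may evict and relocate keys, and a real implementation would be far more careful about rehashing and hash-function choice:

```python
import random

class CuckooHashTable:
    # Two tables, two seeded hash functions. A key can only ever live in one
    # of two slots, so lookup is worst-case O(1).
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]
        self.seeds = [random.randrange(1 << 30), random.randrange(1 << 30)]

    def _index(self, which, key):
        # Mix the key with a per-table seed so the two positions differ.
        return hash((self.seeds[which], key)) % self.capacity

    def lookup(self, key):
        for which in (0, 1):
            slot = self.tables[which][self._index(which, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def insert(self, key, value, max_kicks=32):
        # If the key is already present, overwrite it in place.
        for which in (0, 1):
            i = self._index(which, key)
            slot = self.tables[which][i]
            if slot is not None and slot[0] == key:
                self.tables[which][i] = (key, value)
                return
        entry = (key, value)
        which = 0
        for _ in range(max_kicks):
            i = self._index(which, entry[0])
            if self.tables[which][i] is None:
                self.tables[which][i] = entry
                return
            # Evict the current occupant and try it in the other table.
            self.tables[which][i], entry = entry, self.tables[which][i]
            which = 1 - which
        # Too many evictions: grow, pick new seeds, and reinsert everything.
        self._rehash(entry)

    def _rehash(self, pending):
        old = [e for t in self.tables for e in t if e is not None]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        self.seeds = [random.randrange(1 << 30), random.randrange(1 << 30)]
        for k, v in old + [pending]:
            self.insert(k, v)

t = CuckooHashTable()
for word in ["apple", "banana", "cherry", "date"]:
    t.insert(word, len(word))
print(t.lookup("cherry"))   # -> 6
print(t.lookup("missing"))  # -> None
```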
Hope this helps!