Cuckoo Hashing: What is the best way to detect collisions in hash functions?

I implemented a hashmap based on cuckoo hashing.
My hash functions take values of any length and return keys of type long. To match the keys to my array size n, I do key % n.
I'm thinking about the following scenario:
Insert value A with key A.key into location A.key % n
Find value B with key A.key
So for this example I get the entry for value A and it is not recognized that value B hasn't even been inserted. This happens if my hash function returns the same key for two different values. Collisions with different keys but same locations are no problem.
What is the best way to detect those collisions?
Do I have to check every time I insert or search an item if the original values are equal?

As with most hashing schemes, in cuckoo hashing, the hash code tells you where to look in the table for the element in question, but the expectation is that you store both the key and the value in the table so that before returning the stored value, you first check the key stored at that slot against the key you're looking for. That way, if you get the same hash code for two objects, you can determine which object was stored at that slot.
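A minimal sketch of that idea, assuming a two-table cuckoo layout. The class name `CuckooMap`, the `max_kicks` limit, and the way the two hash functions are derived from Python's built-in `hash()` are illustrative assumptions, not part of the question's implementation. The point is only that each slot holds the key as well as the value, and `get` compares keys before returning anything.

```python
class CuckooMap:
    """Toy cuckoo hash map: two tables, each slot holds a (key, value) pair."""

    def __init__(self, size=11, max_kicks=32):
        self.size = size
        self.max_kicks = max_kicks  # hypothetical limit on displacements
        self.tables = [[None] * size, [None] * size]

    def _index(self, which, key):
        # Two hash functions derived from the built-in hash(); an assumption
        # made for brevity, not a recommendation for production use.
        return hash((which, key)) % self.size

    def get(self, key):
        for which in (0, 1):
            slot = self.tables[which][self._index(which, key)]
            # Compare the stored key itself, not just the slot position.
            if slot is not None and slot[0] == key:
                return slot[1]
        raise KeyError(key)

    def put(self, key, value):
        # If the key is already stored, overwrite its value in place.
        for which in (0, 1):
            i = self._index(which, key)
            slot = self.tables[which][i]
            if slot is not None and slot[0] == key:
                self.tables[which][i] = (key, value)
                return
        # Otherwise insert, evicting ("kicking") occupants to their other table.
        entry = (key, value)
        which = 0
        for _ in range(self.max_kicks):
            i = self._index(which, entry[0])
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:
                return
            which = 1 - which
        # A real table would rehash or grow here instead of dropping `entry`.
        raise RuntimeError("too many displacements")
```

With this layout, looking up value B after inserting only value A raises `KeyError` even when both produce the same hash code, because the stored key does not compare equal.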

Related

Any way to get original data from hashed values in Snowflake?

I have a table which uses the snowflake hash function to store values in some columns.
Is there any way to reverse the encryption from the hash function and get the original values from the table?
As per the documentation, the function is "not a cryptographic hash function", and will always return the same result for the same input expression.
Example :
select hash(1) always returns -4730168494964875235
select hash('a') always returns -947125324004678632
select hash('1234') always returns -4035663806895772878
I was wondering if there is any way to reverse the hashing and get the original input expression from the hashed values.
I think these disclaimers are for preventing potential legal disputes:
Cryptographic hash functions have a few properties which this function
does not, for example:
The cryptographic hashing of a value cannot be inverted to find the
original value.
It's not possible to reverse a hash value in general. Even a very long text is represented by a 64-bit value, so it's obvious that the data is not preserved. On the other hand, if you use a brute-force technique, you may find a value that produces the hash, and that can be counted as reversing the hash value.
For example, if you store the hash values of all the numbers between 0 and 5000 in a table, then when someone comes with the hash value '-7875472545445966613', you can look that value up in your table and say it belongs to 1000 (number).
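The lookup-table idea can be sketched as follows. Snowflake's `HASH` is not available outside Snowflake, so `hash64` below is a stand-in 64-bit hash built from BLAKE2b; it will not reproduce the values quoted above, and the names `hash64`, `reverse_table`, and `reverse_lookup` are assumptions made for illustration.

```python
import hashlib


def hash64(value):
    """Stand-in 64-bit hash (NOT Snowflake's HASH): 8 bytes of BLAKE2b, signed."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


# Precompute hash -> input for every candidate value (here: 0 through 5000).
reverse_table = {hash64(n): n for n in range(0, 5001)}


def reverse_lookup(h):
    """Return the candidate that produced h, or None if it is not in the table."""
    return reverse_table.get(h)


print(reverse_lookup(hash64(1000)))  # 1000, because 1000 is among the candidates
```

This only works when the original value is in the set of candidates you hashed in advance; it recovers nothing for inputs outside that set.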

Key, Value, Hash and Hash function for HashTable

I'm having trouble understanding what the Hash Function does and doesn't do, as well as what exactly a Bucket is.
From my understanding:
A HashTable is a data structure that maps keys to values using a Hash Function.
A HashFunction is meant to map data from an array of arbitrary/unknown size to a data array of fixed size.
There can be duplicate Values in the original data array, but this is irrelevant.
Each Value will have a unique Key. Thus, each Key has exactly 1 Value.
The HashFunction will generate a HashCode for each (Value, Key) pair. However, Collisions can occur in which multiple (Value, Key) pairs map to the same HashCode.
This can be remedied by using either Chaining/Open Addressing methods.
The HashCode is the index value indicating the position of a particular entry from the original data array within the Bucket array.
The Bucket array is the fixed data array constructed that will contain the entries from the original array.
My questions:
How are the Keys generated for each value? Is the HashFunction meant to generate both Key and HashCode values for each entry? Does each Bucket thus contain only one entry (assuming a Chaining implementation to remedy Collision)?
How are the Keys generated for each value?
The key is not generated; it is provided by you and serves as the input to the hash function, which in turn converts that key into an index into the hash table. Simply speaking:
H(key)=index
so the value you are looking for is:
hash_table[index] = value
Is the HashFunction meant to generate HashCode values for each entry?
It all depends on the implementation of hash function and hash table. Some hash functions might generate a hashcode out of provided key and then for example take its modulo(size) where size is the size of hash table, in order to get the index. Others might convert the key directly into index. In either case the ultimate goal of hash function is to find the location of searched data within hash table in constant time.
Does each Bucket thus contain only one entry (assuming a Chaining implementation to remedy Collision)?
Ideally each key would be mapped to a unique index, but mostly that's not the case, since the number of buckets (i.e. indices) is far smaller than the number of possible keys. With chaining, a bucket can therefore hold more than one entry, and the average chain length per bucket is (number of keys) / (number of buckets).
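As a rough illustration of the answer above, here is a sketch that turns a key into a bucket index via hash code and modulo, then measures the average chain length. The function names are hypothetical, and Python's built-in `hash()` stands in for whatever hash function a real table uses.

```python
def bucket_index(key, size):
    # Hash code first, then modulo the table size to get an index.
    return hash(key) % size


def build_buckets(pairs, size):
    # Chaining: every bucket is a list of (key, value) entries.
    buckets = [[] for _ in range(size)]
    for key, value in pairs:
        buckets[bucket_index(key, size)].append((key, value))
    return buckets


def average_chain_length(buckets):
    return sum(len(b) for b in buckets) / len(buckets)


buckets = build_buckets(((n, str(n)) for n in range(100)), size=8)
print(average_chain_length(buckets))  # 100 keys / 8 buckets = 12.5
```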

How are same hash vs same key handled?

This question is not specific to any programming language, I am more interested in a generic logic.
Generally, associative maps take a key and map it to a value. As far as I know, implementations require the keys to be unique otherwise values get overwritten. Alright.
So let us assume that the above is done by some hash implementation.
What if two DIFFERENT keys get the same hash value? I am thinking of this in the form of an underlying array whose indices are the result of hashing said keys. It could be possible that more than one unique key gets mapped to the same index, yes? If so, how does such an implementation handle this?
How is handling same hash different from handling same key? Since same key results in overwriting and same hash HAS to retain the value.
I understand hashing with collision, so I know chaining and probing. Do implementations iterate over the current values which are hashed to a particular index and determine if the key is the same?
While I was searching for the answer I came across these links:
1. What happens when a duplicate key is put into a HashMap?
2. HashMap with multiple values under the same key
They don't answer my question however. How do we distinguish between same hash vs same key?
By comparing the keys. If you look at object-oriented implementations of hash maps, you'll find that they usually require two methods to be implemented on the key type:
bool equal(Key key1, Key key2);
int hash(Key key);
If only the hash function can be given and no equality function, that restricts the hash map to be based on the language's default equality. This is not always desirable as sometimes keys need to be compared with a different equality function. For example, if the keys are strings, an application may need to do a case-insensitive comparison, and then it would pass a hash function that converts to lowercase before hashing, and an equal function that ignores case.
The hash map stores the key alongside each corresponding value. (Usually, that's a pointer to the key object that was originally stored.) Any lookup into the hash map has to make a key comparison after finding a matching hash, to verify that the key actually matches.
For example, for a very simple hash map that stores a list in each bucket, the list would be a list of (key, value) pairs, and any lookup compares the keys of each list entry until it finds a match. In pseudocode:
Array<List<Pair<Key, Value>>> buckets;

Value lookup(Key k_sought) {
    // Reduce the hash to a valid bucket index.
    int h = hash(k_sought) % buckets.length;
    List<Pair<Key, Value>> bucket = buckets[h];
    for (kv in bucket) {
        Key k_found = kv.0;
        Value v_found = kv.1;
        // Same bucket is not enough: the keys themselves must be equal.
        if (equal(k_sought, k_found)) {
            return v_found;
        }
    }
    throw Not_found;
}
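The same lookup, together with the insert side and the case-insensitive example mentioned above, might look like this in Python. `ChainedMap`, `hash_fn`, and `equal_fn` are names chosen for this sketch; it assumes chaining with one list per bucket.

```python
class ChainedMap:
    """Hash map with separate chaining and caller-supplied hash/equality."""

    def __init__(self, hash_fn, equal_fn, n_buckets=16):
        self.hash_fn = hash_fn
        self.equal_fn = equal_fn
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if self.equal_fn(k, key):
                bucket[i] = (k, value)  # same key: overwrite the value
                return
        bucket.append((key, value))     # same hash, different key: keep both

    def get(self, key):
        for k, v in self._bucket(key):
            if self.equal_fn(k, key):
                return v
        raise KeyError(key)


# Case-insensitive string keys: hash and equality must agree with each other.
m = ChainedMap(hash_fn=lambda s: hash(s.lower()),
               equal_fn=lambda a, b: a.lower() == b.lower(),
               n_buckets=1)  # one bucket forces every key to collide
m.put("Alice", 1)
m.put("BOB", 2)
m.put("alice", 3)      # equal to "Alice" under equal_fn, so it overwrites
print(m.get("ALICE"))  # 3
print(m.get("bob"))    # 2
```

Using a single bucket makes every insert a hash collision, which shows that "same hash" is resolved by keeping both entries while "same key" is resolved by overwriting.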
You cannot tell what a key is from the index, so no, you cannot iterate over the values to find any information about the keys. You will either have to guarantee zero collisions or store the key that was hashed to give the index.
If you only have values stored in your structure, there is no way to tell if they have the same key or just the same hash. You will need to store the key along with the value to know.

Hashing : Insertion to a Deleted Slot

I'm new to hashing and here's my question:
Can you insert in a DELETED slot of the hash table?
Yes, you can insert to a deleted slot. But...
First, you should know that there is soft deletion and hard deletion. In a soft delete you just flip a flag and mark the slot as "deleted"; in a hard delete you empty the slot.
Let me explain why we need soft delete. Suppose you're using a hash table with linear probing and your hash function maps 3 input values to the same slot. With linear probing you place these three elements by advancing linearly through the table until you find an empty slot. If you then use hard delete to remove one of the first two, you break the hash table: a lookup stops at the now-empty slot, so a value stored after it becomes unreachable.
On the other hand; if you have a perfect hash function you are OK to use hard-delete. A perfect hash function maps every input value to slots uniquely. So no probing scheme is needed and hard-delete doesn't break your table.
Now coming back to your question, you should also consider and figure out how to avoid duplicate insertions.

How does the hash part in hash maps work?

So there is this nice picture in the hash maps article on Wikipedia:
Everything clear so far, except for the hash function in the middle.
How can a function generate the right index from any string? Are the indexes integers in reality too? If yes, how can the function output 1 for John Smith, 2 for Lisa Smith, etc.?
That's one of the key problems of hashmaps/dictionaries and so on. You have to choose a good hash function. A very bad but fast hash function could be the length of the keys. You can see immediately that you will get a lot of collisions (different keys, but same hash). Another bad hash function could be the ASCII value of the first character of your key. Lots of collisions, too.
So you need a function that is a lot better than those two. You could add (xor) all ASCII values of the key characters and mix the length in for instance. In practice you often depend on the values (fields) of the object that you want to hash (same values give same hash => value type). For reference types you can mix in a memory location for instance.
In your example that's just simplified a lot. No real hash function would map these keys to sequential numbers.
Maybe you want to read one of my previous answers to hashmaps
A simple hash function may be as follows:
$hash = $string[0] % HASH_TABLE_SIZE;
This function will return a number between 0 and HASH_TABLE_SIZE - 1, depending on the first letter of the string. This number can be used to go to the correct position in the hash table.
A real hash function will consider all letters in a string, and it will be designed so that there is an even spread among the buckets.
The hash function most often (but not necessarily always) outputs an integer within the wanted range (often a parameter to the hash function). This integer can be used as an index. Notice that a hash function cannot be guaranteed to always produce a unique result when given different data to hash. This is called a hash collision, and a hash table implementation must always handle it in some way.
As for your specific question: how does a string become a number? Any string is composed of characters (J, o, h, n ...) and characters can be interpreted as numbers (in computers). The ASCII and UTF standards bind certain values to certain characters, so the result is deterministic and always the same on all computers. So the hash function performs operations on these characters, treating them as numbers, and comes up with another number (the output). You could, for example, simply sum all the values and use the modulo operation to range-limit the resulting value.
This would be quite a horrible hashing function because for example "ab" and "ba" would get same result. Design of hash function is difficult and so one should use some ready-made algorithm unless situation dictates some other solution.
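To make the "ab" versus "ba" point concrete, here is a small sketch comparing the sum-and-modulo approach with a position-sensitive polynomial hash. The polynomial variant with base 31 is a common textbook technique swapped in for contrast; it is not something the answers above prescribe, and the function names are hypothetical.

```python
def sum_hash(s, size):
    # Adds the character codes; the order of characters does not matter.
    return sum(ord(c) for c in s) % size


def poly_hash(s, size, base=31):
    # Multiplying by a base before adding each character makes order matter.
    h = 0
    for c in s:
        h = (h * base + ord(c)) % size
    return h


print(sum_hash("ab", 16) == sum_hash("ba", 16))        # True: they collide
print(poly_hash("ab", 1009) == poly_hash("ba", 1009))  # False
```

Even the polynomial version still produces collisions for other inputs, so the table has to compare keys regardless of which function it uses.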
There's a really good article on how hash functions (and collision detection/resolution) work on MSDN:
Part 2: The Queue, Stack, and Hashtable
You can skip down to the header Compressing Ordinal Indexing with a Hash Function
There are some bits and pieces that are .NET specific (when they talk about which Hash algorithm .NET uses by default) but for the most part it is language agnostic.
All that is required of a hash function is that it returns the same integer given the same key. Technically, a hash function that always returns '1' is not incorrect.
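This can be checked with Python's built-in `dict`, which uses a key's `__hash__` and `__eq__`. The `ConstKey` class below is a hypothetical key type whose hash is always 1; lookups remain correct because equality decides the match, although every key now collides and performance degrades.

```python
class ConstKey:
    """Key type whose hash is always 1: valid, but every key collides."""

    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 1

    def __eq__(self, other):
        return isinstance(other, ConstKey) and self.name == other.name


table = {ConstKey(f"k{i}"): i for i in range(100)}
print(len(table))               # 100: distinct keys are kept apart by __eq__
print(table[ConstKey("k42")])   # 42
```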