How to count collisions in chaining hash (with linked list)?

I have two ".txt" files: one contains many strings to be inserted into a hash table, and the other contains strings to be searched for in that hash table.
I'm writing code that creates the hash table, inserts the strings from the first ".txt" file using a hash function, and then searches the table for each string in the second ".txt" file.
The task is to display the time spent finding all the strings in the hash table (no problem), the number of strings found (no problem), and the collision count (here is the problem).
I'm using a chained hash table with linked lists, and I count collisions while inserting elements. I found two ways of counting collisions, and both appear to make sense to me.
First one: once I generate the index from the key, check whether that bucket is NULL; if not, "collision++" once and insert the element at the end of the linked list.
Second one: once I generate the index from the key, check whether that bucket is NULL; if not, "collision++" once for each element already in the linked list, walking the list until the position is NULL.
Which one is more appropriate?
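To make the difference between the two counts concrete, here is a minimal Python sketch (Python is only my choice for illustration; the question does not name a language). The function name count_collisions, the hash_fn parameter, and the use of a Python list in place of each linked list are all assumptions made for this example.

```python
def count_collisions(keys, num_buckets, hash_fn=hash):
    """Insert keys into a chained table and return both collision counts."""
    # A Python list stands in for each bucket's linked list (assumption).
    buckets = [[] for _ in range(num_buckets)]
    per_insert = 0   # first method: +1 when the target bucket is occupied
    per_element = 0  # second method: +1 per element already in the bucket

    for key in keys:
        chain = buckets[hash_fn(key) % num_buckets]
        if chain:  # bucket is not empty, so this insert collides
            per_insert += 1
            per_element += len(chain)
        chain.append(key)  # insert at the end of the chain

    return per_insert, per_element
```

With the toy hash function len and four buckets, count_collisions(["a", "b", "c", "dd"], 4, hash_fn=len) returns (2, 3). The first number is how many inserts landed in an occupied bucket; the second is how many existing elements those inserts had to walk past to reach the end of the chain.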

Related

Key, Value, Hash and Hash function for HashTable

I'm having trouble understanding what the Hash Function does and doesn't do, as well as what exactly a Bucket is.
From my understanding:
A HashTable is a data structure that maps keys to values using a Hash Function.
A HashFunction is meant to map data from an array of arbitrary/unknown size to a data array of fixed size.
There can be duplicate Values in the original data array, but this is irrelevant.
Each Value will have a unique Key. Thus, each Key has exactly 1 Value.
The HashFunction will generate a HashCode for each (Value, Key) pair. However, Collisions can occur in which multiple (Value, Key) pairs map to the same HashCode.
This can be remedied by using either Chaining/Open Addressing methods.
The HashCode is the index value indicating the position of a particular entry from the original data array within the Bucket array.
The Bucket array is the fixed data array constructed that will contain the entries from the original array.
My questions:
How are the Keys generated for each value? Is the HashFunction meant to generate both Key and HashCode values for each entry? Does each Bucket thus contain only one entry (assuming a Chaining implementation to remedy Collision)?
How are the Keys generated for each value?
The key is not generated; you provide it, and it serves as the input to the hash function, which converts the key into an index into the hash table. Simply put:
H(key)=index
so the value you are looking for is:
hash_table[index] = value
Is the HashFunction meant to generate HashCode values for each entry?
It depends on the implementation of the hash function and the hash table. Some hash functions generate a hash code from the provided key and then, for example, take it modulo size, where size is the size of the hash table, to get the index. Others convert the key directly into an index. In either case, the ultimate goal of the hash function is to locate the searched data within the hash table in constant time.
Does each Bucket thus contain only one entry (assuming a Chaining implementation to remedy Collision)?
Ideally each key would map to a unique index, but usually that is not the case, since the number of buckets (i.e. indices) is far smaller than the number of possible keys. With chaining, a bucket can therefore hold more than one entry, and the average chain length per bucket (a rough measure of collisions per bucket) is (number of keys stored) / (number of buckets).
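The pieces above (H(key) = index, hash code modulo size, chains per bucket) can be sketched in a few lines of Python. This is an illustration under my own assumptions, not any particular library's implementation: the class name ChainedHashTable and its methods are hypothetical, Python's built-in hash() plays the role of the hash code, and a Python list stands in for each chain.

```python
class ChainedHashTable:
    """Toy hash table with separate chaining (illustrative only)."""

    def __init__(self, size=8):
        self.size = size
        self.buckets = [[] for _ in range(size)]  # one chain per bucket
        self.count = 0  # number of distinct keys stored

    def _index(self, key):
        # H(key) = index: hash code first, then modulo the table size.
        return hash(key) % self.size

    def put(self, key, value):
        chain = self.buckets[self._index(key)]
        for i, (stored_key, _) in enumerate(chain):
            if stored_key == key:  # key already present: overwrite its value
                chain[i] = (key, value)
                return
        chain.append((key, value))  # new key: the chain grows by one entry
        self.count += 1

    def get(self, key):
        # Only the one bucket selected by the hash function is searched.
        for stored_key, value in self.buckets[self._index(key)]:
            if stored_key == key:
                return value
        raise KeyError(key)

    def load_factor(self):
        # Average chain length: keys stored / number of buckets.
        return self.count / self.size
```

For example, storing the ten integer keys 0..9 in a table of size 4 gives a load factor of 2.5, so a bucket holds two or three entries rather than one.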

Complexity of insert in Hash Table

Consider an initially empty hash table of size M and hash function h(x) = x mod M. In the worst case, what is the time complexity (in Big-Oh notation) to insert n keys into the table if separate chaining is used to resolve collisions (without rehashing)? Suppose that each entry (bucket) of the table stores an unordered linked list. When adding a new element to an unordered linked list, such an element is inserted at the beginning of the list.
In the absence of collisions, inserting a key into a hash table/map is O(1), since looking up the bucket is a constant-time operation. I would not expect this to change when there are collisions, assuming that collisions are resolved using a linked list and that the new element is inserted at the head of the list. The reason is that adding a new element to the head of a linked list is also O(1). So, under these assumptions, each insert is O(1), and therefore inserting n keys is O(n).
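A small Python sketch of the insertion described in the question, with h(x) = x mod M and each new key placed at the head of its bucket's list. The names Node, insert_all, and chain_keys are hypothetical helpers for this illustration. As in the answer's reasoning, no check for an existing key is made; such a check would require walking the chain.

```python
class Node:
    """Singly linked list node holding one key."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


def insert_all(keys, m):
    """Insert integer keys into a table of size m using h(x) = x mod m."""
    table = [None] * m
    for x in keys:
        i = x % m
        # The new node points at the old head and becomes the new head.
        # This is constant work regardless of how long the chain is.
        table[i] = Node(x, table[i])
    return table


def chain_keys(table, i):
    """Return the keys in bucket i, head first."""
    keys = []
    node = table[i]
    while node is not None:
        keys.append(node.key)
        node = node.next
    return keys
```

With M = 5, inserting 0, 5, 10, 3 in that order leaves bucket 0 holding 10, 5, 0 (most recent first) and bucket 3 holding 3. Each insert touched only the head pointer, even though three keys collided in bucket 0.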

Cuckoo Hashing: What is the best way to detect collisions in hash functions?

I implemented a hashmap based on cuckoo hashing.
My hash functions take values of any length and return keys of type long. To match the keys to my array size n, I do key % n.
I'm thinking about the following scenario:
Insert value A with key A.key into location A.key % n
Find value B with key A.key
So for this example I get the entry for value A and it is not recognized that value B hasn't even been inserted. This happens if my hash function returns the same key for two different values. Collisions with different keys but same locations are no problem.
What is the best way to detect those collisions?
Do I have to check every time I insert or search an item if the original values are equal?
As with most hashing schemes, in cuckoo hashing the hash code only tells you where to look in the table for the element in question. The expectation is that you store both the key and the value in the table, so that before returning the stored value you first compare the key stored at that slot with the key you're looking for. That way, if two objects produce the same hash code, you can still determine which object is stored at that slot.
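Here is a minimal Python sketch of that idea, assuming a two-table cuckoo layout. The class name CuckooMap, the parameters h1, h2, and max_kicks, and the default hash functions are all my own assumptions, and a real implementation would rehash or grow instead of giving up after too many evictions. The point is only that each slot stores the (key, value) pair and lookups compare the stored key.

```python
class CuckooMap:
    """Toy two-table cuckoo hash map that stores keys alongside values."""

    def __init__(self, n=11, h1=None, h2=None, max_kicks=32):
        self.n = n
        self.max_kicks = max_kicks
        # Hypothetical default hash functions built on Python's hash().
        self.hashes = [h1 or hash, h2 or (lambda k: hash((k, "second")))]
        self.tables = [[None] * n, [None] * n]

    def _slot(self, which, key):
        return self.hashes[which](key) % self.n

    def get(self, key):
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            # Compare the stored key, not just the location.
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def put(self, key, value):
        # If the key is already stored, overwrite its value in place.
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                self.tables[which][i] = (key, value)
                return
        # Otherwise place it, evicting occupants to their other table.
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._slot(which, entry[0])
            entry, self.tables[which][i] = self.tables[which][i], entry
            if entry is None:  # the slot was free, nothing was evicted
                return
            which = 1 - which  # evicted entry tries the other table
        # Sketch only: the last evicted entry is dropped at this point.
        raise RuntimeError("too many evictions; rehashing would be needed")
```

To reproduce the scenario from the question, build the map with deliberately weak hash functions, for example CuckooMap(h1=len, h2=lambda k: len(k) + 1). After put("ab", 1), calling get("cd") raises KeyError even though "cd" hashes to exactly the same slots as "ab", because the stored key does not match.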

Generate random composite key

I have an array of strings, and an array of dates.
I need to use these as seeds to insert a composite key into a sqlite table.
So far I am doing this:
For dates (contains dates from now to past, user selects number of days)
For name (a unique random subset of a master array)
Insert
Is there a better way of doing this? (There always seems to be in Perl.)
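Use "rand(@array_name)" to get a random index into the array you have (the fractional result is truncated to an integer when used as an index), then use the element at that index. Note that "rand($#array_name)" would never select the last element, and that repeated calls can return the same index, so the value is random but not guaranteed to differ each time.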

Hashing : Insertion to a Deleted Slot

I'm new to hashing and here's my question:
Can you insert in a DELETED slot of the hash table?
Yes, you can insert to a deleted slot. But...
First, you should know that there is soft deletion and hard deletion. In a soft delete you just flip a flag to mark the slot as "deleted"; in a hard delete you empty the slot.
Here is why soft delete is needed. Suppose you're using a hash table with linear probing and your hash function maps 3 input values to the same slot. With linear probing, you place these three elements by advancing linearly through the table until you find an empty slot. If you then hard-delete one of the earlier elements, you break the table: a lookup stops at the emptied slot, so a value stored past it in the probe sequence becomes unreachable.
On the other hand, if you have a perfect hash function, hard delete is fine. A perfect hash function maps every input value to its own slot, so no probing scheme is needed and hard delete doesn't break the table.
Now, coming back to your question: you should also figure out how to avoid duplicate insertions, since the key you are inserting may already be stored further along the probe sequence, past the deleted slot.
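One way to handle both points is sketched below in Python, under my own assumptions: the class name LinearProbingSet and the EMPTY and DELETED sentinels are hypothetical, the table has a fixed capacity, and only keys (no values) are stored. Insertion remembers the first deleted slot it passes but keeps probing until it hits a truly empty slot, so it can detect a duplicate before reusing the deleted slot.

```python
EMPTY = object()    # slot has never held a key
DELETED = object()  # soft-deleted slot (tombstone)


class LinearProbingSet:
    """Toy open-addressing set with linear probing and soft deletion."""

    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        # Visit every slot once, starting at the key's home slot.
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def contains(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:  # the probe sequence ends here
                return False
            if slot is not DELETED and slot == key:
                return True
        return False

    def insert(self, key):
        first_free = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                if first_free is None:
                    first_free = i
                break  # key cannot be stored beyond an empty slot
            if slot is DELETED:
                if first_free is None:
                    first_free = i  # remember it, but keep looking
            elif slot == key:
                return False  # duplicate: already stored
        if first_free is None:
            raise RuntimeError("table is full")
        self.slots[first_free] = key
        return True

    def delete(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not DELETED and slot == key:
                self.slots[i] = DELETED  # soft delete keeps the chain intact
                return True
        return False
```

For instance, with capacity 8 the integer keys 1, 9, and 17 all start probing at slot 1 and end up in slots 1, 2, and 3. After delete(9), contains(17) still succeeds because the tombstone in slot 2 does not stop the probe, insert(17) is rejected as a duplicate, and insert(25) reuses slot 2.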