How does Scala's VectorMap work, and how is it different from ListMap?

How does Scala's VectorMap work? It is documented as constant time for lookup.
I think ListMap has to iterate through everything to find an entry. Why would VectorMap be different?
Is it a hash table combined with a vector, where the hash table maps a key to an index in the vector, which holds the entries?

Essentially, yes. It has a regular Map inside that maps each key to a tuple (index, value), where index points into a Vector of keys that is only used for in-order access (.head, .tail, .last, .iterator, etc.).
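The layout described above can be sketched roughly as follows. This is a simplified model, not the library source: `SimpleVectorMap`, `fields`, and `keysInOrder` are hypothetical names chosen for illustration, and removal (which needs extra bookkeeping in a real implementation) is left out.

```scala
// Hypothetical, simplified model of an insertion-ordered map:
// a hash map for lookups plus a vector of keys for ordered traversal.
final case class SimpleVectorMap[K, V](
    fields: Map[K, (Int, V)] = Map.empty[K, (Int, V)],
    keysInOrder: Vector[K] = Vector.empty[K]
) {
  // Lookup goes through the hash map only; the vector is not touched.
  def get(key: K): Option[V] = fields.get(key).map(_._2)

  def updated(key: K, value: V): SimpleVectorMap[K, V] =
    fields.get(key) match {
      // Existing key: keep its position, replace the value.
      case Some((index, _)) =>
        copy(fields = fields.updated(key, (index, value)))
      // New key: append it to the vector and remember its index.
      case None =>
        SimpleVectorMap(fields.updated(key, (keysInOrder.length, value)), keysInOrder :+ key)
    }

  // In-order access walks the vector and resolves each key through the map.
  def toList: List[(K, V)] = keysInOrder.toList.map(k => (k, fields(k)._2))
}

val ordered = SimpleVectorMap[String, Int]().updated("b", 2).updated("a", 1).updated("b", 20)
```

Here `ordered.get("b")` never scans the entries, while `ordered.toList` yields the pairs in insertion order. A ListMap-style structure, by contrast, would have to walk its list to find a key.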

Related

Key, Value, Hash and Hash function for HashTable

I'm having trouble understanding what the Hash Function does and doesn't do, as well as what exactly a Bucket is.
From my understanding:
A HashTable is a data structure that maps keys to values using a Hash Function.
A HashFunction is meant to map data from an array of arbitrary/unknown size to a data array of fixed size.
There can be duplicate Values in the original data array, but this is irrelevant.
Each Value will have a unique Key. Thus, each Key has exactly 1 Value.
The HashFunction will generate a HashCode for each (Value, Key) pair. However, Collisions can occur in which multiple (Value, Key) pairs map to the same HashCode.
This can be remedied by using either Chaining/Open Addressing methods.
The HashCode is the index value indicating the position of a particular entry from the original data array within the Bucket array.
The Bucket array is the fixed data array constructed that will contain the entries from the original array.
My questions:
How are the Keys generated for each value? Is the HashFunction meant to generate both Key and HashCode values for each entry? Does each Bucket thus contain only one entry (assuming a Chaining implementation to remedy Collision)?
How are the Keys generated for each value?
The key is not generated; you provide it, and it serves as the input to the hash function, which converts the key into an index into the hash table. Simply put:
H(key)=index
so the value you are looking for is:
hash_table[index] = value
Is the HashFunction meant to generate HashCode values for each entry?
It depends on the implementation of the hash function and the hash table. Some hash functions generate a hash code from the provided key and then take, for example, hashcode mod size, where size is the size of the hash table, to get the index. Others convert the key directly into an index. Either way, the ultimate goal of the hash function is to locate the sought data within the hash table in constant time.
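As a minimal sketch of the "hash code, then modulo" variant: `bucketIndex`, `tableSize`, and `slots` are hypothetical names, and the table below ignores collisions entirely, so it only illustrates how a key becomes an index.

```scala
// Reduce an arbitrary key to a bucket index in [0, size).
// floorMod keeps the result non-negative even when hashCode is negative.
def bucketIndex(key: Any, size: Int): Int =
  java.lang.Math.floorMod(key.hashCode, size)

val tableSize = 8
val slots = Array.fill[Option[(String, Int)]](tableSize)(None)

// H(key) = index, then hash_table[index] = value
slots(bucketIndex("apple", tableSize)) = Some(("apple", 3))
val found = slots(bucketIndex("apple", tableSize))
```

The key is supplied by the caller both when storing and when looking up; the hash function only decides where in the array to look.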
Does each Bucket thus contain only one entry (assuming a Chaining implementation to remedy Collision)?
Ideally each key would map to a unique index, but usually that is not the case: the number of buckets (i.e. indices) is typically far smaller than the number of possible keys, so collisions occur. With chaining, a bucket can therefore hold more than one entry, and the average chain length per bucket is (number of keys stored) / (number of buckets).
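A small worked example of that ratio, assuming integer keys and the illustrative hash k mod 4 (the names `bucketCount`, `storedKeys`, and `chains` are made up for this sketch):

```scala
val bucketCount = 4
val storedKeys = (1 to 10).toList

// Group keys by bucket index; each group is one chain.
val chains: Map[Int, List[Int]] = storedKeys.groupBy(k => k % bucketCount)

// Average chain length = keys stored / buckets = 10 / 4
val averageChainLength = storedKeys.length.toDouble / bucketCount
val longestChain = chains.values.map(_.length).max
```

With 10 keys in 4 buckets the average chain holds 2.5 entries, and no bucket can hold just one entry each.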

Complexity of insert in Hash Table

Consider an initially empty hash table of size M and hash function h(x) = x mod M. In the worst case, what is the time complexity (in Big-Oh notation) to insert n keys into the table if separate chaining is used to resolve collisions (without rehashing)? Suppose that each entry (bucket) of the table stores an unordered linked list. When adding a new element to an unordered linked list, such an element is inserted at the beginning of the list.
In the absence of collisions, inserting a key into a hash table/map is O(1), since looking up the bucket is a constant-time operation. I would not expect this to change when collisions occur, assuming that collisions are resolved using a linked list and that the new element is inserted at the head of the list, because adding a new element to the head of a linked list is also O(1). This assumes the insert does not first scan the chain for an existing copy of the key. So, inserting under these assumptions is O(1) per key, and therefore inserting n keys is O(n).
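A minimal sketch of that scheme, assuming integer keys and no duplicate check (`ChainedTable` and its members are hypothetical names):

```scala
// Separate chaining with h(x) = x mod M; new keys are prepended to the chain.
final class ChainedTable(m: Int) {
  private val buckets: Array[List[Int]] = Array.fill(m)(List.empty[Int])

  // O(1): one index computation plus one prepend, regardless of chain length.
  // Assumption: no check for an already-present key is performed.
  def insert(x: Int): Unit = {
    val i = java.lang.Math.floorMod(x, m)
    buckets(i) = x :: buckets(i)
  }

  def chain(i: Int): List[Int] = buckets(i)
}

val table = new ChainedTable(5)
// Worst case for collisions: every key lands in bucket 0.
List(0, 5, 10, 15).foreach(x => table.insert(x))
```

Even though all four keys collide, each insert touches only the head of the chain, so the total work stays proportional to the number of keys. Lookups in such a chain, on the other hand, would be linear in its length.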

How are same hash vs same key handled?

This question is not specific to any programming language, I am more interested in a generic logic.
Generally, associative maps take a key and map it to a value. As far as I know, implementations require the keys to be unique otherwise values get overwritten. Alright.
So let us assume that the above is done by some hash implementation.
What if two DIFFERENT keys get the same hash value? I am thinking of this in the form of an underlying array whose indices are the result of hashing said keys. It could be possible that more than one unique key gets mapped to the same index, yes? If so, how does such an implementation handle this?
How is handling same hash different from handling same key? Since same key results in overwriting and same hash HAS to retain the value.
I understand hashing with collision, so I know chaining and probing. Do implementations iterate over the current values which are hashed to a particular index and determine if the key is the same?
While I was searching for the answer I came across these links:
1. What happens when a duplicate key is put into a HashMap?
2. HashMap with multiple values under the same key
They don't answer my question however. How do we distinguish between same hash vs same key?
By comparing the keys. If you look at object-oriented implementations of hash maps, you'll find that they usually require two methods to be implemented on the key type:
bool equal(Key key1, Key key2);
int hash(Key key);
If only the hash function can be given and no equality function, that restricts the hash map to be based on the language's default equality. This is not always desirable as sometimes keys need to be compared with a different equality function. For example, if the keys are strings, an application may need to do a case-insensitive comparison, and then it would pass a hash function that converts to lowercase before hashing, and an equal function that ignores case.
The hash map stores the key alongside each corresponding value. (Usually, that's a pointer to the key object that was originally stored.) Any lookup into the hash map has to make a key comparison after finding a matching hash, to verify that the key actually matches.
For example, for a very simple hash map that stores a list in each bucket, the list would be a list of (key, value) pairs, and any lookup compares the keys of each list entry until it finds a match. In pseudocode:
Array<List<Pair<Key, Value>>> buckets;
Value lookup(Key k_sought) {
    // Reduce the hash to a valid bucket index.
    int h = hash(k_sought) % buckets.length;
    List<Pair<Key, Value>> bucket = buckets[h];
    // Compare the stored key of each entry until one matches.
    for (kv in bucket) {
        Key k_found = kv.0;
        Value v_found = kv.1;
        if (equal(k_sought, k_found)) {
            return v_found;
        }
    }
    throw Not_found;
}
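A runnable Scala sketch of the same idea, with the hash and equality functions supplied by the caller. `SimpleHashMap` and its parameter names are hypothetical, and the string-length hash is deliberately weak so that different keys collide.

```scala
// A bucket-list hash map parameterized by caller-supplied hash and equality functions.
final class SimpleHashMap[K, V](bucketCount: Int, hash: K => Int, equal: (K, K) => Boolean) {
  private val buckets: Array[List[(K, V)]] = Array.fill(bucketCount)(List.empty[(K, V)])

  private def indexOf(key: K): Int = java.lang.Math.floorMod(hash(key), bucketCount)

  // Same key (per `equal`): overwrite. Same hash but different key: both entries stay.
  def put(key: K, value: V): Unit = {
    val i = indexOf(key)
    val others = buckets(i).filterNot { case (k, _) => equal(k, key) }
    buckets(i) = (key, value) :: others
  }

  // Compare stored keys within the bucket until one matches.
  def get(key: K): Option[V] =
    buckets(indexOf(key)).collectFirst { case (k, v) if equal(k, key) => v }
}

// Deliberately weak hash (string length) to force collisions between different keys.
val byLength = new SimpleHashMap[String, Int](4, _.length, _ == _)
byLength.put("cat", 1)
byLength.put("dog", 2)   // same hash as "cat", different key: kept separately
byLength.put("cat", 10)  // same key: overwrites the earlier value

// Case-insensitive variant: hash and equality must agree on what "same key" means.
val ignoreCase = new SimpleHashMap[String, Int](
  4, _.toLowerCase.hashCode, (a, b) => a.toLowerCase == b.toLowerCase)
ignoreCase.put("Hello", 1)
```

After these calls, `byLength.get("cat")` sees the overwritten value while "dog" is still retrievable, and `ignoreCase.get("HELLO")` finds the entry stored under "Hello".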
You cannot tell what a key is from the index, so no, you cannot iterate over the values to find any information about the keys. You will either have to guarantee zero collisions or store the information that was hashed to give the index.
If you only have values stored in your structure, there is no way to tell whether they have the same key or just the same hash. You will need to store the key along with the value to know.

Cuckoo Hashing: What is the best way to detect collisions in hash functions?

I implemented a hashmap based on cuckoo hashing.
My hash functions take values of any length and return keys of type long. To match the keys to my array size n, I do key % n.
I'm thinking about following scenario:
Insert value A with key A.key into location A.key % n
Find value B with key A.key
So for this example I get the entry for value A and it is not recognized that value B hasn't even been inserted. This happens if my hash function returns the same key for two different values. Collisions with different keys but same locations are no problem.
What is the best way to detect those collisions?
Do I have to check every time I insert or search an item if the original values are equal?
As with most hashing schemes, in cuckoo hashing, the hash code tells you where to look in the table for the element in question, but the expectation is that you store both the key and the value in the table so that before returning the stored value, you first check the key stored at that slot against the key you're looking for. That way, if you get the same hash code for two objects, you can determine which object was stored at that slot.
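A rough sketch of a two-table cuckoo map that stores the key next to the value and checks it on every lookup. `CuckooMap`, `maxKicks`, and the second hash (bit-reversed hashCode) are assumptions made for illustration, not taken from the question, and a real implementation would rehash or grow instead of giving up.

```scala
final class CuckooMap[K, V](n: Int, maxKicks: Int = 32) {
  private val t1 = Array.fill[Option[(K, V)]](n)(None)
  private val t2 = Array.fill[Option[(K, V)]](n)(None)

  private def h1(k: K): Int = java.lang.Math.floorMod(k.hashCode, n)
  private def h2(k: K): Int =
    java.lang.Math.floorMod(java.lang.Integer.reverse(k.hashCode), n)

  // A slot only counts as a hit when the stored key equals the sought key.
  def get(k: K): Option[V] =
    t1(h1(k)).filter(_._1 == k).orElse(t2(h2(k)).filter(_._1 == k)).map(_._2)

  // Store the entry at slot i and return the previous occupant, if any.
  private def place(t: Array[Option[(K, V)]], i: Int, e: (K, V)): Option[(K, V)] = {
    val old = t(i)
    t(i) = Some(e)
    old
  }

  // Returns false if an entry is still homeless after maxKicks rounds;
  // in this sketch that displaced entry is simply dropped.
  def put(k: K, v: V): Boolean =
    if (t1(h1(k)).exists(_._1 == k)) { t1(h1(k)) = Some((k, v)); true }
    else if (t2(h2(k)).exists(_._1 == k)) { t2(h2(k)) = Some((k, v)); true }
    else {
      var pending: Option[(K, V)] = Some((k, v))
      var kicks = 0
      while (pending.nonEmpty && kicks < maxKicks) {
        // Try table 1, then move whatever was evicted into table 2.
        pending = pending.flatMap(e => place(t1, h1(e._1), e))
        pending = pending.flatMap(e => place(t2, h2(e._1), e))
        kicks += 1
      }
      pending.isEmpty
    }
}

val cuckoo = new CuckooMap[String, Int](8)
cuckoo.put("Aa", 1)
// "BB" has the same hashCode as "Aa" but has not been inserted yet.
val beforeInsert = cuckoo.get("BB")
cuckoo.put("BB", 2)
```

Because the stored key is compared on lookup, `beforeInsert` is empty even though "BB" hashes to the slot occupied by "Aa". After the second `put`, both keys are retrievable from their respective tables.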

Overriding Ordering[Int] in Scala

I'm trying to sort an array of integers with a custom ordering.
E.g.
quickSort[Int](indices)(Ordering.by[Int, Double](value(_)))
Basically, I'm trying to sort the indices of rows by the values of a particular column. I end up with a stack overflow error when I run this on a fairly large data set. If I use a more direct approach (e.g. sorting tuples), this is not a problem.
Is there a problem if you try to extend the default Ordering[Int]?
You can reproduce this like this:
val indices = (0 to 99999).toArray
val values = Array.fill[Double](100000)(math.random())
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values(_))) // Works
val values2 = Array.fill[Double](100000)(0.0)
scala.util.Sorting.quickSort[Int](indices)(Ordering.by[Int, Double](values2(_))) // Fails
Update:
I think that I found out what the problem is (am answering my own question). It seems that I've created a paradoxical situation by changing the ordering definition of integers.
Within the quickSort algorithm itself, array positions are also integers, and there are certain statements comparing positions of arrays. This position comparison should be following the standard integer ordering.
But because of the new definition, now these position comparators are also following the indexed value comparator and things are getting really messed up.
I suppose that, at least for the time being, I shouldn't change the default ordering of value types, as the library might depend on it.
Update 2:
It turns out that the above is in fact not the problem, and there is a bug in quickSort when it is used together with a custom Ordering. When a new Ordering is defined, equality under that Ordering is given by 'equiv', but quickSort uses '=='. As a result, the indices themselves are compared rather than the values they index.
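The difference between '==' and 'equiv' can be seen in a small sketch. `orderingByValue` is a hypothetical helper that plays the role of Ordering.by[Int, Double](values(_)) from the question, written out explicitly so that no implicit Ordering[Double] is needed.

```scala
// Order array indices by the values they point at.
def orderingByValue(values: Array[Double]): Ordering[Int] = new Ordering[Int] {
  def compare(a: Int, b: Int): Int = java.lang.Double.compare(values(a), values(b))
}

val tied = Array.fill[Double](5)(0.0)
val byTied = orderingByValue(tied)

// Indices 0 and 1 are different integers...
val sameInt = 0 == 1
// ...but they are equivalent under the custom ordering, because tied(0) == tied(1).
val sameUnderOrdering = byTied.equiv(0, 1)

// Sorting indices by value with an explicitly passed ordering.
val data = Array(0.3, 0.1, 0.2)
val sortedIndices = List(0, 1, 2).sorted(orderingByValue(data))
```

A sort that mixes the two notions of equality can disagree with itself about whether two elements are equal, which is consistent with the failure showing up only when all values are tied. Passing the ordering explicitly to a sort, as in the last line, leaves the default integer ordering untouched everywhere else.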