How does the hash part in hash maps work?

So there is this nice picture in the hash maps article on Wikipedia:
Everything clear so far, except for the hash function in the middle.
How can a function generate the right index from any string? Are the indexes integers in reality too? If yes, how can the function output 1 for John Smith, 2 for Lisa Smith, etc.?

That's one of the key problems of hashmaps, dictionaries and so on: you have to choose a good hash function. A very bad but fast hash function could be the length of the key. You can see instantly that you will get a lot of collisions (different keys, but the same hash). Another bad hash function could be the ASCII value of the first character of your key. Lots of collisions, too.
So you need a function that is a lot better than those two. You could, for instance, combine (XOR) the ASCII values of all characters in the key and mix in the length. In practice you often derive the hash from the values (fields) of the object that you want to hash (same values give the same hash => value type). For reference types you can mix in a memory location, for instance.
Your example is simplified a lot. No real hash function would map these keys to sequential numbers.
Maybe you want to read one of my previous answers to hashmaps
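The weak hash functions mentioned in this answer can be compared side by side. The following is only a rough Python sketch: the function names and the sample keys are made up for illustration, and the XOR variant is just one possible reading of "xor all ASCII values and mix the length in".

```python
from functools import reduce

def hash_length(key: str) -> int:
    """Very weak: every key of the same length collides."""
    return len(key)

def hash_first_char(key: str) -> int:
    """Also weak: every key with the same first character collides."""
    return ord(key[0])

def hash_xor_mix(key: str) -> int:
    """Somewhat better: XOR all character codes, then mix in the length."""
    xored = reduce(lambda acc, ch: acc ^ ord(ch), key, 0)
    return xored ^ (len(key) << 8)

def distinct_hashes(hash_fn, keys) -> int:
    """Count how many different hash values the keys produce."""
    return len({hash_fn(k) for k in keys})

# Hypothetical sample keys, chosen only for this illustration.
keys = ["John Smith", "Lisa Smith", "Sam Doe", "Sandra Dee", "Ted Baker"]

for fn in (hash_length, hash_first_char, hash_xor_mix):
    print(fn.__name__, distinct_hashes(fn, keys), "distinct values for", len(keys), "keys")
```

Note that the XOR variant still ignores character order, so anagrams such as "ab" and "ba" collide; it is better than the first two, not good.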

A simple hash function may be as follows:
$hash = ord($string[0]) % HASH_TABLE_SIZE;
This function will return a number between 0 and HASH_TABLE_SIZE - 1, depending on the first letter of the string. This number can be used to go to the correct position in the hash table.
A real hash function will consider all letters in a string, and it will be designed so that there is an even spread among the buckets.

The hash function most often (but not necessarily always) outputs an integer within a wanted range (the range is often a parameter to the hash function). This integer can be used as an index. Notice that a hash function cannot be guaranteed to always produce a unique result when given different data to hash. This is called a hash collision, and a hash table implementation must always handle it in some way.
As for your specific question, how a string becomes a number: any string is composed of characters (J, o, h, n ...) and characters can be interpreted as numbers (in computers). The ASCII and UTF standards bind certain values to certain characters, so the result is deterministic and always the same on all computers. So the hash function performs some operation on these characters, treating them as numbers, and comes up with another number (the output). You could, for example, simply sum all the values and use the modulo operation to range-limit the resulting value.
This would be quite a horrible hashing function because, for example, "ab" and "ba" would get the same result. Designing a hash function is difficult, so one should use a ready-made algorithm unless the situation dictates some other solution.
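To make the "ab" versus "ba" problem concrete, here is a small Python sketch. The table size and the multiplier 31 are arbitrary assumptions for illustration; the second function is one common way to make character position matter, not a recommendation of a specific production algorithm.

```python
def sum_hash(key: str, table_size: int) -> int:
    """Sum the character codes and range-limit with modulo (ignores order)."""
    return sum(ord(ch) for ch in key) % table_size

def polynomial_hash(key: str, table_size: int, multiplier: int = 31) -> int:
    """Multiply the running value before adding each character, so position matters."""
    value = 0
    for ch in key:
        value = (value * multiplier + ord(ch)) % table_size
    return value

TABLE_SIZE = 101  # hypothetical table size

# The plain sum cannot tell the two keys apart.
print(sum_hash("ab", TABLE_SIZE) == sum_hash("ba", TABLE_SIZE))                # True
# The position-sensitive variant separates them.
print(polynomial_hash("ab", TABLE_SIZE) == polynomial_hash("ba", TABLE_SIZE))  # False
```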

There's a really good article on how hash functions (and collision detection/resolution) work on MSDN:
Part 2: The Queue, Stack, and Hashtable
You can skip down to the header Compressing Ordinal Indexing with a Hash Function
There are some bits and pieces that are .NET specific (when they talk about which Hash algorithm .NET uses by default) but for the most part it is language agnostic.

All that is required of a hash function is that it returns the same integer given the same key. Technically, a hash function that always returns '1' is not incorrect.
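A quick way to see why a constant hash is "not incorrect" but still a poor choice is to plug it into a toy hash map. The class below is a minimal separate-chaining sketch written for this illustration (all names are hypothetical): lookups stay correct with any deterministic hash function, but with a constant one every entry lands in the same bucket and lookups degrade to a linear scan.

```python
class ChainedHashMap:
    """Tiny separate-chaining map; the hash function is supplied by the caller."""

    def __init__(self, hash_fn, bucket_count=8):
        self.hash_fn = hash_fn
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (stored_key, _) in enumerate(bucket):
            if stored_key == key:          # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for stored_key, value in self._bucket(key):
            if stored_key == key:
                return value
        raise KeyError(key)

    def longest_chain(self):
        return max(len(bucket) for bucket in self.buckets)

def constant_hash(key):
    """Valid (same key, same result) but useless for spreading keys."""
    return 1

def polynomial_hash(key):
    """A simple position-sensitive string hash, used here only for comparison."""
    value = 0
    for ch in key:
        value = value * 31 + ord(ch)
    return value

for fn in (constant_hash, polynomial_hash):
    table = ChainedHashMap(fn)
    for i in range(20):
        table.put(f"key{i}", i)
    print(fn.__name__, "lookup ok:", table.get("key7") == 7,
          "longest chain:", table.longest_chain())
```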

Related

How does Snowflake calculate its HASH() output?

Take a look at this query
select
hash( col1, col2 ) as a,
col1||col2 as b, -- just taking a guess as to how hash can take multiple values
hash( b ) as c
from table_name
The result for a and c are different.
So, my question is: how does Snowflake calculate the hash when there are several fields, as in a? Is it concatenating the fields first, and then hashing the result?
Thank you
More to NickW's point that HASH is proprietary
HASH is a proprietary function that accepts a variable number of input expressions of arbitrary types and returns a signed value. It is not a cryptographic hash function and should not be used as such.
I assume the core of what you are trying to achieve is to "make a value in another system, and be able to compare these safely". For that, concatenating strings together seems very dangerous, because the number and length of the individual strings is itself a property of the input and is lost in the concatenation.
The usage notes section has some good hints:
Any two values of type NUMBER that compare equally will hash to the same hash value, even if the respective types have different precision and/or scale.
This implies that numeric values are converted to a common form before hashing, but the notes also say this about conversion:
Note that this guarantee does not apply to other combinations of types, even if implicit conversions exist between the types.
It would really help if you described what you want to achieve. Then it would be easier to answer whether knowing how HASH works is the best path to that end or, as I would suggest, it is not.
In other words, this answer is really a long-form question, suggesting the original question needs to be reworked.
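The concern about concatenation raised above can be shown with a short Python sketch. This says nothing about how Snowflake's HASH works internally (that is proprietary); it uses SHA-256 from Python's hashlib purely as a stand-in, and the function names are hypothetical. It only shows that plain concatenation loses the field boundaries, while length-prefixing each field keeps them.

```python
import hashlib

def concat_digest(*fields: str) -> str:
    """Hash the plain concatenation of the fields (boundaries are lost)."""
    joined = "".join(fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def length_prefixed_digest(*fields: str) -> str:
    """Hash each field preceded by its byte length, so boundaries are preserved."""
    h = hashlib.sha256()
    for field in fields:
        data = field.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()

# ("ab", "c") and ("a", "bc") both concatenate to "abc".
print(concat_digest("ab", "c") == concat_digest("a", "bc"))                    # True
print(length_prefixed_digest("ab", "c") == length_prefixed_digest("a", "bc"))  # False
```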

Any way to get original data from hashed values in Snowflake?

I have a table which uses the Snowflake hash function to store values in some columns.
Is there any way to reverse the hashing and get the original values from the table?
As per the documentation, the function is "not a cryptographic hash function", and will always return the same result for the same input expression.
Example :
select hash(1) always returns -4730168494964875235
select hash('a') always returns -947125324004678632
select hash('1234') always returns -4035663806895772878
I was wondering if there is any way to reverse the hashing and get the original input expression from the hashed values.
I think these disclaimers are for preventing potential legal disputes:
Cryptographic hash functions have a few properties which this function
does not, for example:
The cryptographic hashing of a value cannot be inverted to find the
original value.
It's not possible to reverse a hash value in general. If you consider that even a very long text is represented by a 64-bit value, it's obvious that the data is not preserved. On the other hand, if you use a brute-force technique, you may find a value that produces the hash, and that can be counted as reversing the hash value.
For example, if you store the hash values for all numbers between 0 and 5000 in a table, then when I come to you with the hash value '-7875472545445966613', you can look up that value in your table and say it belongs to the number 1000.
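The lookup-table idea described above might look roughly like this in Python. Since Snowflake's HASH is proprietary, the sketch uses a hypothetical stand-in (stand_in_hash, built from BLAKE2b truncated to 64 bits), so the numbers it produces will not match Snowflake's output; only the brute-force technique is the point.

```python
import hashlib

def stand_in_hash(value) -> int:
    """Hypothetical 64-bit signed hash; this is NOT Snowflake's HASH()."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)

# Precompute the hash of every candidate input (here: the numbers 0..5000).
lookup = {stand_in_hash(n): n for n in range(0, 5001)}

def reverse_lookup(hash_value):
    """Return the candidate that produced hash_value, or None if it is not in the table."""
    return lookup.get(hash_value)

target = stand_in_hash(1000)
print(reverse_lookup(target))                 # the candidate that hashes to target
print(reverse_lookup(stand_in_hash(999999)))  # input outside the candidate set
```

This only works when the set of possible inputs is small enough to enumerate, which is the limitation the answer describes.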

Cuckoo Hashing: What is the best way to detect collisions in hash functions?

I implemented a hashmap based on cuckoo hashing.
My hash functions take values of any length and return keys of type long. To match the keys to my array size n, I do key % n.
I'm thinking about the following scenario:
Insert value A with key A.key into location A.key % n
Find value B with key A.key
So for this example I get the entry for value A and it is not recognized that value B hasn't even been inserted. This happens if my hash function returns the same key for two different values. Collisions with different keys but same locations are no problem.
What is the best way to detect those collisions?
Do I have to check every time I insert or search an item if the original values are equal?
As with most hashing schemes, in cuckoo hashing, the hash code tells you where to look in the table for the element in question, but the expectation is that you store both the key and the value in the table so that before returning the stored value, you first check the key stored at that slot against the key you're looking for. That way, if you get the same hash code for two objects, you can determine which object was stored at that slot.
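A minimal Python sketch of that idea follows, assuming a two-table cuckoo layout where both slot positions are derived from one hash code. The class names, the table size and the way the second position is derived are assumptions made for this illustration, not the asker's implementation. Each slot stores the (key, value) pair, and lookups compare the stored key before returning the value.

```python
class Key:
    """Hypothetical key type whose hash code can be forced to collide."""
    def __init__(self, name, code):
        self.name, self.code = name, code
    def __hash__(self):
        return self.code
    def __eq__(self, other):
        return isinstance(other, Key) and self.name == other.name

class CuckooMap:
    def __init__(self, size=11, max_kicks=32):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def _slots(self, key):
        h = hash(key)
        return (h % self.size, (h // self.size) % self.size)

    def get(self, key, default=None):
        for table, idx in zip(self.tables, self._slots(key)):
            entry = table[idx]
            # Compare the stored key, not just the slot position.
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        for table, idx in zip(self.tables, self._slots(key)):
            entry = table[idx]
            if entry is not None and entry[0] == key:
                table[idx] = (key, value)   # update existing key
                return
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            idx = self._slots(entry[0])[which]
            # Place the entry and pick up whatever was there before.
            entry, self.tables[which][idx] = self.tables[which][idx], entry
            if entry is None:
                return
            which = 1 - which
        # Simplification: a real table would rehash or grow here instead.
        raise RuntimeError("too many displacements")

a, b = Key("A", 42), Key("B", 42)   # different keys, same hash code
m = CuckooMap()
m.put(a, "value A")
print(m.get(a))   # value A
print(m.get(b))   # None, because the stored key does not match
```

So yes, the key comparison happens on every lookup and insert; it is a cheap equality check on at most two slots.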

Foreach on hash variables in Perl

I am new to Perl scripting and have a doubt on foreach on hash variables. I want to print all values of my hash. Here's a program:
%colors = (a => 1, b=>2, c=>3, d=>4, e=>5);
foreach $colors(keys %colors)
{
print "$colors{$colors} \n";
}
The output is:
5
3
1
2
4
Why are the values sorted randomly? Or what's the logic behind this randomness?? Please clarify my doubt.
I think that your confusion lies in not knowing exactly what a Hash is. Most languages have something analogous to a key-value store, in Ruby and Perl they are called Hashes, in Java Maps, in Python dictionaries, etc...
They are all essentially the same thing: you insert a value with a unique key into some underlying data structure to gain direct access to it, at the cost of memory.
So what actually happens when you add a key and a value to a hash?
Hashes are built around the idea of hash functions, which take some value as input and compute an output (ideally every input has its own unique output). If two inputs map to the same output, this is called a collision.
Now we are at the point where we need to talk about how the Hash is implemented. The two classic examples are a single array and an array of linked lists. I will show the array example below.
Array
In the simple array case the data structure underlying the Hash is just an array of some size. The hashing function is used to compute an index into that array. If we assume a simple hashing algorithm
h(x) = length(x) % ARRAY_SIZE
Here x is a string and ARRAY_SIZE is the size of our underlying array. The modulo makes sure that h(x) falls in the range 0..ARRAY_SIZE - 1 for every x.
To look at a visual example consider an array of size 5:
   0     1     2     3     4
-------------------------------
|     |     |     |     |     |
-------------------------------
and assume we are trying to store the value 5 using the key abcd. According to our hashing algorithm:
h('abcd') = length('abcd') % ARRAY_SIZE
          = 4 % 5
          = 4
So the value 5 will be stored at index 4:
   0     1     2     3     4
-------------------------------
|     |     |     |     |  5  |
-------------------------------
Now what would happen if we were to try to store the value 3 using the key dcba? The two keys are different, right? They should map to different places.
h('dcba') = length('dcba') % ARRAY_SIZE
          = 4 % 5
          = 4
Oops! This key also maps to index 4, so what are we going to do now? Well, we can't just throw away the key-value pair, because the programmer obviously needs/wants this pairing in their Hash, so we need to decide what to do in the event of a collision. There are many algorithms that do this, but the simplest one is to look for the next open slot in the array, wrapping around to the start when the end is reached, and store 3 there. So now our array looks like:
   0     1     2     3     4
-------------------------------
|  3  |     |     |     |  5  |
-------------------------------
This was not an extremely in-depth explanation, but hopefully it gives some insight into why retrieving values from Hashes seems random: it's because the layout of the underlying data structure is determined by the hash function and by collisions, and it changes as entries are added. If you were to ask for the values from your Hash right now you would probably get back (3, 5), even though you inserted 5 first, only because 3 occurs first in the array.
Hope this was helpful.
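The walk-through above can be reproduced with a few lines of Python. This is only a sketch of the simplified scheme described in the answer (length-based hash, next-open-slot collision handling with wrap-around), not how Perl implements its hashes; the function names are made up, and deletions are left out to keep it short.

```python
ARRAY_SIZE = 5

def h(key: str) -> int:
    """The deliberately simple hash from the example: key length modulo array size."""
    return len(key) % ARRAY_SIZE

def insert(slots, key, value):
    """Store (key, value) at h(key), or at the next open slot, wrapping around."""
    index = h(key)
    for step in range(ARRAY_SIZE):
        probe = (index + step) % ARRAY_SIZE
        if slots[probe] is None or slots[probe][0] == key:
            slots[probe] = (key, value)
            return probe
    raise RuntimeError("table is full")

def lookup(slots, key):
    """Follow the same probe sequence and compare the stored keys."""
    index = h(key)
    for step in range(ARRAY_SIZE):
        probe = (index + step) % ARRAY_SIZE
        if slots[probe] is None:      # empty slot: the key was never inserted
            break
        if slots[probe][0] == key:
            return slots[probe][1]
    raise KeyError(key)

slots = [None] * ARRAY_SIZE
insert(slots, "abcd", 5)   # h = 4, stored at index 4
insert(slots, "dcba", 3)   # h = 4 is taken, wraps around to index 0

# Walking the array front to back yields 3 before 5, despite the insertion order.
print([entry[1] for entry in slots if entry is not None])   # [3, 5]
print(lookup(slots, "dcba"))                                # 3
```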
Quoting perldata - Perl data types:
Hashes are unordered collections of scalar values indexed by their associated string key.
You can sort the keys, or, if you want to preserve the order given in the initialization, use Tie::Hash::Indexed or Tie::IxHash.
The description of keys in the perldoc has the following snippet:
Hash entries are returned in an apparently random order. The actual random order is specific to a given hash; the exact same series of operations on two hashes may result in a different order for each hash. Any insertion into the hash may change the order, as will any deletion, with the exception that the most recent key returned by each or keys may be deleted without changing the order. So long as a given hash is unmodified you may rely on keys, values and each to repeatedly return the same order as each other. See Algorithmic Complexity Attacks in perlsec for details on why hash order is randomized. Aside from the guarantees provided here the exact details of Perl's hash algorithm and the hash traversal order are subject to change in any release of Perl.
Perlsec says the following about Hash Algorithms:
Hash Algorithm - Hash algorithms like the one used in Perl are well known to be vulnerable to collision attacks on their hash function. Such attacks involve constructing a set of keys which collide into the same bucket producing inefficient behavior. Such attacks often depend on discovering the seed of the hash function used to map the keys to buckets. That seed is then used to brute-force a key set which can be used to mount a denial of service attack. In Perl 5.8.1 changes were introduced to harden Perl to such attacks, and then later in Perl 5.18.0 these features were enhanced and additional protections added.
At the time of this writing, Perl 5.18.0 is considered to be well-hardened against algorithmic complexity attacks on its hash implementation. This is largely owed to the following measures mitigate attacks:
Hash Seed Randomization
In order to make it impossible to know what seed to generate an attack key set for, this seed is randomly initialized at process start. This may be overridden by using the PERL_HASH_SEED environment variable, see PERL_HASH_SEED in perlrun. This environment variable controls how items are actually stored, not how they are presented via keys, values and each.
Hash Traversal Randomization
Independent of which seed is used in the hash function, keys, values, and each return items in a per-hash randomized order. Modifying a hash by insertion will change the iteration order of that hash. This behavior can be overridden by using hash_traversal_mask() from Hash::Util or by using the PERL_PERTURB_KEYS environment variable, see PERL_PERTURB_KEYS in perlrun. Note that this feature controls the "visible" order of the keys, and not the actual order they are stored in.
Bucket Order Perturbance
When items collide into a given hash bucket the order they are stored in the chain is no longer predictable in Perl 5.18. This has the intention to make it harder to observe a collisions. This behavior can be overridden by using the PERL_PERTURB_KEYS environment variable, see PERL_PERTURB_KEYS in perlrun.
New Default Hash Function
The default hash function has been modified with the intention of making it harder to infer the hash seed.
Alternative Hash Functions
The source code includes multiple hash algorithms to choose from. While we believe that the default perl hash is robust to attack, we have included the hash function Siphash as a fall-back option. At the time of release of Perl 5.18.0 Siphash is believed to be of cryptographic strength. This is not the default as it is much slower than the default hash.
Without compiling a special Perl, there is no way to get the exact same behavior of any versions prior to Perl 5.18.0. The closest one can get is by setting PERL_PERTURB_KEYS to 0 and setting the PERL_HASH_SEED to a known value. We do not advise those settings for production use due to the above security considerations.
Perl has never guaranteed any ordering of the hash keys, and the ordering has already changed several times during the lifetime of Perl 5. Also, the ordering of hash keys has always been, and continues to be, affected by the insertion order and the history of changes made to the hash over its lifetime.
Also note that while the order of the hash elements might be randomized, this "pseudo-ordering" should not be used for applications like shuffling a list randomly (use List::Util::shuffle() for that, see List::Util, a standard core module since Perl 5.8.0; or the CPAN module Algorithm::Numerical::Shuffle ), or for generating permutations (use e.g. the CPAN modules Algorithm::Permute or Algorithm::FastPermute ), or for any cryptographic applications.
You can use sort to print the results sorted alphabetically by key (your keys are alphanumeric), like this:
%colors = ("a" => 1, "b"=>2, "c"=>3, "d"=>4, "e"=>5);
foreach (sort keys %colors) {
print $colors{$_} . "\n";
}
Alternatively if you prefer to sort by the values:
%colors = ("a" => 1, "b"=>2, "c"=>3, "d"=>4, "e"=>5);
foreach (sort { $colors{$a} <=> $colors{$b} } keys %colors) {
print $colors{$_} . "\n";
}

Our purpose on hashing

Isn't our purpose randomness when using hashing functions? So why don't we use the rand() function instead of doing operations on the elements (like hashVal = 37*hashVal + key[i])?
Isn't our purpose randomness when using hashing functions?
No. Technically, our purpose when using hash functions is to map a large set of values, called keys, to a set of values called hashes. A hash function need not be one-to-one, that is, more than one key may map to the same hash. However, it must always map a particular key to the same hash.
For example, if hash("Hello, world") = 5, then it must always be 5, no matter how many times you hash the string. Therefore using rand() in the way you are suggesting won't work, because it would map the same key to different hashes each time.
A good hash function, however, does try to spread its keys over the hashes as if they were assigned at random. This is not the same thing as a random number. What it means is that, on average, each hash has a roughly equal number of keys mapping to it. Each key, however, is still mapped to the same hash every time.
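The difference between "deterministic but well spread" and "random" can be checked with a short Python sketch. It uses the formula from the question (hashVal = 37*hashVal + key[i]); the table size, the sample keys and the function names are assumptions for illustration only.

```python
import random
from collections import Counter

def string_hash(key: str, table_size: int) -> int:
    """The formula from the question, range-limited to the table size."""
    hash_val = 0
    for ch in key:
        hash_val = 37 * hash_val + ord(ch)
    return hash_val % table_size

def random_slot(key: str, table_size: int) -> int:
    """What using rand() would amount to: the key is ignored entirely."""
    return random.randrange(table_size)

TABLE_SIZE = 10  # hypothetical table size

# Hashing the same key repeatedly always gives the same slot...
hashed = {string_hash("Hello, world", TABLE_SIZE) for _ in range(200)}
# ...while rand() scatters the same key over many slots, so it could never be found again.
randomized = {random_slot("Hello, world", TABLE_SIZE) for _ in range(200)}
print(len(hashed), len(randomized))

# Bucket counts for a batch of made-up keys, to inspect how evenly they spread.
keys = [f"user{i}" for i in range(1000)]
counts = Counter(string_hash(k, TABLE_SIZE) for k in keys)
print(sorted(counts.items()))
```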
thb's answer also illustrates this.
Good question. It depends on what one means by random.
A hash maps keys to arbitrary values -- ideally to values among which no pattern is apparent. For example:
'A' => 15
'B' => 97
'C' => 43
'D' => 60
'E' => 41
However, a hash always maps the same key to the same value. Hence:
"BED" => [97 41 60]
"BEDE" => [97 41 60 41]
Every time you give the hash an 'E', it always hashes it as 41, never as another value.
Additional note
Significant though secondary to the present discussion is that hashes need not afford unique values. For example, this is possible:
'F' => 41
Thus, given the hashed value 41, one cannot say whether the key was 'E' or 'F'.
(All this naturally suggests the question: "Fine. But what are hashes for?" That however would be another question for another time, not the question the OP has asked.)
A hash function is used for mapping, not for generating random numbers, so we can't use a random function where a hash is needed. The hash value for a given input is always the same.
The main goal of Hash functions is to accelerate table lookup or data comparison tasks.
Difference:
key = hash(A valid input), key is deterministic output
num = random(A valid input), num is undetermined output
There are similarities between random functions and hash functions, both have a determined seed and return a value.
Hashing is returning a number based upon the object's data. So yes, it does usually provide a very large number but that number is based upon the object itself.
An identical object with the same data, depending on the hash function, will return an identical value. Because of this, hashing is not ideal for returning unique numbers as far as 'randomness' is concerned.