Any way to get orginal data from hashed values in snowflake? - hash

I have a table which uses the snowflake hash function to store values in some columns.
Is there any way to reverse the encrytion from the hash function and get the original values from the table?
As per the documentation, the function is not "not a cryptographic hash function", and will always return the same result for the same input expression.
Example :
select hash(1) always returns -4730168494964875235
select hash('a') always returns -947125324004678632
select hash('1234') always returns -4035663806895772878
I was wondering if there is any way to reverse the hashing and get the original input expression from the hashed values.

I think these disclaimers are for preventing potential legal disputes:
Cryptographic hash functions have a few properties which this function
does not, for example:
The cryptographic hashing of a value cannot be inverted to find the
original value.
It's not possible to reserve a hash value in general. If you consider that when you even send a very long text, and it is represented in a 64-bit value, it's obvious that the data is not preserved. On the other hand, if you use a brute force technique, you may find the actual value producing the hash, and it can be counted as reserving the hash value.
For example, if you store all hash values for the numbers between 0 and 5000 in a table, when I came with hash value '-7875472545445966613', you can look up that value in your table, and say it belongs to 1000 (number).

Related

How does Snowflake calculate its HASH() output?

Take a look at this query
select
hash( col1, col2 ) as a,
col1||col2 as b, -- just taking a guess as to how hash can take multiple values
hash( b ) as c
from table_name
The result for a and c are different.
So, my question is: how does Snowflake calculate the hash when there are many fields like in a? Is it concatinating the fields first, and then signing that result of that?
Thank you
More to NickW's point that HASH is proprietary
HASH is a proprietary function that accepts a variable number of input expressions of arbitrary types and returns a signed value. It is not a cryptographic hash function and should not be used as such.
I assume the core of the problem you are trying to achieve, is to "make a value in another system, and be able to compare these "safely", of which concatenating strings together, seems very dangerous, as the number and length of each string is a property of those strings.
The usage notes section has some good hints:
Any two values of type NUMBER that compare equally will hash to the same hash value, even if the respective types have different precision and/or scale.
this implies that things are converted to this form.. but it also notes on convertion:
Note that this guarantee does not apply to other combinations of types, even if implicit conversions exist between the types.
What really would help is for you to describe, what you want to happen for you, then if "knowing how HASH works" is the best path to that end, OR not as I would suggest, would be more answerable.
Aka, this answer is a long form question, suggesting this question needs to be reworked.

Key, Value, Hash and Hash function for HashTable

I'm having trouble understanding what the Hash Function does and doesn't do, as well as what exactly a Bucket is.
From my understanding:
A HashTable is a data structure that maps keys to values using a Hash Function.
A HashFunction is meant to map data from an array of arbitrary/unknown size to a data array of fixed size.
There can be duplicate Values in the original data array, but this is irrelevant.
Each Value will have a unique Key. Thus, each Key has exactly 1 Value.
The HashFunction will generate a HashCode for each (Value, Key) pair. However, Collisions can occur in which multiple (Value, Key) pairs map to the same HashCode.
This can be remedied by using either Chaining/Open Addressing methods.
The HashCode is the index value indicating the position of a particular entry from the original data array within the Bucket array.
The Bucket array is the fixed data array constructed that will contain the entries from the original array.
My questions:
How are the Keys generated for each value? Is the HashFunction meant to generate both Key and HashCode values for each entry? Does each Bucket thus contain only one entry (assuming a Chaining implementation to remedy Collision)?
How are the Keys generated for each value?
Key is not generated, it is provided by you and serves as an input to the hash function which in turn converts that key into index of hash table. Simply speaking:
H(key)=index
so the value you are looking for is:
hash_table[index] = value
Is the HashFunction meant to generate HashCode values for each entry?
It all depends on the implementation of hash function and hash table. Some hash functions might generate a hashcode out of provided key and then for example take its modulo(size) where size is the size of hash table, in order to get the index. Others might convert the key directly into index. In either case the ultimate goal of hash function is to find the location of searched data within hash table in constant time.
Does each Bucket thus contain only one entry (assuming a Chaining implementation to remedy Collision)?
Ideally each key should be mapped to a unique index but mostly that's not the case since the number of buckets (i.e. indices) is far smaller than the number of keys so the average length of a chain per bucket (i.e. number of collisions per bucket) is no.of keys/no.of indices

Cuckoo Hashing: What is the best way to detect collisions in hash functions?

I implemented a hashmap based on cuckoo hashing.
My hash functions take values of any length and return keys of type long. To match the keys to my array size n, I do key % n.
I'm thinking about following scenario:
Insert value A with key A.key into location A.key % n
Find value B with key A.key
So for this example I get the entry for value A and it is not recognized that value B hasn't even been inserted. This happens if my hash function returns the same key for two different values. Collisions with different keys but same locations are no problem.
What is the best way to detect those collisions?
Do I have to check every time I insert or search an item if the original values are equal?
As with most hashing schemes, in cuckoo hashing, the hash code tells you where to look in the table for the element in question, but the expectation is that you store both the key and the value in the table so that before returning the stored value, you first check the key stored at that slot against the key you're looking for. That way, if you get the same hash code for two objects, you can determine which object was stored at that slot.

Hashing of timestamp

I need a hash function(maybe I should not call that a "hash" function) that:
1.is used for hashing timestamps only;
2.there exist a reverse function that I can restore the timestamp through that function;
3.does not generate duplicate hash value;
4.whether not it is a hash function, it is nearly as fast as a hash function;
PS: About the data type of timestamp --- image that as a 4 bytes "long" type in C.
Is that possible?
(I need the timestamp to be a secret. --- In fact, I need the hash value as a session id and the original timestamp as an index in my database. Whenever user request something with the session id, I can get the timestamp as an index to get the request info.)
If you can skip #2 MurmurHash might be a good option:
https://sites.google.com/site/murmurhash/
(2) If you must crypt/decrypt there are standard implementations of the various algorithms for most languages (AES, for instance). This will be much slower than hashing.
If you don't actually need this to secure the data (which begs the question: why bother at all with any conversion?) and just want to make some non-timestamp-looking string that is easily reversible (by you -- and anyone else) then check this question:
Rot13 for numbers

How does the hash part in hash maps work?

So there is this nice picture in the hash maps article on Wikipedia:
Everything clear so far, except for the hash function in the middle.
How can a function generate the right index from any string? Are the indexes integers in reality too? If yes, how can the function output 1 for John Smith, 2 for Lisa Smith, etc.?
That's one of the key problems of hashmaps/dictionaries and so on. You have to choose a good hash function. A very bad but fast hash function could be the length of the keys. You instantly see, that you will get a lot of collisions (different keys, but same hash). Another bad hash function could be the ASCII value of the first character of your key. Lot's of collisions, too.
So you need a function that is a lot better than those two. You could add (xor) all ASCII values of the key characters and mix the length in for instance. In practice you often depend on the values (fields) of the object that you want to hash (same values give same hash => value type). For reference types you can mix in a memory location for instance.
In your example that's just simplified a lot. No real hash function would map these keys to sequential numbers.
Maybe you want to read one of my previous answers to hashmaps
A simple hash function may be as follows:
$hash = $string[0] % HASH_TABLE_SIZE;
This function will return a number between 0 and HASH_TABLE_SIZE - 1, depending on the first letter of the string. This number can be used to go to the correct position in the hash table.
A real hash function will consider all letters in a string, and it will be designed so that there is an even spread among the buckets.
The hash function most often (but not necessarily always) outputs an integer within wanted range (often parameter to the hash function). This integer can be used as an index. Notice that hash function cannot be guaranteed to always produce unique result when given different data to hash. This is called hash collision and hash algorithm must always handle it in some way.
As for your specific question, how a string becomes a number. Any string is composed of characters (J, o, h, n ...) and characters can be interpreted as numbers (in computers). ASCII and UTF standards bind certain values to certain characters, so result is deterministic and always the same on all computers. So the hash function does operation on these characters that processes them as numbers and comes up with another number (output). You could for example simply sum all the values and use modulo operation to range-limit the resulting value.
This would be quite a horrible hashing function because for example "ab" and "ba" would get same result. Design of hash function is difficult and so one should use some ready-made algorithm unless situation dictates some other solution.
There's a really good article on how hash functions (and colision detection/resolution) on MSDN:
Part 2: The Queue, Stack, and Hashtable
You can skip down to the header Compressing Ordinal Indexing with a Hash Function
There are some bits and pieces that are .NET specific (when they talk about which Hash algorithm .NET uses by default) but for the most part it is language agnostic.
All that is required of a hash function is that it returns the same integer given the same key. Technically, a hash function that always returns '1' is not incorrect.