Using a hash key as the comparison column in a MERGE statement

I have to implement a MERGE statement in Snowflake. The target table will have more than 6 billion rows, and the comparison involves around 20 columns. I was thinking of generating a hash key with Snowflake's HASH function over all 20 columns.
But I read in the HASH documentation that after about 4 billion rows duplicate hash keys become likely. Is my understanding correct?
So should I avoid a hash key for comparing records and use all the columns instead?
Or can I use MD5 (a 128-bit hash) or some custom hash function? Kindly suggest.

TL;DR version of this: With your number of rows, using the HASH function gives you a 62% chance that two rows' hash values will collide. Using MD5 instead of HASH will reduce your chances of a collision to a tiny fraction of a percent. On the same size warehouse, it will require about 24% more time to calculate the MD5s instead of the hashes. Recommendations: If very rare collisions are not tolerable, match on either 1) MD5 or 2) hash and column compare. Option 2 will be faster, but will require more maintenance if the schema changes over time.
The topic of using hashes in merges merits its own position paper, and I'm sure many have been written. We'll focus on your specifics and how Snowflake responds.
Let's start with the section of the docs you reference. When unique inputs lead to identical hash values, it's called a collision. It's a classic problem in mathematics and computation known as the Birthday Problem (https://en.wikipedia.org/wiki/Birthday_problem). There's a ton of writing on the subject, so we'll stick to what's relevant to your situation.
If you use a 64-bit hash on your 6 billion row table the probability of a collision is about 62%. Still, it's a manageable problem as we'll explore later.
If you use a 128-bit hash such as MD5 on 6 billion inputs the probability rounds to zero. Even if your table grows to 1000 times as many rows the probability of a collision would be 0.0000000000053%.
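If you want to sanity-check those two figures, the birthday approximation p ≈ 1 - e^(-n^2 / (2 * 2^b)) for n rows and a b-bit hash can be evaluated right in Snowflake (the second query uses 1 - e^(-x) ≈ x for tiny x, since the full expression underflows double precision):
select 1 - exp(-power(6e9, 2) / (2 * power(2, 64)));   -- ~0.62 for HASH over 6 billion rows
select power(6e12, 2) / (2 * power(2, 128));           -- ~5.3e-14 for MD5 over 6 trillion rows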
While superficially that seems to get around the collision problem, it introduces a new one: the MD5 function is more computationally expensive than the HASH function. We can determine how much more through some simple tests on a Medium-sized warehouse.
select count(md5((concat(*))))
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10000"."ORDERS"; -- 18m 41s
select count(hash((concat(*))))
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10000"."ORDERS"; -- 15m 6s
I used count to eliminate the results collection time, but it's still calculating the MD5s and hashes. MD5 takes ~24% longer to calculate than hash.
This brings us to the final part of the TL;DR discussion: using the hash function plus column compares. This is the faster option, and the only one that completely rules out false matches from collisions. It's faster because of how an expression like the following short-circuits, in pseudocode:
condition1 AND condition2
In this expression, if the first part fails there's no need to evaluate the second part. I haven't tested this experimentally (yet) in a Snowflake merge match clause, but I see no reason it would evaluate the column comparisons once the hash comparison has already failed. That way practically every row pair gets resolved by the cheap hash comparison alone, and only the rare rows whose hashes actually match also have to compare the columns.
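For what it's worth, here's a minimal sketch of what such a match clause could look like; the table, column, and hash-column names below are made up, and row_hash is assumed to be precomputed over the 20 comparison columns on both sides:
merge into target_table t
using staging_table s
   on t.row_hash = s.row_hash        -- cheap 64-bit comparison resolves almost every row
  and t.col_a = s.col_a              -- column compares only matter once the hashes already match
  and t.col_b = s.col_b              -- ...and so on for the rest of the 20 columns
when matched then update set t.col_c = s.col_c
when not matched then insert (row_hash, col_a, col_b, col_c)
  values (s.row_hash, s.col_a, s.col_b, s.col_c);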
One final thought: the fewer rows you merge each time relative to the size of the table, the less the extra computation time for MD5 matters. You've already "paid" for the extra computation time for the MD5 values sitting in the table. If you're merging a few thousand rows, the 24% extra time to calculate MD5 is inconsequential and saves you from having to maintain a column list in your match clause.


Don't you get a random number after doing modulo on a hashed number?

I'm trying to understand hash tables, and from what I've seen the modulo operator is used to select which bucket a key will be placed in. I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation. Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
I'm sure I'm misunderstanding this. Can someone explain?
Don't you get a random number after doing modulo on a hashed number?
It depends on the hash function.
Say you have an identify hash for numbers - h(n) = n - then if the keys being hashed are generally incrementing numbers (perhaps with an occasional ommision), then after hashing they'll still generally hit successive buckets (wrapping at some point from the last bucket back to the first), with low collision rates overall. Not very random, but works out well enough. If the keys are random, it still works out pretty well - see the discussion of random-but-repeatable hashing below. The problem is when the keys are neither roughly-incrementing nor close-to-random - then an identity hash can provide terrible collision rates. (You might think "this is a crazy bad example hash function, nobody would do this; actually, most C++ Standard Library implementations' hash functions for integers are identity hashes).
On the other hand, if you have a hash function that, say, takes the address of the object being hashed, and the objects are all 8-byte aligned, then if you take the mod and the bucket count is also a multiple of 8, you'll only ever hash to every 8th bucket, with 8 times more collisions than you might expect. Not very random, and it doesn't work out well. But if the number of buckets is a prime, the addresses will tend to scatter much more randomly over the buckets, and things will work out much better. This is the reason the GNU C++ Standard Library tends to use prime numbers of buckets (Visual C++ uses power-of-two bucket counts so it can use a bitwise AND to map hash values to buckets, as AND takes one CPU cycle while MOD can take e.g. 30-40 cycles, depending on your exact CPU).
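To make that concrete with small numbers: with 8-byte-aligned addresses and 64 buckets, address % 64 can only ever be 0, 8, 16, ..., 56, so every key crowds into 8 of the 64 buckets; with 61 buckets (a prime, coprime with 8), the same addresses cycle through every residue from 0 to 60 and spread across all the buckets.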
When all the inputs are known at compile time, and there aren't too many of them, then it's generally possible to create a perfect hash function (the GNU gperf software is designed specifically for this), which means it will work out the number of buckets you'll need and a hash function that avoids any collisions, but the hash function may take longer to run than a general-purpose function.
People often have a fanciful notion - also seen in the question - that a "perfect hash function", or at least one with very few collisions, over some large numerical hashed-to range will automatically give minimal collisions in actual usage in a hash table; indeed, this question is largely about coming to grips with why that notion is false. It's just not true if there are still patterns and probabilities in the way the keys map into that large hashed-to range.
The gold standard for a general purpose high-quality hash function for runtime inputs is to have a quality that you might call "random but repeatable", even before the modulo operation, as that quality will apply to the bucket selection as well (even using the dumber and less forgiving AND bit-masking approach to bucket selection).
As you've noticed, this does mean you'll see collisions in the table. If you can exploit patterns in the keys to get fewer collisions than this random-but-repeatable quality would give you, then by all means make the most of that. If not, the beauty of hashing is that with random-but-repeatable hashing your collisions are statistically related to your load factor (the number of stored elements divided by the number of buckets).
As an example, for separate chaining - when your load factor is 1.0, 1/e (~36.8%) of buckets will tend to be empty, another 1/e (~36.8%) have one element, 1/(2e) or ~18.4% two elements, 1/(3!e) about 6.1% three elements, 1/(4!e) or ~1.5% four elements, 1/(5!e) ~.3% have five etc.. - the average chain length from non-empty buckets is ~1.58 no matter how many elements are in the table (i.e. whether there are 100 elements and 100 buckets, or 100 million elements and 100 million buckets), which is why we say lookup/insert/erase are O(1) constant time operations.
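For reference, those percentages are just the Poisson distribution with mean equal to the load factor: at load factor 1.0, the fraction of buckets holding exactly k elements is e^(-1)/k!, which gives 1/e for k = 0 and k = 1, 1/(2e) for k = 2, and so on, and the average over non-empty buckets works out to 1/(1 - 1/e) ≈ 1.58.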
I know that hash algorithms are supposed to minimize the same result for different inputs, however I don't understand how the same results for different inputs can be minimal after the modulo operation.
This is still true post-modulo. Minimising "the same result for different inputs" means each post-modulo value ends up with (about) the same number of keys mapping to it. We're particularly concerned with the keys actually stored in the table, especially if there's a non-uniform statistical distribution in which keys get used. With a hash function that exhibits the random-but-repeatable quality, there will be random variation in the post-modulo mapping, but overall the buckets will be close enough to evenly balanced for most practical purposes.
Just to recap, let me address this directly:
Let's just say we have a near-perfect hash function that gives a different hashed value between 0 and 100,000, and then we take the result modulo 20 (in our example we have 20 buckets), isn't the resulting number very close to a random number between 0 and 19? Meaning roughly the probability that the final result is any of a number between 0 and 19 is about 1 in 20? If this is the case, then the original hash function doesn't seem to ensure minimal collisions because after the modulo operation we end up with something like a random number? I must be wrong, but I'm thinking that what ensures minimal collisions the most is not the original hash function but how many buckets we have.
So:
random is good: if you get something like the random-but-repeatable hash quality, then your average hash collisions will statistically be capped at low levels, and in practice you're unlikely to ever see a particularly horrible collision chain, provided you keep the load factor reasonable (e.g. <= 1.0)
that said, your "near-perfect hash function...between 0 and 100,000" may or may not be high quality, depending on whether the distribution of values has patterns in it that would produce collisions. When in doubt about such patterns, use a hash function with the random-but-repeatable quality.
What would happen if you took a random number instead of using a hash function? Then doing the modulo on it? If you call rand() twice you can get the same number - a proper hash function doesn't do that I guess, or does it? Even hash functions can output the same value for different input.
This comment shows you grappling with the desirability of randomness - hopefully with earlier parts of my answer you're now clear on this, but anyway the point is that randomness is good, but it has to be repeatable: the same key has to produce the same pre-modulo hash so the post-modulo value tells you the bucket it should be in.
As an example of random-but-repeatable, imagine you used rand() to populate a uint32_t a[256][8] array; you could then hash any 8-byte key (including e.g. a double) by XORing the random numbers:
#include <cstdint>   // uint32_t, uint8_t
#include <cstring>   // memcpy
static uint32_t a[256][8];   // fill every entry with rand() once at startup
uint32_t h(double d) {
    uint8_t i[8];
    memcpy(i, &d, sizeof d);              // reinterpret the 8 key bytes
    return a[i[0]][0] ^ a[i[1]][1] ^ a[i[2]][2] ^ a[i[3]][3]
         ^ a[i[4]][4] ^ a[i[5]][5] ^ a[i[6]][6] ^ a[i[7]][7];   // one random word per (byte value, position)
}
This would produce a near-ideal (rand() isn't a great quality pseudo-random number generator) random-but-repeatable hash, but having a hash function that needs to consult largish chunks of memory can easily be slowed down by cache misses.
Following on from what [Mureinik] said, assuming you have a perfect hash function, say your array/buckets are 75% full, then doing modulo on the hashed function will probably result in a 75% collision probability. If that's true, I thought they were much better. Though I'm only learning about how they work now.
The 75%/75% thing is correct for a high quality hash function, assuming:
closed hashing / open addressing, where collisions are handled by finding an alternative bucket, or
separate chaining, when 75% of buckets have one or more elements linked from them (which very likely means the load factor - what many people think of when they talk about how "full" the table is - is already significantly more than 75%)
Regarding "I thought they were much better." - that's actually quite ok, as evidenced by the percentages of colliding chain lengths mentioned earlier in my answer.
I think you have the right understanding of the situation.
Both the hash function and the number of buckets affect the chance of collisions. Consider, for example, the worst possible hash function - one that returns a constant value. No matter how many buckets you have, all the entries will be lumped to the same bucket, and you'd have a 100% chance of collision.
On the other hand, if you have a (near) perfect hash function, the number of buckets becomes the main factor in the chance of collision. If your hash table has only 20 buckets, the minimal chance of collision will indeed be 1 in 20 (over time). If the hash values weren't uniformly spread, you'd have a much higher chance of collision in at least one of the buckets. The more buckets you have, the lower the chance of collision. On the other hand, having too many buckets will take up more memory (even if they are empty), and ultimately reduce performance, even if there are fewer collisions.

Fast long calculation

I have the following task.
I have 1 billion or more distinct 20-byte hashes (stored in some database); their total number is less than Java's Long.MAX_VALUE.
After that I have an almost infinite stream of such hashes.
Is there a possibility to create some bijective mapping from the set of these distinct 20-byte hashes to the set of numbers between 0 and Long.MAX_VALUE?
Something like a Lagrange polynomial calculation - but maybe there is something really fast and effective for such a case.
We need a fast long value calculation for each hash from this almost infinite stream.
Each 20-byte hash is just a number.
Before processing the stream we can create a mapping
20-byte | 8-byte
(hash1 1)
....
(hashN N)
After that, when we get the next hash from the infinite stream, we want to obtain its 8-byte value without lookups, using only arithmetical calculations.
Since you gave no practical constraints on size or storage beyond "It has to be fast", I am going to assume you can take your time to pre-process the set of hashes in order to "make it fast". I am further assuming the hashes are distributed randomly and that the mapping to 8-byte numbers is likewise unpredictable.
My first approach would be a local SQLite database. That allows you to use its native BTree indexing to quickly retrieve results. With a large enough page size you can store 256 pointers per BTree node, for an expected log_256(10^9) ≈ 3.74 disk seeks per lookup. This will improve as more of your BTree structures get cached.
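A minimal sketch of that first approach, with made-up table and column names (the primary key gives you the BTree index):
create table hash_to_long (
  hash  blob primary key,   -- the 20-byte hash
  value integer not null    -- the assigned 8-byte number
);
-- each lookup is then one indexed BTree search:
select value from hash_to_long where hash = x'0123456789abcdef0123456789abcdef01234567';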
Second approach, if you have the memory for it: in-memory BTree.
Would it work something like this?
BigInteger aNextHash = Stream.getHash();   // the 20-byte hash as a (positive) number
long aValue = aNextHash.mod(BigInteger.valueOf(Long.MAX_VALUE)).longValue();

The hash table probability

I am still confused about how to find a hash table probability. I have a hash table of size 20 with open addressing that uses the hash function
hash(int x) = x % 20
How many elements need to be inserted in the hash table so that the probability of the next element hitting a collision exceeds 50%?
I used the birthday paradox (https://en.wikipedia.org/wiki/Birthday_problem) to find it and seem to get an incorrect answer. Where is my mistake?
calculating:
1/2 = 1 - e^(-n^2/(2*20))
ln(1/2) = ln(e) * (-n^2/40)
-0.69314718 = -n^2/40
n = sqrt(27.725887) = 5.265538
How many elements need to be inserted in the hash table so that the probability of the next element hitting a collision exceeds 50%.
Well, it depends on a few things.
The simple case is that you've already performed 11 inserts with distinct and effectively random integer keys, such that 11 of the buckets are in use, and your next insertion uses another distinct and effectively random key so it will hash to any bucket with equal probability: clearly there's only a 9/20 chance of that bucket being unused which means your chance of a collision during that 12th insertion exceeds 50% for the first time. This is the answer most formulas, textbooks, people etc. will give you, as it's the most meaningful for situations where hash tables are used with strong hash functions and/or prime numbers of buckets etc. - the scenarios where hash tables shine and are particularly elegant.
Another not-uncommon scenario is that you're putting say customer ids for a business into the hash table, and you're assigning the customers incrementing id numbers starting at 1. Even if you've already inserted customers with ids 1 to 19, you know they're in buckets [1] to [19] with no collisions - your hash just passes the keys through without the mod kicking in. You can then insert customer 20 into bucket [0] (after the mod operation) without a collision. Then, the 21st customer has 100% chance of a collision. (But, if your data's like this, please use an array and index directly using the customer id, or customer_id - 1 if you don't want to waste bucket [0].)
There are many other possible patterns in the keys that can affect when you exceed a 50% probability of a collision: e.g. all the keys being odd or multiples of some value, or being say ages or heights with a particular distribution curve.
The mistake with your use of the Birthday Paradox is thinking it answers your question. When you put "1/2" and "20" into the formula, it tells you the point at which the cumulative probability of a collision reaches 1/2, but your question is about when "the probability of the next element hitting a collision exceeds 50%" (emphasis on next).
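Spelled out with your numbers: the per-insert probability of a collision with k buckets already occupied is k/20, which first exceeds 1/2 at k = 11, i.e. on the 12th insertion in the simple case above. Your birthday-style calculation 1 - e^(-n^2/40) instead estimates the chance that some collision has already happened among the first n insertions; that cumulative probability passes 1/2 at about the 6th insertion (your n ≈ 5.3 comes from the cruder n^2/40 approximation of n(n-1)/40), which is a different question.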

Is there a solution to creating a perfect hash table for non-finite inputs?

So hash tables are really cool for constant-time lookups of data in sets, but as I understand it they are limited by possible hashing collisions, which leads to small increases in time complexity.
It seems to me like any hashing function that supports a non-finite range of inputs is really a heuristic for reducing collisions. Are there any absolute limitations to creating a perfect hash table for any range of inputs, or is it just something that no one has figured out yet?
I think this depends on what you mean by "any range of inputs."
If your goal is to create a hash function that can take in anything and never produce a collision, then there's no way to do what you're asking. This is a consequence of the pigeonhole principle - if you have n objects that can be hashed, you need at least n distinct outputs for your hash function or you're forced to get at least one hash collision. If there are infinitely many possible input objects, then no finite hash table could be built that will always avoid collisions.
On the other hand, if your goal is to build a hash table where lookups are worst-case O(1) (that is, you only have to look at a fixed number of locations to find any element), then there are many different options available. You could use a dynamic perfect hash table or a cuckoo hash table, both of which support worst-case O(1) lookups and expected O(1) insertions and deletions. (In cuckoo hashing, for example, each key can only ever live in one of two candidate buckets, so a lookup never inspects more than two locations.) These hash tables work by using a variety of different hash functions rather than any one fixed hash function, which helps circumvent the above restriction.
Hope this helps!

Wikipedia rainbow tables entry

Wikipedia page for rainbow tables says:
"this use of multiple reduction functions approximately doubles the speed of lookups."
Assuming the "Average" position in the chain, we take a hash and run it through a 9 iteration chain...
The original table runs it through 4 reductions and 4 hashes and finds the end of the chain, then looks it up and rebuilds the chain for another 5 hashes and 5 reductions: a total of 9 hashes and 9 reductions.
The rainbow table runs it through the Rk-1, Rk-2, Rk-3, and Rk-4 calculations to find the end of the chain, then another 5 hashes and 5 reductions to get the plaintext: a total of 15 hashes and 15 reductions...
What am I missing here? By my math the only time a rainbow lookup is even the same speed as a normal table is when the hash just happens to be at the very end of the chain... In fact the RT should be incrementally slower the further towards the beginning the hash lies...
A 5k chain with the hash at the beginning should be approx 2500 times slower with rainbow tables than with normal hash tables...
Am I missing something or did Wikipedia make a mistake? (The paper referenced on that page (Page 13) would also be wrong, so I'm leaning towards the former)
The purpose of rainbow tables isn't to necessarily be faster but to reduce space.
Rainbow tables trade speed for size.
Storing hashes for all possible 10-digit passwords, for example, would be prohibitively expensive in terms of disk space. You also need to consider that since the dictionary space is so large, it would require significant paging (a very slow operation).
Rainbow tables are more CPU intensive, but they are much, much smaller, requiring less disk space and also allowing more of the potential dictionary space in memory at one time. Keep in mind that in the real world this means higher potential performance on large dictionary spaces due to less paging (disk reads are prohibitively slow).
Here is a better illustrated example:
http://kestas.kuliukas.com/RainbowTables/
Of course this is all academic. Rainbow tables provide no value against well designed security systems.
1) Use a cryptographically secure algorithm (no "roll your own")
2) Use a key derivation function (with thousands of iterations) to slow attackers' hash throughput.
3) Use a large (32 to 64 bit) random salt. Rainbow tables can no longer be precomputed, nor can that computation be reused for any other system (unless they happen to share the same salt).
4) If possible, use a different salt per record, thus making rainbow tables completely invalid.
All the answers are in the original paper. First of all, you must see that a single rainbow table has to be compared with t classical tables, t being the number of elements in a chain. Indeed, each column in the rainbow table acts like a single classical table (e.g. if you have two identical elements in a column of a rainbow table you will have a merge, and if you have two identical elements in a classical table you also have a merge).
Then you see that searching in t classical tables would need t^2 operations if you have to go through all the tables (t tables with chains of length t). If you search in the single rainbow table you will need 1+2+3+...+t operations, which is roughly t^2/2. So in the worst case, where you don't find the password, you will be two times faster. Now if the password shows up on average after you have gone through half of the tables or columns, then it will be 4 times faster. If you want a high probability of success (e.g. 99%) then on average a password will already show up after 10% of the table, making rainbow tables 20x faster.
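Putting numbers on it with the 5k chains from the question: t classical tables with chains of length t = 5000 cost up to t * t = 25,000,000 hash/reduce steps in the worst case, while the single equivalent rainbow table costs 1 + 2 + ... + t = t(t+1)/2 ≈ 12,500,000. The 9-versus-15 comparison in the question measures one chain walk in one classical table against a full rainbow lookup; the fair comparison is against all t classical tables that the one rainbow table replaces.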