Wikipedia page for rainbow tables says:
"this use of multiple reduction functions approximately doubles the speed of lookups."
Assuming the "Average" position in the chain, we take a hash and run it through a 9 iteration chain...
The original table runs it through 4 reductions and 4 hashes and finds the end of the chain, then looks it up for another 5 hashes 5 reductions... total 9 hashes 9 reductions
The rainbow table runs it through Rk-1, Rk-2, Rk-3, and Rk-4 calculations to find the end of the chain, then another 5 hashes 5 reductions to get the plaintext: total 15 hashes 15 reductions...
What am I missing here? By my math the only time a rainbow lookup is even the same speed as a normal table is when the hash just happens to be at the very end of the chain... In fact the RT should be incrementally slower the further towards the beginning the hash lies...
A 5k chain with the hash at the beginning should be approx 2500 times slower with rainbow tables than with normal hash tables...
Am I missing something or did Wikipedia make a mistake? (The paper referenced on that page (Page 13) would also be wrong, so I'm leaning towards the former)
The purpose of rainbow tables isn't necessarily to be faster, but to reduce space.
Rainbow tables trade speed for size.
Storing hashes for all possible 10-digit passwords, for example, would be prohibitively expensive in terms of disk space. You also need to consider that, since the dictionary space is so large, lookups would require significant paging (a very slow operation).
Rainbow tables are more CPU intensive, but they are much, much smaller, requiring less disk space and allowing more of the potential dictionary space to sit in memory at one time. Keep in mind that in the real world this means higher potential performance on large dictionary spaces due to less paging (disk reads are prohibitively slow).
Here is a better illustrated example:
http://kestas.kuliukas.com/RainbowTables/
Of course, this is all academic. Rainbow tables provide no value against well-designed security systems:
1) Use a cryptographically secure algorithm (no "rolling your own").
2) Use a key derivation function (with thousands of iterations) to slow an attacker's hash throughput.
3) Use a large (32- to 64-bit) random salt. Rainbow tables can then no longer be precomputed, nor can that computation be reused against any other system (unless they happen to share the same salt).
4) If possible, use a different salt per record, making rainbow tables completely useless; see the sketch after this list.
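To make points 2 to 4 concrete, here is a minimal Python sketch (my own illustration; the iteration count and salt length are arbitrary placeholders, not recommendations):

import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 200_000) -> tuple[bytes, bytes]:
    """Derive a password hash with a per-record random salt (PBKDF2-HMAC-SHA256)."""
    salt = os.urandom(16)  # random salt, unique per record: no precomputed table applies
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, key       # store both alongside the record

def verify_password(password: str, salt: bytes, stored_key: bytes,
                    iterations: int = 200_000) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, stored_key)  # constant-time comparison

Because every record gets its own salt, an attacker would have to build a separate table per record, which defeats the whole point of precomputation.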
All the answers are in the original paper. First of all, note that you have to compare a single rainbow table with t classical tables, t being the number of elements in a chain. Indeed, each column in the rainbow table acts like a single classical table (e.g. if you have two identical elements in a column of a rainbow table you will have a merge, just as you have a merge if you have two identical elements in a classical table).
Then you see that searching t classical tables would need t^2 operations if you have to go through all the tables (t tables with chains of length t). Searching the single rainbow table needs 1+2+3+...+t operations, which is approximately t^2/2. So in the worst case, where you don't find the password, you will be two times faster. Now if the password shows up on average after you have gone through half of the tables or columns, then it will be 4 times faster. If you want a high probability of success (e.g. 99%), then on average a password will already show up after 10% of the table, making rainbow tables 20x faster.
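To make those counts concrete, here is a small Python sketch of the worst-case operation counts described above (my own illustration, not taken from the paper):

def classical_lookup_ops(t: int) -> int:
    # t classical tables with chains of length t: up to t operations per table
    return t * t

def rainbow_lookup_ops(t: int) -> int:
    # Single rainbow table: testing the last column costs 1 operation,
    # the one before it 2, ..., the first column t, so 1+2+...+t = t(t+1)/2 ~ t^2/2
    return t * (t + 1) // 2

for t in (100, 1000, 5000):
    c, r = classical_lookup_ops(t), rainbow_lookup_ops(t)
    print(f"t={t}: classical={c}, rainbow={r}, worst-case speedup={c / r:.2f}x")

The ratio approaches 2x in the worst case, which is exactly the Wikipedia claim once you compare against t classical tables rather than a single one.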
I have to implement a MERGE statement in Snowflake. There will be more than 6 billion rows in the target table, and around 20 columns in the comparison. I was thinking of generating a hash key using the HASH function over all 20 columns in Snowflake.
But I read in the HASH documentation that after about 4 billion rows you are likely to get duplicate hash keys. Is my understanding correct?
So should I avoid a hash key for comparing records and use all the columns instead?
Or can I use MD5 (128-bit hex) or some customized hash function? Kindly suggest.
TL;DR version of this: With your number of rows, using the HASH function gives you a 62% chance that two rows' hash values will collide. Using MD5 instead of HASH will reduce your chances of a collision to a tiny fraction of a percent. On the same size warehouse, it will require about 24% more time to calculate the MD5s instead of the hashes. Recommendations: If very rare collisions are not tolerable, match on either 1) MD5 or 2) hash and column compare. Option 2 will be faster, but will require more maintenance if the schema changes over time.
The topic of using hashes in merges merits its own position paper, and I'm sure many have been written. We'll focus on your specifics and how Snowflake responds.
Let's start with the section of the docs you reference. When unique inputs lead to identical hash values, it's called a collision. It's a classic problem in mathematics and computation known as the Birthday Problem (https://en.wikipedia.org/wiki/Birthday_problem). There's a ton of writing on the subject, so we'll stick to what's relevant to your situation.
If you use a 64-bit hash on your 6 billion row table the probability of a collision is about 62%. Still, it's a manageable problem as we'll explore later.
If you use a 128-bit hash such as MD5 on 6 billion inputs the probability rounds to zero. Even if your table grows to 1000 times as many rows the probability of a collision would be 0.0000000000053%.
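Both of those figures come straight from the standard birthday-problem approximation p ~ 1 - exp(-n^2 / 2d), where n is the row count and d the number of possible hash values. A quick Python sketch (my own check, not part of the original answer):

import math

def collision_probability(n_rows: float, hash_bits: int) -> float:
    """Birthday-problem approximation: P(at least one collision among n_rows hashes)."""
    d = 2.0 ** hash_bits
    return 1.0 - math.exp(-(n_rows ** 2) / (2.0 * d))

print(collision_probability(6e9, 64))    # ~0.62     (64-bit HASH, 6 billion rows)
print(collision_probability(6e12, 128))  # ~5.3e-14  (128-bit MD5, 1000x as many rows)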
While superficially that seems to get around the collision problem, it introduces a new problem: the MD5 function is more computationally expensive than the HASH function. We can determine how much more through some simple tests on a Medium-sized warehouse.
select count(md5((concat(*))))
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10000"."ORDERS"; -- 18m 41s
select count(hash((concat(*))))
from "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF10000"."ORDERS"; -- 15m 6s
I used count to eliminate the results collection time, but it's still calculating the MD5s and hashes. MD5 takes ~24% longer to calculate than hash.
This brings us to the final part of the TL;DR discussion: using the HASH function plus column compares. This is the faster option, and the only one that guarantees no collisions. It's faster because of how this pseudocode expression is evaluated:
condition1 AND condition2
In this expression, if the first part fails there's no need to test the second part. I haven't tested this experimentally (yet) in a Snowflake MERGE match clause, but I see no reason it would evaluate the second part of the expression when the hashes already disagree. That way practically all row comparisons are resolved quickly by the hash alone, and only the rare pairs whose hashes match also have to compare the columns.
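As a rough illustration of the short-circuit idea (plain Python with made-up column names, not Snowflake MERGE syntax):

KEY_COLUMNS = ["col1", "col2", "col3"]  # stand-ins for the ~20 compared columns

def rows_match(source_row: dict, target_row: dict) -> bool:
    # The cheap hash comparison runs first; the column-by-column compare
    # only runs for the rare pairs whose hashes are already equal.
    return (source_row["row_hash"] == target_row["row_hash"]
            and all(source_row[c] == target_row[c] for c in KEY_COLUMNS))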
One final thought: the fewer rows you merge each time relative to the size of the table, the less the extra computation time for MD5 matters. You've already "paid" the extra computation time for the MD5 values sitting in the table. If you're merging a few thousand rows, the 24% extra time to calculate MD5 is inconsequential and saves you from having to maintain a column list in your match clause.
I've heard quite a few times people talking about kdb dealing with millions of rows in nearly no time. Why is it that fast? Is that solely because the data is all organized in memory?
Another thing: are there alternatives to this? Do any big database vendors provide in-memory databases?
A quick Google search came up with the answer:
Many operations are more efficient with a column-oriented approach. In particular, operations that need to access a sequence of values from a particular column are much faster. If all the values in a column have the same size (which is true, by design, in kdb), things get even better. This type of access pattern is typical of the applications for which q and kdb are used.
To make this concrete, let's examine a column of 64-bit, floating point numbers:
q).Q.w[] `used
108464j
q)t: ([] f: 1000000 ? 1.0)
q).Q.w[] `used
8497328j
q)
As you can see, the memory needed to hold one million 8-byte values is only a little over 8MB. That's because the data are being stored sequentially in an array. To clarify, let's create another table:
q)u: update g: 1000000 ? 5.0 from t
q).Q.w[] `used
16885952j
q)
Both t and u are sharing the column f. If q organized its data in rows, the memory usage would have gone up another 8MB. Another way to confirm this is to take a look at k.h.
Now let's see what happens when we write the table to disk:
q)`:t/ set t
`:t/
q)\ls -l t
"total 15632"
"-rw-r--r-- 1 kdbfaq staff 8000016 May 29 19:57 f"
q)
16 bytes of overhead. Clearly, all of the numbers are being stored sequentially on disk. Efficiency is about avoiding unnecessary work, and here we see that q does exactly what needs to be done when reading and writing a column - no more, no less.
OK, so this approach is space efficient. How does this data layout translate into speed?
If we ask q to sum all 1 million numbers, having the entire list packed tightly together in memory is a tremendous advantage over a row-oriented organization, because we'll encounter fewer misses at every stage of the memory hierarchy. Avoiding cache misses and page faults is essential to getting performance out of your machine.
Moreover, doing math on a long list of numbers that are all together in memory is a problem that modern CPU instruction sets have special features to handle, including instructions to prefetch array elements that will be needed in the near future. Although those features were originally created to improve PC multimedia performance, they turned out to be great for statistics as well. In addition, the same synergy of locality and CPU features enables column-oriented systems to perform linear searches (e.g., in where clauses on unindexed columns) faster than indexed searches (with their attendant branch prediction failures) up to astonishing row counts.
Source: http://www.kdbfaq.com/kdb-faq/tag/why-kdb-fast
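To see the same locality effect outside of q, here's a rough Python/NumPy sketch (my own illustration, not from the kdb FAQ): summing a contiguous column beats walking row objects by a wide margin on most machines.

import time
import numpy as np

n = 1_000_000
column = np.random.rand(n)                          # contiguous 8-byte floats (~8 MB)
rows = [{"f": float(x), "g": 0.0} for x in column]  # row-oriented equivalent

t0 = time.perf_counter()
columnar_sum = column.sum()            # sequential scan over packed memory
t1 = time.perf_counter()
row_sum = sum(r["f"] for r in rows)    # pointer chasing through a million dicts
t2 = time.perf_counter()

print(f"columnar: {t1 - t0:.4f}s  row-oriented: {t2 - t1:.4f}s")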
As for speed, the in-memory layout does play a big part, but there are several other things: fast reads from disk for the HDB, splayed tables, etc. From personal experience I can say you can get pretty good speeds from C++, provided you want to write that much code. With kdb you get all that and some more.
Another thing about speed is the speed of coding. It's a steep learning curve, but once you get it, complex problems can be coded in minutes.
For alternatives you can look at OneTick, or search for other in-memory databases.
kdb is fast but really expensive. Plus, it's a pain to learn Q. There are a few alternatives such as DolphinDB, Quasardb, etc.
I would like to store hashes for approximately 2 billion strings. For that purpose I would like to use as little storage as possible.
Consider an ideal hashing algorithm which returns the hash as a series of hexadecimal digits (like an MD5 hash).
As far as I understand, this means that I need the hash to be no less and no more than 8 symbols in length, because such a hash would be capable of addressing 4+ billion (16^8 = 4,294,967,296) distinct strings.
So I'd like to know whether it is safe to cut a hash to a certain length to save space?
(Hashes, of course, should not collide.)
Yes/No/Maybe: I would appreciate answers with explanations or links to related studies.
P.S. I know I could test whether an 8-character hash is enough for 2 billion strings, but I would need to compare 2 billion hashes with their 2 billion truncated versions. That doesn't seem trivial to me, so I'd rather ask before I do that.
The hash is a number, not a string of hexadecimal characters. In the case of MD5 it is 128 bits, or 16 bytes, stored in efficient form. If your problem still applies, you can certainly consider truncating the number (by either coercing it into a smaller word or bit-shifting it first). Good hash algorithms distribute values evenly across all bits.
Addendum:
Generally, whenever you deal with hashes you want to check whether the strings really match; this takes care of the possibility of colliding hashes. The more you cut the hash, the more collisions you're going to get, so it's good to plan for that happening at this phase.
Whether or not it's safe to store x values in a hash domain only capable of representing 2x distinct hash values depends entirely on whether you can tolerate collisions.
Hash functions are effectively random number generators, so your 2 billion calculated hash values will be distributed evenly across the 4 billion possible results. This means that you are subject to the Birthday Problem.
In your case, if you calculate 2^31 (2 billion) hashes with only 2^32 (4 billion) possible hash values, the chance of at least two having the same hash (a collision) is very, very nearly 100%. (And the chance of three being the same is also very, very nearly 100%, and so on.) I can't find the formula for calculating the probable number of collisions based on these numbers, but I suspect it is a huge number.
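For what it's worth, the expected number of colliding pairs can be approximated as C(n, 2)/d, where d is the number of possible hash values; a quick Python sketch under that assumption (my numbers, not part of the original answer):

def expected_colliding_pairs(n: float, hash_bits: int) -> float:
    """Expected number of pairs that share a hash: C(n, 2) / 2**hash_bits."""
    return n * (n - 1) / (2.0 * 2.0 ** hash_bits)

# 2 billion (~2^31) strings into a 32-bit (~4 billion value) hash space:
print(expected_colliding_pairs(2 ** 31, 32))  # ~5.4e8: hundreds of millions of colliding pairs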
If in your case hash collisions are not a disaster (such as in Java's HashMap implementation which deals with collisions by turning the hash target into a list of objects which share the same hash key, albeit at the cost of reduced performance) then maybe you can live with the certainty of a high number of collisions. But if you need uniqueness then you need either a far, far larger hash domain, or you need to assign each record a guaranteed-unique serial ID number, depending on your purposes.
Finally, note that Keccak is capable of generating any desired output length, so it makes little sense to spend CPU resources generating a long hash output only to trim it down afterwards. You should be able to tell your Keccak function to give only the number of bits you require. (Also note that a change in Keccak output length does not affect the initial output bits, so the result will be exactly the same as if you did a manual bitwise trim afterwards.)
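In Python, for instance, the SHAKE members of the Keccak family already expose a variable output length; a minimal sketch (the input string is just a placeholder):

import hashlib

data = b"example string to hash"

# Ask the XOF for exactly the number of bytes you need (here 4 bytes = 32 bits)...
short = hashlib.shake_128(data).digest(4)

# ...which is identical to truncating a longer output of the same XOF.
assert hashlib.shake_128(data).digest(32)[:4] == short

print(short.hex())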
I am parsing a large amount of network trace data. I want to split the trace into chunks, hash each chunk, and store a sequence of the resulting hashes rather than the original chunks. The purpose of my work is to identify identical chunks of data - I'm hashing the original chunks to reduce the data set size for later analysis. It is acceptable in my work that we trade off the possibility that collisions occasionally occur in order to reduce the hash size (e.g. 40 bit hash with 1% misidentification of identical chunks might beat 60 bit hash with 0.001% misidentification).
My question is, given a) number of chunks to be hashed and b) allowable percentage of misidentification, how can one go about choosing an appropriate hash size?
As an example:
1,000,000 chunks to be hashed, and we're prepared to have 1% misidentification (1% of hashed chunks appear identical when they are not identical in the original data). How do we choose a hash with the minimal number of bits that satisfies this?
I have looked at materials regarding the Birthday Paradox, though this is concerned specifically with the probability of a single collision. I have also looked at materials which discuss choosing a size based on an acceptable probability of a single collision, but have not been able to extrapolate from this how to choose a size based on an acceptable probability of n (or fewer) collisions.
Obviously, the quality of your hash function matters, but some easy probability theory will probably help you here.
The question is what exactly you are willing to accept: is it good enough to have an expected number of collisions at only 1% of the data? Or do you demand that the probability of the number of collisions going over some bound be below some threshold? If it's the first, a back-of-the-envelope calculation will do:
The expected number of pairs that hash to the same thing out of your set is C(1,000,000, 2) * P(any two specific chunks collide). Let's assume that second number is 1/d, where d is the size of the hash table. (Note: expectations are linear, so I'm not cheating very much so far.) Now, you say you want 1% collisions, so that is 10,000 in total. Well, you have C(1,000,000, 2)/d = 10,000, so d = C(1,000,000, 2)/10,000, which according to Google is about 50,000,000.
So you need around 50 million possible hash values. That is less than 2^26, so you will get your desired performance with somewhere around 26 bits of hash (depending on the quality of the hashing algorithm). I probably have a factor-of-2 mistake in there somewhere, so treat it as rough.
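Here's that back-of-the-envelope calculation as a small Python sketch (same expected-pairs approximation, so it inherits the same roughness):

import math

def required_hash_bits(n_chunks: int, collision_fraction: float) -> int:
    """Smallest hash width (bits) keeping the expected number of colliding
    pairs at or below collision_fraction * n_chunks."""
    allowed_pairs = collision_fraction * n_chunks
    d = n_chunks * (n_chunks - 1) / (2.0 * allowed_pairs)  # required hash-space size
    return math.ceil(math.log2(d))

print(required_hash_bits(1_000_000, 0.01))  # 26, matching the estimate above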
If this is an offline task, you can't be that space constrained.
Sounds like a fun exercise!
Someone else might have a better answer, but I'd go the brute force route, provided that there's ample time:
Run the hashing calculation using incremental hash size and record the collision percentage for each hash size.
You might want to use binary search to reduce the search space.
I am working on a system where hash collisions would be a problem. Essentially there is a system that references items in a hash-table+tree structure. However, the system in question first compiles text files containing paths in the structure into a binary file containing the hashed values instead. This is done for performance reasons. Because of this, collisions are very bad, as the structure cannot store two items with the same hash value; the part asking for an item would not have enough information to know which one it needs.
My initial thought is that two hashes, either using two different algorithms or the same algorithm with two salts, would be more collision resistant. Two items having the same hash for two different hashing algorithms would be very unlikely.
I was hoping to keep the hash value 32-bits for space reasons, so I thought I could switch to using two 16-bit algorithms instead of one 32-bit algorithm. But that would not increase the range of possible hash values...
I know that switching to two 32-bit hashes would be more collision resistant, but I am wondering if switching to two 16-bit hashes has at least some gain over a single 32-bit hash? I am not the most mathematically inclined person, so I do not even know how to begin checking for an answer other than to brute force it...
Some background on the system:
Items are given names by humans; they are not random strings, and will typically be made of words, letters, and numbers with no whitespace. It is a nested hash structure, so if you had something like { a => { b => { c => 'blah' }}} you would get the value 'blah' by getting the value of a/b/c; the compiled request would be 3 hash values in immediate sequence: the hash values of a, b, and then c.
There is only a problem when there is a collision on a given level. A collision between an item at the top level and a lower level is fine. You can have { a => {a => {...}}}, almost guaranteeing collisions that are on different levels (not a problem).
In practice any given level will likely have less than 100 values to hash, and none will be duplicates on the same level.
To test the hashing algorithm I adopted (I forget which one, but I did not invent it), I downloaded the entire list of CPAN Perl modules, split all namespaces/modules into unique words, and finally hashed each one searching for collisions. I encountered 0 collisions. That means the algorithm produces a different hash value for each unique word in the CPAN namespace list (or that I did it wrong). That seems good enough to me, but it's still nagging at my brain.
If you have two 16-bit hashes that produce uncorrelated values, then you have just written a 32-bit hash algorithm. That will not be better or worse than any other 32-bit hash algorithm.
If you are concerned about collisions, be sure that you are using a hash algorithm that does a good job of hashing your data (some are written to merely be fast to compute, this is not what you want), and increase the size of your hash until you are comfortable.
This raises the question of the probability of collisions. It turns out that if you have n things in your collection, there are n * (n-1) / 2 pairs of things that could collide. If you're using a k-bit hash, the odds of a given pair colliding are 2^-k. If you have a lot of things, then whether different pairs collide is almost uncorrelated. This is exactly the situation that the Poisson distribution describes.
Thus the number of collisions that you will see should approximately follow the Poisson distribution with λ = n * (n-1) * 2^(-k-1). From that, the probability of no hash collisions is about e^(-λ). With 32 bits and 100 items, the odds of a collision in one level are about 1.1525 in a million. If you do this enough times, with enough different sets of data, eventually those one-in-a-million chances will add up.
But note that if you have many normal-sized levels and a few large ones, the large ones will have a disproportionate impact on your risk of collision. That is because each thing you add to a collection can collide with any of the preceding things: more things equals higher risk of collision. So, for instance, a single level with 1000 data items has about 1 chance in 10,000 of failing, which is about the same risk as 100 levels with 100 data items each.
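A quick Python sketch of those numbers, using the Poisson approximation above (the inputs are just the worked examples from this answer):

import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """P(at least one collision) via the Poisson approximation, lambda = n(n-1)/2^(k+1)."""
    lam = n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    return 1.0 - math.exp(-lam)

print(collision_probability(100, 32))   # ~1.15e-6: about 1 in a million per level
print(collision_probability(1000, 32))  # ~1.16e-4: roughly 1 in 10,000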
If the hashing algorithm is not doing its job properly, your risk of collision will go up rapidly. How rapidly depends very much on the nature of the failure.
Using those facts and your projections for what the usage of your application is, you should be able to decide whether you're comfortable with the risk from 32-bit hashes, or whether you should move up to something larger.