Is hashing a good way to compare two 32x64 two dimensional arrays exactly? - hash

I'm trying to implement Conway's Game of Life on an embedded device. I've only got 1kb of RAM to play with and in total there are 2048 cells which equals 512 bytes. I'm going to calculate the next generation 8x8 cells at a time so that I don't have to store two generations in RAM at any one point.
However what I would also like to do is detect when the GoL is stuck in a loop/static state. When I created a mockup on a PC I simply stored the last hundreth and thousandth generation and compared the current generation to it. I can't do this with 1kb of RAM, what I am thinking of doing is simply calculating a hash of the last x generation and comparing the hash of that to the hash of the current generation.
There are some very light implementations of XTEA or SHA1 but I'm not sure if hashing is really fit for this purpose since I need to determine if each individual cell in both generations are equal. What would you recommend?
Thanks,
Joe
EDIT: Just thinking, I could actually count the number of matches and if it reaches a certain threshold then assume that it is in a loop, that wouldn't work very well for patterns that recur every thousand generations or so though.

I think it's quite good choice. The probability for hash collision is so low, hence it's acceptable for application as yours, it is not nuclear reactor.

Hashing is good to tell when things are not equal. If the hashes are equal, then you still need (well ought) to do the individual comparison.

I decided to just get a device with more RAM but one thing that I observed is that if there is a pattern then the same pattern will be matched every x generations whilst if it was just a random hash collision then it won't. So if we have the following generations:
123*
231
312
123*
231
312
123*
123 gets matched every 3 generations. This wouldn't occur with a hash collision.

Related

Choosing a minimum hash size for a given allowable number of collisions

I am parsing a large amount of network trace data. I want to split the trace into chunks, hash each chunk, and store a sequence of the resulting hashes rather than the original chunks. The purpose of my work is to identify identical chunks of data - I'm hashing the original chunks to reduce the data set size for later analysis. It is acceptable in my work that we trade off the possibility that collisions occasionally occur in order to reduce the hash size (e.g. 40 bit hash with 1% misidentification of identical chunks might beat 60 bit hash with 0.001% misidentification).
My question is, given a) number of chunks to be hashed and b) allowable percentage of misidentification, how can one go about choosing an appropriate hash size?
As an example:
1,000,000 chunks to be hashed, and we're prepared to have 1% misidentification (1% of hashed chunks appear identical when they are not identical in the original data). How do we choose a hash with the minimal number of bits that satisifies this?
I have looked at materials regarding the Birthday Paradox, though this is concerned specifically with the probability of a single collision. I have also looked at materials which discuss choosing a size based on an acceptable probability of a single collision, but have not been able to extrapolate from this how to choose a size based on an acceptable probability of n (or fewer) collisions.
Obviously, the quality of your hash function matters, but some easy probability theory will probably help you here.
The question is what exactly are you willing to accept, is it good enough that you have an expected number of collisions at only 1% of the data? Or, do you demand that the probability of the number of collisions going over some bound be something? If its the first, then back of the envelope style calculation will do:
Expected number of pairs that hash to the same thing out of your set is (1,000,000 C 2)*P(any two are a pair). Lets assume that second number is 1/d where d is the the size of the hashtable. (Note: expectations are linear, so I'm not cheating very much so far). Now, you say you want 1% collisions, so that is 10000 total. Well, you have (1,000,000 C 2)/d = 10,000, so d = (1,000,000 C 2)/10,000 which is according to google about 50,000,000.
So, you need a 50 million ish possible hash values. That is a less than 2^26, so you will get your desired performance with somewhere around 26 bits of hash (depending on quality of hashing algorithm). I probably have a factor of 2 mistake in there somewhere, so you know, its rough.
If this is an offline task, you cant be that space constrained.
Sounds like a fun exercise!
Someone else might have a better answer, but I'd go the brute force route, provided that there's ample time:
Run the hashing calculation using incremental hash size and record the collision percentage for each hash size.
You might want to use binary search to reduce the search space.

Is there a collision rate difference between one 32-bit hash vs two 16 bit hashes?

I am working on a system where hash collisions would be a problem. Essentially there is a system that references items in a hash-table+tree structure. However the system in question first compiles text-files containing paths in the structure into a binary file containing the hashed values instead. This is done for performance reasons. However because of this collisions are very bad as the structure cannot store 2 items with the same hash value; the part asking for an item would not have enough information to know which one it needs.
My initial thought is that 2 hashes, either using 2 different algorithms, or the same algorithm twice, with 2 salts would be more collision resistant. Two items having the same hash for different hashing algorithms would be very unlikely.
I was hoping to keep the hash value 32-bits for space reasons, so I thought I could switch to using two 16-bit algorithms instead of one 32-bit algorithm. But that would not increase the range of possible hash values...
I know that switching to two 32-bit hashes would be more collision resistant, but I am wondering if switching to 2 16-bit hashes has at least some gain over a single 32-bit hash? I am not the most mathematically inclined person, so I do not even know how to begin checking for an answer other than to bruit force it...
Some background on the system:
Items are given names by humans, they are not random strings, and will typically be made of words, letters, and numbers with no whitespace. It is a nested hash structure, so if you had something like { a => { b => { c => 'blah' }}} you would get the value 'blah' by getting value of a/b/c, the compiled request would be 3 hash values in immediate sequence, the hashe values of a, b, and then c.
There is only a problem when there is a collision on a given level. A collision between an item at the top level and a lower level is fine. You can have { a => {a => {...}}}, almost guaranteeing collisions that are on different levels (not a problem).
In practice any given level will likely have less than 100 values to hash, and none will be duplicates on the same level.
To test the hashing algorithm I adopted (forgot which one, but I did not invent it) I downloaded the entire list of CPAN Perl modules, split all namespaces/modules into unique words, and finally hashed each one searching for collisions, I encountered 0 collisions. That means that the algorithm has a different hash value for each unique word in the CPAN namespace list (Or that I did it wrong). That seems good enough to me, but its still nagging at my brain.
If you have 2 16 bit hashes, that are producing uncorrelated values, then you have just written a 32-bit hash algorithm. That will not be better or worse than any other 32-bit hash algorithm.
If you are concerned about collisions, be sure that you are using a hash algorithm that does a good job of hashing your data (some are written to merely be fast to compute, this is not what you want), and increase the size of your hash until you are comfortable.
This raises the question of the probability of collisions. It turns out that if you have n things in your collection, there are n * (n-1) / 2 pairs of things that could collide. If you're using a k bit hash, the odds of a single pair colliding are 2-k. If you have a lot of things, then the odds of different pairs colliding is almost uncorrelated. This is exactly the situation that the Poisson distribution describes.
Thus the number of collisions that you will see should approximately follow the Poisson distribution with λ = n * (n-1) * 2-k-1. From that the probability of no hash collisions is about e-λ. With 32 bits and 100 items, the odds of a collision in one level are about 1.1525 in a million. If you do this enough times, with enough different sets of data, eventually those one in a million chances will add up.
But note that you have many normal sized levels and a few large ones, the large ones will have a disproportionate impact on your risk of collision. That is because each thing you add to a collection can collide with any of the preceeding things - more things equals higher risk of collision. So, for instance, a single level with 1000 data items has about 1 chance in 10,000 of failing - which is about the same risk as 100 levels with 100 data items.
If the hashing algorithm is not doing its job properly, your risk of collision will go up rapidly. How rapidly depends very much on the nature of the failure.
Using those facts and your projections for what the usage of your application is, you should be able to decide whether you're comfortable with the risk from 32-bit hashes, or whether you should move up to something larger.

When generating a SHA256 / 512 hash, is there a minimum 'safe' amount of data to hash?

I have heard that when creating a hash, it's possible that if small files or amounts of data are used, the resulting hash is more likely to suffer from a collision. If that is true, is there a minimum "safe" amount of data that should be used to ensure this doesn't happen?
I guess the question could also be phrased as:
What is the smallest amount of data that can be safely and securely hashed?
A hash function accepts inputs of arbitrary (or at least very high) length, and produces a fixed-length output. There are more possible inputs than possible outputs, so collisions must exist. The whole point of a secure hash function is that it is "collision resistant", which means that while collisions must mathematically exist, it is very very hard to actually compute one. Thus, there is no known collision for SHA-256 and SHA-512, and the best known methods for computing one (by doing it on purpose) are so ludicrously expensive that they will not be applied soon (the whole US federal budget for a century would buy only a ridiculously small part of the task).
So, if it cannot be realistically done on purpose, you can expect not to hit a collision out of (bad) luck.
Moreover, if you limit yourself to very short inputs, there is a chance that there is no collision at all. E.g., if you consider 12-byte inputs: there are 296 possible sequences of 12 bytes. That's huge (more than can be enumerated with today's technology). Yet, SHA-256 will map each input to a 256-bit value, i.e. values in a much wider space (of size 2256). We cannot prove it formally, but chances are that all those 296 hash values are distinct from each other. Note that this has no practical consequence: there is no measurable difference between not finding a collision because there is none, and not finding a collision because it is extremely improbable to hit one.
Just to illustrate how low risks of collision are with SHA-256: consider your risks of being mauled by a gorilla escaped from a local zoo or private owner. Unlikely? Yes, but it still may conceivably happen: it seems that a gorilla escaped from the Dallas zoo in 2004 and injured four persons; another gorilla escaped from the same zoo in 2010. Assuming that there is only one rampaging gorilla every 6 years on the whole Earth (not only in the Dallas area) and you happen to be the unlucky chap who is on his path, out of a human population of 6.5 billions, then risks of grievous-bodily-harm-by-gorilla can be estimated at about 1 in 243.7 per day. Now, take 10 thousands of PC and have them work on finding a collision for SHA-256. The chances of hitting a collision are close to 1 in 275 per day -- more than a billion less probable than the angry ape thing. The conclusion is that if you fear SHA-256 collisions but do not keep with you a loaded shotgun at all times, then you are getting your priorities wrong. Also, do not mess with Texas.
There is no minimum input size. SHA-256 algorithm is effectively a random mapping and collision probability doesn't depend on input length. Even a 1 bit input is 'safe'.
Note that the input is padded to a multiple of 512 bits (64 bytes) for SHA-256 (multiple of 1024 for SHA-512). Taking a 12 byte input (as Thomas used in his example), when using SHA-256, there are 2^96 possible sequences of length 64 bytes.
As an example, a 12 byte input Hello There! (0x48656c6c6f20546865726521) will be padded with a one bit, followed by 351 zero bits followed by the 64 bit representation of the length of the input in bits which is 0x0000000000000060 to form a 512 bit padded message. This 512 bit message is used as the input for computing the hash.
More details can be found in RFC: 4634 "US Secure Hash Algorithms (SHA and HMAC-SHA)", http://www.ietf.org/rfc/rfc4634.txt
No, message length does not effect the likeliness of a collision.
If that were the case, the algorithm is broken.
You can try for yourself by running SHA against all one-byte inputs, then against all two-byte inputs and so on, and see if you get a collision. Probably not, because no one has ever found a collision for SHA-256 or SHA-512 (or at least they kept it a secret from Wikipedia)
Τhe hash is 256 bits long, there is a collision for anything longer than 256bits.
Υou cannot compress something into a smaller thing without having collisions, its defying mathmatics.
Yes, because of the algoritm and the 2 to the power of 256 there is a lot of different hashes, but they are not collision free, that is impossible.
Depends very much on your application: if you were simply hashing "YES" and "NO" strings to send across a network to indicate whether you should give me a $100,000 loan, it would be a pretty big failure -- the domain of answers can't be that large, so someone could easily check observed hashes on the wire against a database of 'small input' hash outputs.
If you were to include the date, time, my name, my tax ID, the amount requested, the amount of data being hashed probably won't amount to much, but the chances of that data being in precomputed hash tables is pretty slim.
But I know of no research to point you to beyond my instincts. Sorry.

Hash function combining - is there a significant decrease in collision risk?

Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
This doesn't make sense combining them in series like that. You are hashing one 32-bit space to another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it can not get any better, and can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is
going to be some chance of false
positives. However, you can reduce
these chances by a considerable margin
by using two different hashing
algorithms. If you were to calculate
and store both the CRC32 and the
Alder32 for each url, the odds of a
simultaneous collision for both hashes
for any given pair of urls is vastly
reduced.
Of course that means storing twice as
much information which is a part of
your original problem. However, there
is a way of storing both sets of hash
data such that it requires minimal
memory (10kb or so) whilst giving
almost the same lookup performance (15
microsecs/lookup compared to 5
microsecs) as Perl's hashes.

Uniquely identifying URLs with one 64-bit number

This is basically a math problem, but very programing related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what kind of collision frequency should I expect?
How does the answer change if I only have 100 million URLs?
It seems to me that collisions will be extremely rare, but these things tend to be confusing.
Would I be better off using something other than MD5? Mind you, I'm not looking for security, just a good fast hash function. Also, native support in MySQL is nice.
EDIT: not quite a duplicate
If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean you'd get collisions for every 2^32 URL's. In other words, the probability of a collision is the number of URL's divided by 4,294,967,296. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.
I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.
You have tagged this as "birthday-paradox", I think you know the answer already.
P(Collision) = 1 - (2^64)!/((2^64)^n (1 - n)!)
where n is 1 billion in your case.
You will be a bit better using something other then MD5, because MD5 have pratical collusion problem.
From what I see, you need a hash function with the following requirements,
Hash arbitrary length strings to a 64-bit value
Be good -- Avoid collisions
Not necessarily one-way (security not required)
Preferably fast -- which is a necessary characteristic for a non-security application
This hash function survey may be useful for drilling down to the function most suitable for you.
I will suggest trying out multiple functions from here and characterizing them for your likely input set (pick a few billion URL that you think you will see).
You can actually generate another column like this test survey for your test URL list to characterize and select from the existing or any new hash functions (more rows in that table) that you might want to check. They have MSVC++ source code to start with (reference to ZIP link).
Changing the hash functions to suit your output width (64-bit) will give you a more accurate characterization for your application.
If you have 2^n hash possibilities, there's over a 50% chance of collision when you have 2^(n/2) items.
E.G. if your hash is 64 bits, you have 2^64 hash possibilities, you'd have a 50% chance of collision if you have 2^32 items in a collection.
Just by using a hash, there is always a chance of collisions. And you don't know beforehand wether collisions will happen once or twice, or even hundreds or thousands of times in your list of urls.
The probability is still just a probability. Its like throwing a dice 10 or 100 times, what are the chances of getting all sixes? The probability says it is low, but it still can happen. Maybe even many times in a row...
So while the birthday paradox shows you how to calculate the probabilities, you still need to decide if collisions are acceptable or not.
...and collisions are acceptable, and hashes are still the right way to go; find a 64 bit hashing algorithm instead of relying on "half-a-MD5" having a good distribution. (Though it probably has...)