How can I write a hash function that works for a fraction of arbitrary size, and is equal for different fractions that reduce the same way?
I tried converting it to a double but two fractions that are reduced to the same number may be off by the slightest amount and thus have completely different hashes.
Computing a GCD would defeat the O(1) time of a hash function, and keeping every fraction reduced at all times would slow my program down significantly.
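One option (a sketch of my own, not something from the question) is to hash a/b as the residue a * b^(-1) mod p for a fixed large prime p: any two fractions that reduce to the same value land on the same residue without ever being reduced, and the modular inverse costs about O(log b), the same order as a gcd, but only inside the hash itself. The prime P below is an arbitrary choice, and, as with any hash, unequal fractions can still collide.

# Assumed sketch: denominators must not be divisible by P, which is
# virtually guaranteed for any realistic fraction.
P = (1 << 61) - 1                     # arbitrary large Mersenne prime

def frac_hash(num: int, den: int) -> int:
    inv_den = pow(den % P, -1, P)     # modular inverse (Python 3.8+)
    return (num % P) * inv_den % P

print(frac_hash(1, 2) == frac_hash(2, 4) == frac_hash(-3, -6))  # True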
Related
I am writing a program to decrypt RSA-encrypted messages by factorising the modulus, and I want to test it on some long primes, so I need to use variable-precision arithmetic (VPA). But for some reason, even when I use small numbers, it is significantly slower when I use VPA with a small number of digits. What is going on?
Is computing a hash done in O(1), in O(n), or somewhere in between? Is there any disadvantage to computing the hash of a very large object versus a small one? If it matters, I'm using Python.
Generally speaking, computing a hash will be O(1) for "small" items and O(N) for "large" items (where "N" denotes the size of an item's key). The precise dividing line between small and large varies, but is typically somewhere in the vicinity of the size of a register (e.g., 32 bits on a 32-bit machine, 64 bits on a 64-bit machine). This can also depend on the input type: integer types up to the register size typically hash in constant time, while strings take time proportional to their size in bytes, right down to a single character (i.e., a two-character string takes roughly twice as long to hash as a single-character string).
Once you've computed the hash, accessing the hash table has expected constant complexity, but can be as bad as O(N) in the worst case (but this is a different "N"--the number of items inserted in the table, not the size of an individual key).
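As a rough, informal illustration of that scaling in Python (my own sketch, not part of the answer above): hashing a machine-sized int is effectively constant time, while hashing a container such as a tuple walks its elements, so the cost grows with length. Tuples, unlike str, do not cache their hash in CPython, so each call really recomputes it.

import timeit

# Constant-time: a small integer.
print(timeit.timeit("hash(12345)", number=100_000))

# Linear-time: tuples of growing length (each hash visits every element).
for n in (10, 1_000, 100_000):
    t = timeit.timeit("hash(data)", setup=f"data = tuple(range({n}))", number=1_000)
    print(f"tuple of {n:>7} ints: {t:.4f} s for 1,000 hashes")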
The real answer is: it depends. You didn't specify which hash function you are interested in. When we are talking about a cryptographic hash like SHA-256, the complexity is O(n) in the length of the input. When we are talking about a hash function that takes the last two digits of a phone number, it will be O(1). Hash functions that are used in hash tables tend to be optimized for speed and are thus closer to O(1).
For further reference on hash tables, see this page from the Python wiki on Time Complexity.
Most of the time, computing your hash and accessing the table will be O(1). However, with a really bad hash function where every value gets the same hash code, access degrades to O(n) in the worst case: the more objects that share a hash value, the more collisions there are to scan through.
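To see that worst case concretely, here is a toy example of my own (the class and the sizes are made up): a key type whose __hash__ is constant forces every dict insert to compare against all previously inserted keys, so total insertion time grows roughly quadratically.

import time

class BadKey:
    """Every instance hashes to the same value, so all keys collide."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42                      # constant hash: worst possible choice
    def __eq__(self, other):
        return self.value == other.value

for n in (500, 1_000, 2_000):
    keys = [BadKey(i) for i in range(n)]
    start = time.perf_counter()
    table = {k: None for k in keys}    # each insert probes all earlier keys
    print(f"{n:>5} inserts with a constant hash: {time.perf_counter() - start:.3f} s")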
If I have some data that I hash with SHA-256 like this: hash = SHA256(data)
And then copy only the first 8 bytes of the hash instead of the whole 32 bytes, how easy is it to find a hash collision with different data? Is it 2^64 or 2^32?
If I need to reduce a hash of some data to a smaller size (n bits), is there any way to ensure the search space is still 2^n?
I think you're actually interested in three things.
The first thing you need to understand is the entropy distribution of the hash. If the output of a hash function is n bits long, then the maximum entropy is n bits. Note that I say maximum; you are never guaranteed to have n bits of entropy. Similarly, if you truncate the hash output to n/4 bits, you are not guaranteed to have n/4 bits of entropy in the result. SHA-256 is fairly uniformly distributed, which means in part that you are unlikely to have more entropy in the high bits than the low bits (or vice versa).
However, information on this is sparse because the hash function is intended to be used with its whole hash output. If you only need an 8-byte hash output, then you might not even need a cryptographic hash function and could consider other algorithms. (The point is, if you need a cryptographic hash function, then you need as many bits as it can give you, as shortening the output weakens the security of the function.)
The second is the search space: it is not dependent on the hash function at all. Searching for an input that creates a given output of a hash function is more commonly known as a brute-force attack. The number of inputs that would have to be searched does not depend on the hash function itself; how could it? Every hash function output is the same size: every SHA-256 output is 256 bits. If you just need a collision, you could find one specific input that generates each possible 256-bit output. Unfortunately, that would take a minimum storage space of 256 * 2^256 ≈ 3 * 10^79 bits for just the hash values themselves (i.e., not counting the inputs needed to generate them), which vastly eclipses the entire hard drive capacity of the world.
Therefore, the search space depends on the complexity and length of the input to the hash function. If your data is 8-character ASCII strings, then you're pretty well guaranteed never to have a collision, BUT the search space for those hash values is only 2^(7*8) = 2^56 ≈ 7.2 * 10^16 (7 bits per ASCII character), which is well within reach of a determined brute-force search. After all, you don't need to find a collision if you can find the original input itself. This is why salts are important in cryptography.
Third, you're interested in knowing the collision resistance. As GregS' linked article points out, the collision resistance of a space is much more limited than the input search space due to the pigeonhole principle.
Every hash function with more inputs than outputs will necessarily have collisions. Consider a hash function such as SHA-256 that produces 256 bits of output from an arbitrarily large input. Since it must generate one of 2^256 outputs for each member of a much larger set of inputs, the pigeonhole principle guarantees that some inputs will hash to the same output. Collision resistance doesn't mean that no collisions exist; simply that they are hard to find.
The "birthday paradox" places an upper bound on collision resistance: if a hash function produces N bits of output, an attacker who computes "only" 2N/2 (or sqrt(2N)) hash operations on random input is likely to find two matching outputs. If there is an easier method than this brute force attack, it is typically considered a flaw in the hash function.
So consider what happens when you examine and store only the first 8 bytes (one quarter) of your output. Your collision resistance has dropped from 2^(256/2) = 2^128 to 2^(64/2) = 2^32. How much smaller is 2^32 than 2^128? It's a whole lot smaller, as it turns out: the ratio is 2^-96, or roughly 10^-27 percent of the size at best.
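To make that drop concrete, here is a small brute-force sketch of my own (the function and its parameters are not from the answer): with a t-byte truncation the birthday bound predicts a collision after roughly 2^(8t/2) hashes, so a 4-byte truncation collides within seconds, while 8 bytes would already need on the order of 2^32 hashes.

import hashlib
import os

def find_truncated_collision(truncate_bytes=4):
    """Brute-force two inputs whose SHA-256 digests share their first bytes."""
    seen = {}
    tries = 0
    while True:
        data = os.urandom(16)                               # random input
        tag = hashlib.sha256(data).digest()[:truncate_bytes]
        if tag in seen and seen[tag] != data:
            return seen[tag], data, tries
        seen[tag] = data
        tries += 1

a, b, tries = find_truncated_collision(4)   # expect on the order of 2^16 tries
print(f"collision on the first 4 bytes after {tries} hashes:")
print(a.hex())
print(b.hex())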
I have a program where I deal with a lot of very small numbers (towards the lower end of the Double limits).
During the execution of my application, some of these numbers progressively get smaller meaning their "estimation" is less accurate.
My solution at the moment is to scale them up before I do any calculations and then scale them back down again afterwards.
...but it's got me thinking, am I actually gaining any more "accuracy" by doing this?
Thoughts?
Are your numbers really in the region between 10^-308 (smallest normalized double) and 10^-324 (smallest representable double, denormalized i.e. losing precision)? If so, then by scaling them up you do indeed gain accuracy by working around the limits of the exponent range of the double type.
I have to wonder though: what kind of application deals with numbers that extremely small? I know of no physical discipline that needs anything like that.
A double has a fixed number of bits for the significand (which determines the number of significant digits), and another fixed number of bits to represent the exponent (the "power" part).
In fact, you may therefore be dealing with two separate issues:
Regarding the exponent part: that is what approaching the limit of small doubles is about. Scaling your numbers up (by powers of 2) helps prevent them from becoming unrepresentable (see the sketch below).
When you write about the accuracy of the "estimation", I assume you refer to the number of significant digits: that is not related to the small-number limit. A number that is very small, but not too small in the sense of the lower limit for doubles, has the same number of significant digits as any "more normal" number.
Concerns about numerical precision of a number should, generally speaking, focus on how the number is computed, rather than on the absolute size of the result.
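A small Python sketch of my own (the specific values are arbitrary) illustrating both points: the relative spacing between adjacent doubles is about 2.2e-16 throughout the normal range but grows in the denormal range (i.e. significant digits are lost), while scaling by an exact power of two only shifts the exponent, so the round trip through the scale factor is exact.

import math   # math.ulp requires Python 3.9+

# Relative spacing of adjacent doubles: constant in the normal range,
# growing (fewer significant digits) once values become denormal.
for x in (1e-20, 1e-300, 1e-310, 1e-320):
    print(f"{x:8.1e}: relative ulp = {math.ulp(x) / x:.1e}")

# A power-of-two scale factor changes only the exponent, leaving the 53-bit
# significand untouched, so scaling up and back down loses nothing as long
# as the exponent range is not exceeded on either end.
x = 1e-305
scaled = x * 2.0**64
print(scaled * 2.0**-64 == x)   # True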
I know that, given, say, an MD5/SHA-1 hash of a value, reducing it from X bits (e.g. 128) to Y bits (e.g. 64) increases the possibility of birthday attacks, since information has been lost. Is there any easy-to-use tool/formula/table that will say what the probability of a "correct" guess will be when that length reduction occurs (compared to its original guess probability)?
Crypto is hard. I would recommend against trying to do this sort of thing. It's like cooking pufferfish: Best left to experts.
So just use the full length hash. And since MD5 is broken and SHA-1 is starting to show cracks, you shouldn't use either in new applications. SHA-2 is probably your best bet right now.
I would definitely recommend against reducing the bit count of a hash. There are too many issues at stake here. Firstly, how would you decide which bits to drop?
Secondly, it would be hard to predict how the dropping of those bits would affect the distribution of outputs in the new "shortened" hash function. A (well-designed) hash function is meant to distribute inputs evenly across the whole of the output space, not a subset of it.
By dropping half the bits you are effectively taking a subset of the original hash function, which might not have nearly the desirable properties of a properly designed hash function, and may lead to further weaknesses.
Well, since every extra bit in the hash doubles the number of possible hash values, every time you shorten the hash by one bit there are only half as many possible values, and thus the chance of guessing that random number doubles.
128 bits = 2^128 possibilities
64 bits = 2^64 possibilities
So by cutting the hash in half, you keep only 2^64 / 2^128 = 2^-64 of the original possibilities; in other words, a random guess is 2^64 times more likely to be correct.
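As for the "formula" part of the question, a common approximation (my own sketch, not from the answers above) is the birthday bound: with an n-bit hash and k uniformly distributed values, the probability of at least one collision is roughly 1 - exp(-k(k-1) / 2^(n+1)).

import math

def collision_probability(n_bits: int, k: int) -> float:
    """Birthday approximation: P(at least one collision among k n-bit hashes)."""
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * 2.0 ** n_bits))

# Truncating from 128 to 64 bits: a billion hashed values already gives a
# collision probability of a few percent, versus essentially zero at 128 bits.
for bits in (128, 64, 32):
    print(f"{bits:>3} bits, 10^9 values: {collision_probability(bits, 10**9):.3g}")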