Probability of SHA1 collisions - hash

Given a set of 100 different strings of equal length, how can you quantify how unlikely a SHA-1 digest collision among the strings is... ?

Are the 160 bit hash values generated
by SHA-1 large enough to ensure the
fingerprint of every block is unique?
Assuming random hash values with a
uniform distribution, a collection of
n different data blocks and a hash
function that generates b bits, the
probability p that there will be one
or more collisions is bounded by the
number of pairs of blocks multiplied
by the probability that a given pair
will collide.
(source : http://bitcache.org/faq/hash-collision-probabilities)

Well, the probability of a collision would be:
1 - ((2^160 - 1) / 2^160) * ((2^160 - 2) / 2^160) * ... * ((2^160 - 99) / 2^160)
Think of the probability of a collision of 2 items in a space of 10. The first item is unique with probability 100%. The second is unique with probability 9/10. So the probability of both being unique is 100% * 90%, and the probability of a collision is:
1 - (100% * 90%), i.e. 1 - ((10 - 0) / 10) * ((10 - 1) / 10), which simplifies to 1 - ((10 - 1) / 10) = 10%
It's pretty unlikely. You'd have to have many more strings for it to be a remote possibility.
Take a look at the hash-collision probability table on Wikipedia; just interpolate between the rows for 128 bits and 256 bits.
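For concreteness, here is a small Python sketch (my own illustration, not part of the original answers; the function names are made up) that evaluates both the exact product above and the pairs-times-pair-probability bound for 100 strings and a 160-bit hash:
from fractions import Fraction

def collision_probability(n, bits):
    """Exact birthday-problem probability of at least one collision
    among n uniformly random values of the given bit width."""
    space = 2 ** bits
    p_no_collision = Fraction(1)
    for i in range(n):
        p_no_collision *= Fraction(space - i, space)
    return 1 - p_no_collision

def pair_bound(n, bits):
    """Upper bound: number of pairs times the probability that a given pair collides."""
    return Fraction(n * (n - 1), 2) / 2 ** bits

print(float(collision_probability(100, 160)))  # ~3.4e-45
print(float(pair_bound(100, 160)))             # ~3.4e-45
Both numbers agree to several significant figures, which is why the pair-count bound quoted above is a perfectly good estimate at this scale.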

That's the Birthday Problem - the article provides nice approximations that make it quite easy to estimate the probability. The actual probability will be very, very low - see this question for an example.


Can Wolfram factor a 300-digit RSA number?

Everybody knows it's hard to factor a public key of over 100 digits, but 250-digit RSA numbers have already been factored, and Wolfram was able to factor a 300-digit number.
I tried factorizing public key n=144965985551789672595298753480434206198100243703843869912732139769072770813192027176664780173715597401753521552885189279272665054124083640424760144394629798590902883058370807005114592169348977123961322905036962506399782515793487504942876237605818689905761084423626547637902556832944887103223814087385426838463 using a simple prime factor program but it has encountered an error and it keeps repeating.
import math

n = 144965985551789672595298753480434206198100243703843869912732139769072770813192027176664780173715597401753521552885189279272665054124083640424760144394629798590902883058370807005114592169348977123961322905036962506399782515793487504942876237605818689905761084423626547637902556832944887103223814087385426838463

# Trial division downward from the integer square root.
# (math.isqrt avoids the float overflow that math.sqrt hits on an integer this large.)
p = math.isqrt(n)
while n % p != 0:
    p -= 1
q = n // p
print(p, q)
and here are the results: it just keeps running without producing the factors.
Next I tried the Sieve of Eratosthenes:
import time
import math

def sieve(b):
    global prime_list
    for a in prime_list:
        if a % prime_list[b] == 0 and a != prime_list[b]:
            prime_list.remove(a)
inp = 144965985551789672595298753480434206198100243703843869912732139769072770813192027176664780173715597401753521552885189279272665054124083640424760144394629798590902883058370807005114592169348977123961322905036962506399782515793487504942876237605818689905761084423626547637902556832944887103223814087385426838463
prime_list = []
i = 2
b = 0
t = time.time()
while i <= int(inp):
    prime_list.append(i)
    i += 1
while b < len(prime_list):
    sieve(b)
    b += 1
print(prime_list)
print("length of list: " + str(len(prime_list)))
print("time took: " + str(time.time() - t))
That doesn't work either. But I believe a 300-digit number can be factored; I just don't understand why so many programmers give up so easily and say it's impossible.
It is easier to factor a number if the factors are small. Here is a nice big 300 digit number for you:
100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
It’s pretty obvious what the factors are, right? The prime factors are 2^299 · 5^299, and that should be obvious from looking at the number.
So, some numbers are easier to factor than others.
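To see how quickly small factors fall out, here is a tiny Python sketch of my own (not from the answer; small_factor_trial_division is a made-up name):
def small_factor_trial_division(n, limit=1000):
    """Strip out all prime factors below limit; returns (factors, remaining cofactor)."""
    factors = {}
    for p in range(2, limit):
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
    return factors, n

print(small_factor_trial_division(10 ** 299))  # ({2: 299, 5: 299}, 1), found almost instantly
This finishes in a fraction of a second, whereas the same trial division makes essentially no progress on a number whose smallest factor has around 150 digits.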
RSA keys are made from two large prime numbers multiplied together, with no small factors. For a 309-digit number, the factors might each be over 150 digits. So if you try to use the sieve of Eratosthenes to factor such a number, your program will try to calculate all the primes up to 150 digits, and that is simply too many prime numbers to calculate.
The number of prime numbers with less than 150 digits is about 10^147, so your program would take at least that many processor cycles to finish. This number is so large that even if we used all of the computers in the entire world (maybe 10^21 or 10^22 operations per second), your program would take more than 10^117 years to run (again, using all of the computers in the entire world).
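As a rough sanity check of that count, here is a back-of-the-envelope sketch of my own using the prime number theorem, pi(x) ~ x / ln(x) (not something from the original answer):
import math

def approx_prime_count(digits):
    """Prime number theorem estimate of the number of primes below 10**digits."""
    x = 10.0 ** digits
    return x / (digits * math.log(10))  # pi(x) ~ x / ln(x)

print(approx_prime_count(150))  # ~2.9e147, i.e. on the order of 10^147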
I just don't understand why so many programmers who gave up that easily said it's impossible.
It’s because factoring numbers is known to be a very hard problem—the people who give up are giving up because they know what the state of the art algorithms are for factoring large numbers (the General Number Field Sieve) and know that 300 digits is just not feasible without some kind of radical new development in algorithms, computers, or engineering.

Good Values for Knuth's multiplication method - Hashing

Any guesses what are good values for the "c"?
// h(k) = ⌊m · (k · c − ⌊k · c⌋)⌋
return (int) Math.floor(distinctElements * (key * c - Math.floor(key * c)));
I have found this one: (Math.sqrt(5) - 1) / 2
However are there any other good choices known?
Kind regards
The value you've found, (√5 - 1)/2 ≈ 0.6180339887, is not the golden ratio φ itself but its multiplicative inverse, φ^-1 (φ = (1 + √5)/2 ≈ 1.618). That is exactly the constant Knuth recommends: φ^-1 and φ^-2 have the special property of tending to spread the hash values out optimally regardless of the range of your inputs.
However, for a sufficiently large range, pretty much any irrational number will do.
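A minimal Python sketch of the scheme (my own illustration; the names knuth_hash and PHI_INVERSE are made up), using c = φ^-1:
import math

PHI_INVERSE = (math.sqrt(5) - 1) / 2  # ~0.6180339887, the constant suggested above

def knuth_hash(key, m, c=PHI_INVERSE):
    """Multiplicative hashing: h(k) = floor(m * (k*c - floor(k*c)))."""
    fractional = (key * c) % 1.0   # k*c - floor(k*c)
    return int(m * fractional)     # bucket index in [0, m)

print([knuth_hash(k, 16) for k in range(10)])  # consecutive keys get spread over the buckets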

How is eps() calculated in MATLAB?

The eps routine in MATLAB essentially returns the positive distance from a floating-point number to the next larger floating-point number. It can take an optional argument, too.
My question: How does MATLAB calculate this value? (Does it use a lookup table, or does it use some algorithm to calculate it at runtime, or something else...?)
Related: how could it be calculated in any language providing bit access, given a floating point number?
Wikipedia has quite a thorough page on machine epsilon.
Specifically for MATLAB it's 2^(-52), as MATLAB uses double precision by default. The double-precision layout is one bit for the sign, 11 bits for the exponent and the remaining 52 bits for the fraction.
The MATLAB documentation on floating-point numbers also shows this:
d = eps(x), where x has data type single or double, returns the positive distance from abs(x) to the next larger floating-point number of the same precision as x.
Since floating-point numbers are not all equally closely spaced on the number line, different values have different distances to the next representable number, even within the same precision. Their bit representations are:
1.0 = 0 01111111111 0000000000000000000000000000000000000000000000000000
0.9 = 0 01111111110 1100110011001100110011001100110011001100110011001101
The sign for both is positive (0), but the exponents are not equal and of course their fractions are vastly different. This means that the gaps to the next floating-point number, i.e. the eps values, differ as well:
dec2bin(typecast(eps(1.0), 'uint64'), 64) = 0 01111001011 0000000000000000000000000000000000000000000000000000
dec2bin(typecast(eps(0.9), 'uint64'), 64) = 0 01111001010 0000000000000000000000000000000000000000000000000000
which are not the same, hence eps(0.9)~=eps(1.0).
Here is some insight into eps which will help you to write an algorithm.
See that eps(1) = 2^(-52). Now, say you want to compute the eps of 17179869183.9. Note that I have chosen a number which is 0.1 less than 2^34 (in other words, something like 2^(33.9999...)). To compute the eps of this, you can compute log2 of the number, which would be ~33.9999... as mentioned before. Take floor() of this number and add it to -52, since eps(1) = 2^(-52) and the given number is about 2^(33.999...). Therefore, log2(eps(17179869183.9)) = -52 + 33 = -19, i.e. eps(17179869183.9) = 2^(-19).
If you take a number which is fractionally more than 2^34, e.g. 17179869184.1, then log2(eps(17179869184.1)) = -18. This also shows that the eps value changes at the numbers that are integer powers of your base (or radix), in this case 2. Since the eps value only changes at integer powers of 2, we take the floor of the power. Using this you can get the exact value of eps for any number. I hope it is clear.
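Here is a small Python sketch of that recipe (my own illustration, not MATLAB code; it ignores zero, denormals, Inf and NaN): math.frexp exposes the exponent, from which the gap follows directly.
import math

def eps(x):
    """Spacing from abs(x) to the next larger double, computed from the exponent.
    frexp returns (m, e) with x = m * 2**e and 0.5 <= abs(m) < 1,
    so the gap is 2**(e - 53)."""
    _, e = math.frexp(abs(x))
    return 2.0 ** (e - 53)

print(eps(1.0))            # 2.220446049250313e-16  == 2^-52
print(eps(0.9))            # 1.1102230246251565e-16 == 2^-53
print(eps(17179869183.9))  # 2^-19
print(eps(17179869184.1))  # 2^-18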
MATLAB uses (along with other languages) the IEEE 754 standard for representing real floating-point numbers.
In this format the bits allocated for approximating the actual1 real number, usually 32 for single or 64 for double precision, are split into three groups:
1 bit for determining the sign, s.
8 (or 11) bits for the exponent, e.
23 (or 52) bits for the fraction, f.
A real number, n, is then approximated by the following three-term relation:
n = (-1)^s * 2^(e - bias) * (1 + f)
where the bias offsets the stored exponent downward2 so that both numbers between 0 and 1 and numbers greater than or equal to 1 can be described.
Now, the gap reflects the fact that real numbers do not map perfectly to their finite 32- or 64-bit representations; moreover, a whole range of real numbers that differ in absolute value by less than eps maps to a single value in computer memory. That is, if you assign values around val to several variables,
var_1 = val - offset
...
var_i = val;
...
var_n = val + offset
where
offset < eps(val) / 2
then:
var_1 = var_2 = ... = var_i = ... = var_n.
The gap is determined by the second term, the one containing the exponent (or characteristic):
2^(e - bias)
in the above relation3. It sets the "scale" of the "line" on which the approximated numbers are located: the larger the numbers, the larger the distance between them and the less precise they are; and vice versa, the smaller the numbers, the more densely located their representations are and, consequently, the more accurate.
In practice, to determine the gap around a specific number, eps(number), you can add or subtract a gradually increasing small quantity until the stored value of the number of interest changes; this gives you the gap in that (positive or negative) direction, i.e. about eps(number) / 2.
To check possible implementations of MATLAB's eps (or ULP, unit in the last place, as it is called in other languages), you could search for ULP implementations in C, C++ or Java, which are the languages MATLAB itself is written in.
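As a language-agnostic illustration (my own sketch, in Python rather than C/C++/Java): reinterpret the double as a 64-bit integer, add one to get the next representable value, and subtract.
import struct

def ulp(x):
    """Gap between x and the next larger representable double, via raw bit access.
    Assumes x is a positive, finite, normal number (no zero/denormal/Inf/NaN handling)."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]            # reinterpret the double as uint64
    next_up = struct.unpack('<d', struct.pack('<Q', bits + 1))[0]  # next representable double
    return next_up - x

print(ulp(1.0))  # 2.220446049250313e-16, the same value MATLAB's eps(1.0) returns
print(ulp(0.9))  # 1.1102230246251565e-16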
1. Real numbers are infinitely precise, i.e. they could be written with arbitrary precision, that is, with any number of digits after the decimal point.
2. Usually by about half the exponent range: in single precision the 8 exponent bits encode values from 0 to 2^8 - 1 = 255, and about half of that is 127, i.e. the factor becomes 2^(e - 127).
3. The term 2^(e - bias) can be thought of as representing the most significant digits of the number, i.e. the digits that describe how big the number is, as opposed to the least significant digits that describe its precise location. The larger the term containing the exponent, the smaller the significance of the 23 (or 52) bits of the fraction.

Expected chain length after rehashing - Linear Hashing

There is one confusion I have about the load factor. Some sources say that it is just the number of keys in the hash table divided by the total number of slots, which is the same as the expected chain length for each slot. But that is only under simple uniform hashing, right?
Suppose hash table T has n elements and we expand T into T1 by redistributing the elements in slot T[0], rehashing them using h'(k) = k mod 2m. The hash function of T1 is h(k) = k mod 2m if h(k) < 1, and k mod m if h(k) >= 1. Many sources say that we "expand and rehash to maintain the load factor" (does this imply the expected chain length is still the same?). Since this is not simple uniform hashing, I think the probability that any key k enters a given slot is 1/4 + 1/(2(m-1)).
For a (randomly selected) key k, h(k) is first evaluated (there is a 50-50 chance of it being less than 1 or greater than or equal to 1). If it is less than 1, key k has just two possible slots, slot 0 or slot m, hence probability 1/4 (1/2 * 1/2). But if it is greater than or equal to 1, it could enter any of the remaining m-1 slots, hence probability 1/2 * 1/(m-1). So the expected chain length would now be n/4 + n/(2(m-1)). Am I on the right track?
The calculation for linear hashing should be the same as for "non-linear" hashing. With a certain initial number of buckets, uniform distribution of hash values would result in uniform placement. With enough expansions to double the size of the table, each of those values would be randomly split over the larger space via the incremental re-hashing, and new values would also have been distributed over the larger space. Incrementally, each point is equally likely to be at (initial bucket position) and (2x initial bucket position) as the table expands to that length.
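To make that concrete, here is a small simulation sketch of my own (the name bucket_counts is made up): place uniformly random keys with k mod m, then with k mod 2m as if the table had been fully expanded, and compare the chain lengths.
import random
from collections import Counter

def bucket_counts(keys, m):
    """Chain length of each bucket when keys are placed with k mod m."""
    counts = Counter(k % m for k in keys)
    return [counts.get(b, 0) for b in range(m)]

m, n = 64, 10000
keys = [random.getrandbits(32) for _ in range(n)]

before = bucket_counts(keys, m)       # original table, h(k) = k mod m
after = bucket_counts(keys, 2 * m)    # fully expanded table, h'(k) = k mod 2m

print(sum(before) / m, max(before))       # mean n/m = 156.25, chains fluctuate around it
print(sum(after) / (2 * m), max(after))   # mean n/(2m) = 78.125, i.e. the load factor is restored
Both placements stay close to uniform, and doubling the number of buckets roughly halves the average chain length, which is the sense in which expansion maintains the load factor.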
There is a paper here which goes into detail about the chain length calculation under different circumstances (not just the average), specifically for linear hashing.

The number of correct decimal digits in a product of doubles with a large number of terms

What is a tight lower bound on the size of the set of irrational numbers, N, expressed as doubles in MATLAB on a 64-bit machine, that I multiply together while having confidence in k decimal digits of the product? What precision, for example, could I expect after multiplying together ~10^12 doubles encoding different random chunks of pi?
If you ask for a tight bound, the answer of @EricPostpischil gives the absolute (worst-case) error bound when all operations are performed in IEEE 754 double precision.
If you ask for confidence, I understand it as the statistical distribution of errors. Assuming a uniform distribution of each rounding error in [-e/2, e/2], you could ask for the theoretical distribution of the error after M operations on Math Stack Exchange... I guess the tight bound is somewhat conservative.
Let's illustrate an experimental estimation of those statistics with some Smalltalk code (any language having large integer/fraction arithmetic would do):
nOp := 500.
relativeErrorBound := ((1 + (Float epsilon asFraction / 2)) raisedTo: nOp * 2 - 1) - 1.0.
nToss := 1000.
stats := (1 to: nToss)
    collect: [:void |
        | fractions exactProduct floatProduct relativeError |
        fractions := (1 to: nOp) collect: [:e | 10000 atRandom / 3137].
        exactProduct := fractions inject: 1 into: [:prod :element | prod * element].
        floatProduct := fractions inject: 1.0 into: [:prod :element | prod * element].
        relativeError := (floatProduct asFraction - exactProduct) / exactProduct.
        relativeError].
s1 := stats detectSum: [:each | each].
s2 := stats detectSum: [:each | each squared].
maxEncounteredError := (stats detectMax: [:each | each abs]) abs asFloat.
estimatedMean := (s1 /nToss) asFloat.
estimatedStd := (s2 / (nToss-1) - (s1/nToss) squared) sqrt.
I get these results for the multiplication of nOp=20 doubles:
relativeErrorBound -> 4.440892098500626e-15
maxEncounteredError -> 1.250926201710214e-15
estimatedMean -> -1.0984634797115124e-18
estimatedStd -> 2.9607828266493842e-16
For nOp=100:
relativeErrorBound -> 2.220446049250313e-14
maxEncounteredError -> 2.1454964094158273e-15
estimatedMean -> -1.8768492273800676e-17
estimatedStd -> 6.529482793500846e-16
And for nOp=500:
relativeErrorBound -> 1.1102230246251565e-13
maxEncounteredError -> 4.550696454362764e-15
estimatedMean -> 9.51007740905571e-17
estimatedStd -> 1.4766176010100097e-15
You can observe that the standard deviation grows much more slowly than the error bound.
UPDATE: to first approximation, (1+e)^m = 1 + m*e + O((m*e)^2), so the distribution is approximately a sum of m uniform errors in [-e,e] as long as m*e is small enough, and such a sum is very near a normal (Gaussian) distribution of variance m*(2e)^2/12. You can check that std(sum(rand(100,5000))) is near sqrt(100/12) in MATLAB.
We can consider this still true for m = 2*10^12 - 1, that is approximately m = 2^41, m*e = 2^-12. In that case the global error follows a quasi-normal distribution, and the standard deviation of the global error is sigma = 2^-52*sqrt(2^41/12), or approximately sigma = 10^-10.
See http://en.wikipedia.org/wiki/Normal_distribution to compute P(abs(error)>k*sigma)
In 68% of cases (1 sigma), you'll have 10 digits of precision or more.
erfc(10/sqrt(2)) gives you the probability of having less than 9 digits of precision, about 1 case out of 6*10^22, so I'll let you compute the probability of having only 4 digits of precision (you can't evaluate it with double precision, it underflows)!!!
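For instance, a quick check of those tail probabilities in Python (my own sketch):
import math

def tail_probability(k_sigma):
    """P(|error| > k_sigma * sigma) for a normal distribution."""
    return math.erfc(k_sigma / math.sqrt(2))

print(tail_probability(1))   # ~0.317, i.e. ~68% of cases stay within 1 sigma
print(tail_probability(10))  # ~1.5e-23, roughly 1 case out of 6*10^22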
My experimental standard deviations were a bit smaller than the theoretical ones (2e-15, 9e-16 and 4e-16 for 500, 100 and 20 doubles respectively), but this must be due to the biased distribution of my input errors i/3137, i=1..10000...
That's a good way to remember that the result will be dominated by the distribution of errors in your inputs, which might exceed e if they themselves result from floating-point operations like M_PI*num/den.
Also, as Eric said, using only * is quite an ideal case; things might degenerate more quickly if you mix in +.
Last note: we could try to craft a list of inputs that reaches the maximum error bound by setting all elements to (1+e), which will be rounded to 1.0, hoping to obtain the maximum theoretical error bound; but that input distribution would be quite biased! HEM, WRONG: since all the multiplications are then exact, we only get (1+e)^n, not (1+e)^(2n-1), so about only half the error...
UPDATE 2: the inverse problem
Since you want the inverse: what is the length n of the sequence such that I get k digits of precision with a certain level of confidence 10^-c?
I'll answer only for k >= 8, because (m*e) << 1 is required in the above approximations.
Let's take c=7: getting k digits with a confidence of 10^-7 means 5.3*sigma < 10^-k.
sigma = 2*e*sqrt((2*n-1)/12), that is n = 0.5 + 1.5*(sigma/e)^2, with e = 2^-53.
Thus n ~ 3*2^105*sigma^2; since sigma^2 < 10^(-2k)/5.3^2, we can write n < 3*2^105*10^(-2k)/5.3^2.
Numerically, the probability of having less than k=9 digits is less than 10^-7 for a length n=4.3e12, and around n=4.3e10 for 10 digits.
We would reach n=4 numbers for 15 digits, but here our normal distribution hypothesis is very rough and does not hold, especially in the distribution tail at 5 sigmas, so use with caution (the Berry–Esseen theorem bounds how far from normal such a distribution is: http://en.wikipedia.org/wiki/Berry-Esseen_theorem).
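A quick numeric check of that bound in Python (my own sketch; max_length is a made-up name):
def max_length(k):
    """Rough maximum sequence length n giving k decimal digits with ~10^-7 confidence,
    using n < 3*2^105*10^(-2k)/5.3^2 from above."""
    return 3 * 2 ** 105 * 10.0 ** (-2 * k) / 5.3 ** 2

print(max_length(9))   # ~4.3e12
print(max_length(10))  # ~4.3e10
print(max_length(15))  # ~4.3, where the normal-distribution hypothesis no longer holds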
The relative error in M operations as described is at most (1+2^-53)^M - 1, assuming all input, intermediate, and final values do not underflow or overflow.
Consider converting a real number a0 to double precision. The result is some number a0•(1+e), where -2^-53 ≤ e ≤ 2^-53 (because conversion to double precision should always produce the closest representable value, the quantum for double-precision values is 2^-53 of the highest bit, and the closest value is always within half a quantum). For further analysis, we will consider the worst-case value of e, 2^-53.
When we multiply one (previously converted) value by another, the mathematically exact result is a0•(1+e) • a1•(1+e). The result of the calculation has another rounding error, so the calculated result is a0•(1+e) • a1•(1+e) • (1+e) = a0 • a1 • (1+e)^3. Obviously, this is a relative error of (1+e)^3. We can see the error accumulates simply as (1+e)^M for these operations: each operation multiplies all previous error terms by (1+e).
Given N inputs, there will be N conversions and N-1 multiplications, so the worst error will be (1+e)^(2N-1) - 1.
Equality for this error is achieved only for N≤1. Otherwise, the error must be less than this bound.
Note that an error bound this simple is possible only in a simple problem, such as this one with homogeneous operations. In typical floating-point arithmetic, with a mixture of addition, subtraction, multiplication, and other operations, computing a bound so simply is generally not possible.
For N=10^12 (M = 2•10^12 - 1), the above bound is less than 2.000222062•10^12 units of 2^-53, and is less than .0002220693. So the calculated result is good to something under four decimal digits. (Remember, though, you need to avoid overflow and underflow.)
(Note on the strictness of the above calculation: I used Maple to calculate 1000 terms of the binomial expansion of (1+2^-53)^(2•10^12-1) exactly (having removed the initial 1 term) and to add a value that is provably larger than the sum of all remaining terms. Then I had Maple evaluate that exact result to 1000 decimal digits, and it was less than the bound I report above.)
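A quick way to evaluate that bound numerically (my own sketch, using log1p/expm1 for accuracy rather than reproducing the exact Maple computation):
import math

def relative_error_bound(n_inputs):
    """Worst-case relative error (1 + 2^-53)^(2N - 1) - 1 for a product of N converted
    doubles: N conversions plus N - 1 multiplications."""
    m = 2 * n_inputs - 1
    return math.expm1(m * math.log1p(2.0 ** -53))

print(relative_error_bound(10 ** 12))  # ~2.22e-4, i.e. just under four good decimal digits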
A 64-bit floating-point number, assuming the IEEE 754 standard, has 52+1 bits of mantissa.
That means the relative precision is between 1.0000...0 and 1.0000...1, where the number of binary digits after the binary point is 52. (You can think of the 1.000...0 as what is stored in binary in the mantissa, a.k.a. the significand.)
The error is 1/2 to the power of 52, divided by 2 (half the resolution). Note that I chose the relative precision as close to 1.0 as possible, because that is the worst case (between 1.111..01 and 1.111..11 the representation is relatively more precise).
In decimal, the worst-case relative precision of a double is 1.11E-16.
If you multiply N doubles with this precision, the new relative precision (assuming no additional error due to intermediate rounding) is:
1 - (1 - 1.11E-16)^N
So if you multiply pi (or any double) 10^12 times, the upper bound on the error is:
1.1102e-004
That means you can have confidence in about 4-5 digits.
You can ignore intermediate rounding error if your CPU has support for extended precision floating point numbers for intermediate results.
If no extended-precision FPU (floating point unit) is used, rounding in intermediate steps introduces additional error (of the same size as the error due to each multiplication). That means a strict lower bound can be calculated as:
1 -
((1 - 1.11E-16) * (1 - 1.11E-16) * (1 - 1.11E-16)
* (1 - 1.11E-16) * (1 - 1.11E-16) % for multiplication, then rounding
... (another N-4 lines here) ...
* (1 - 1.11E-16) * (1 - 1.11E-16))
= 1-(1-1.11E-16)^(N*2-1)
If N is too large, this takes too long to evaluate term by term. The possible error (with intermediate rounding) is 2.2204e-004, which is double that without intermediate rounding, 1-(1 - 1.11E-16)^N = 1.1102e-004.
Approximately, we can say that intermediate rounding doubles the error.
If you multiplied pi 10^12 times and there was no extended-precision FPU, perhaps because you write intermediate steps to memory (and maybe do something else) before continuing (just make sure the compiler hasn't reordered your instructions in a way that keeps results accumulating in the FPU), then a strict upper bound on your relative error is:
2.22e-004
Note that confidence in a certain number of decimal places doesn't mean the answer will always be correct to exactly that many decimal places.
For example, if the answer is:
1.999999999999, and the error is 1E-5, the actual answer could be 2.000001234.
In this case, even the first decimal digit was wrong. But that really depends on how lucky you are (whether the answer falls on a boundary such as this).
This solution assumes that the doubles (including the answer) are all normalized. For denormalized results, obviously, the number of binary digits by which the result is denormalized reduces the accuracy by that many digits.