Calculating Hash Collisions with 160 bits

Assume a hash function that produces digests of 160 bits. How many messages do we need to hash to get a collision with approximately 75% probability?
Thank you for your help :)

The rule of thumb is that there's a 50% chance of a collision after about sqrt(n) values are drawn from n possibilities. The true number is slightly higher, but the square root is a good guideline. So in your case you have a 50% chance of a collision after 2^80 tries.
The other rule of thumb is that after 4*sqrt(n), your probability of getting a duplicate is nearly a certainty.
According to https://en.wikipedia.org/wiki/Birthday_problem#Probability_of_a_shared_birthday_(collision), you can compute the number n of values you need to draw to get a probability p of a duplicate by:
n = sqrt(2 * d * ln(1/(1-p)))
Where ln is the natural logarithm, and p is the probability from 0 to 1.0.
So in your case:
n = sqrt(2 * 2^160 * ln(1/.25))
n = sqrt(2^161 * 1.38629)
which comes out to about 2^80.7, slightly less than 2^81.
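As a sanity check, the formula is easy to evaluate directly; here is a Python sketch (Python's arbitrary-precision integers handle 2^160 without trouble):

```python
import math

d = 2 ** 160   # number of possible digests
p = 0.75       # target collision probability

# n = sqrt(2 * d * ln(1/(1-p)))
n = math.sqrt(2 * d * math.log(1 / (1 - p)))

print(math.log2(n))   # about 80.7, i.e. slightly less than 2^81
```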

Somewhere in the range of 2 septillion. That's 2,000,000,000,000,000,000,000,000 messages. Here's the equation:
chance of collision = 1 - e^(-n^2 / (2 * d))
Where n is the number of messages, d is the number of possibilities. So if d is 2^160, then n is going to be in the neighbourhood of 2^80.7.
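Plugging n = 2^80.7 back into that equation confirms the estimate; a quick Python check:

```python
import math

d = 2 ** 160   # number of possible digests
n = 2 ** 80.7  # number of messages hashed

# chance of collision = 1 - e^(-n^2 / (2 * d))
p = 1 - math.exp(-n ** 2 / (2 * d))
print(p)   # about 0.73, close to the 75% target
```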

Related

Possible "Traveling Salesman" function in Matlab?

I am looking to solve a Traveling Salesman type problem using a matrix in order to find the minimum time between transitions. The matrix looks something like this:
A = [inf 4 3 5;
1 inf 3 5;
4 5 inf 3;
6 7 1 inf]
The y-axis represents the "from" node and the x-axis represents the "to" node. I am trying to find the optimal time from node 1 to node 4. I was told that there is a Matlab function called "TravellingSalesman". Is that true, and if not, how would I go about solving this matrix?
Thanks!
Here's an outline of the brute-force algorithm to solve TSP for paths from node 1 to node n:
minCost = inf
minPath = []
for each permutation P of the nodes [2..n-1]
    // paths always start from node 1 and end on node n
    C = A(1,P(1)) + A(P(1),P(2)) + A(P(2),P(3)) + ... +
        A(P(n-3),P(n-2)) + A(P(n-2),n)
    if C < minCost
        minCost = C
        minPath = P
    elseif C == minCost        // you only need this branch if you want
        minPath = [minPath; P] // ALL paths with the shortest distance
    end
end
Note that the first and last terms in the sum are different because you know beforehand what the first and last nodes are, so you don't have to include them in the permutations. So in the example given, with n=4, there are actually only 2! = 2 possible paths.
The list of permutations can be precalculated using perms(2:n-1), but that might involve storing a large matrix ((n-2)! x (n-2)). Or you can calculate the cost as you generate each permutation. There are several files on the MathWorks File Exchange with names like nextPerm that should work for you. Either way, as n grows you're going to be generating a very large number of permutations and your calculations will take a very long time.
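The brute force above translates almost line for line; here is a Python sketch (using itertools.permutations in place of perms/nextPerm) run on the example matrix:

```python
from itertools import permutations

INF = float('inf')
A = [[INF, 4, 3, 5],
     [1, INF, 3, 5],
     [4, 5, INF, 3],
     [6, 7, 1, INF]]   # A[i][j] = cost from node i+1 to node j+1

n = len(A)
min_cost, min_paths = INF, []
for P in permutations(range(1, n - 1)):   # middle nodes, 0-indexed
    path = [0] + list(P) + [n - 1]        # always start at node 1, end at node n
    C = sum(A[a][b] for a, b in zip(path, path[1:]))
    if C < min_cost:
        min_cost, min_paths = C, [path]
    elif C == min_cost:
        min_paths.append(path)            # keep ALL shortest paths

print(min_cost, min_paths)   # 10 [[0, 1, 2, 3]], i.e. path 1-2-3-4
```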

What is the meaning of number 1e5?

I have seen in some codes that people define a variable and assign values like 1e-8 or 1e5.
for example
const int MAXN = 1e5 + 123;
What are these numbers? I couldn't find any thing on the web...
1e5 is a number expressed in scientific notation, and it means 1 multiplied by 10 to the 5th power (the e stands for 'exponent'),
so 1e5 equals 1*100000, which is 100000. The notations are interchangeable and mean the same thing.
1e5 means 1 × 10^5.
Similarly, 12.34e-9 means 12.34 × 10^(-9).
Generally, AeB means A × 10^B.
This is scientific notation for 1 × 10^5 = 100000.
1e5 is 100000. The 5 stands for the number of zeros you add behind the 1. For example, 1e7 puts 7 zeros behind the 1, giving 10,000,000. If the number is 1.234e6, the e6 instead means moving the decimal point 6 places to the right, giving 1,234,000.
The values like:
1e-8 or 1e5
means;
1e-8 = 1 * 10^(-8)
And
1e5 = 1 * 10^5
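A quick Python sketch of the same notation (note that in most languages, Python and C++ included, these literals are floating-point, which is why the question's `const int MAXN = 1e5 + 123;` relies on an implicit conversion to int):

```python
big = 1e5        # scientific notation: 1 x 10^5
small = 1e-8     # 1 x 10^(-8)

# the MAXN value from the question, with an explicit int conversion
MAXN = int(1e5) + 123

print(big, small, MAXN)   # 100000.0 1e-08 100123
```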

Reverse multiplication of 32-bit numbers

I have two large signed 32-bit numbers (Java ints) being multiplied together such that they'll overflow. Actually, I have one of the numbers and the result. Can I determine what the other operand was?
knownResult = unknownOperand * knownOperand;
Why? I have a string and a suffix being hashed with fnv1a. I know the resulting hash and the suffix, I want to see how easy it is to determine the hash of the original string.
This is the core of fnv1a:
hash ^= byte
hash *= PRIME
It depends. If the multiplier is even, at least one bit must inevitably be lost. So I hope that prime isn't 2.
If it's odd, then you can absolutely reverse it, just multiply by the modular multiplicative inverse of the multiplier to undo the multiplication.
There is an algorithm to calculate the modular multiplicative inverse modulo a power of two in Hacker's Delight.
For example, if the multiplier was 3, then you'd multiply by 0xaaaaaaab to undo (because 0xaaaaaaab * 3 = 1). For 0x01000193, the inverse is 0x359c449b.
You want to solve the equation y = prime * x for x, which you do by division in the finite ring modulo 2^32: x = y / prime.
Technically you do that by multiplying y by the multiplicative inverse of the prime modulo 2^32, which can be computed with the extended Euclidean algorithm.
Uh, division? Or am I not understanding the question?
It's not the fastest method, but something very easy to memorise is this:
unsigned inv(unsigned x) {
    unsigned xx = x * x;
    while (xx != 1) {
        x *= xx;
        xx *= xx;
    }
    return x;
}
It returns x**(2**n-1) (as in x*(x**2)*(x**4)*(x**8)*..., or x**(1+2+4+8+...)). As the loop exit condition implies, x**(2**n) is 1 when n is big enough, provided x is odd.
So, x**(2**n-1) equals x**(2**n)/x equals 1/x equals the thing you multiply x by to get the value 1 (mod 2**n). Which you then apply:
knownResult = unknownOperand * knownOperand
knownResult * inv(knownOperand) = unknownOperand * knownOperand * inv(knownOperand)
knownResult * inv(knownOperand) = unknownOperand * 1
or simply:
unknownOperand = knownResult * inv(knownOperand);
But there are faster ways, as given in other answers here. This one's just easy to remember.
Also, obligatory SO "use a library function" answer: BN_mod_inverse().
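For a quick check of the inverses quoted above, Python 3.8+ exposes the modular inverse directly through three-argument pow (0x01000193 is the 32-bit FNV prime):

```python
M = 2 ** 32
PRIME = 0x01000193   # 32-bit FNV prime

# pow with exponent -1 gives the modular multiplicative inverse (Python 3.8+)
assert pow(3, -1, M) == 0xaaaaaaab   # the inverse of 3 from the answer above

PRIME_INV = pow(PRIME, -1, M)
print(hex(PRIME_INV))   # 0x359c449b

# undoing one fnv1a step: hash *= PRIME is reversed by multiplying by PRIME_INV
h = (12345 * PRIME) % M
assert (h * PRIME_INV) % M == 12345
```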

Extremely large weighted average

I am using 64-bit MATLAB with 32 GB of RAM (just so you know).
I have a file (vector) of 1.3 million numbers (integers). I want to make another vector of the same length, where each point is a weighted average of the entire first vector, weighted by the inverse distance from that position (actually it's position ^-0.1, not ^-1, but for example purposes). I can't use matlab's 'filter' function, because it can only average things before the current point, right? To explain more clearly, here's an example of 3 elements
data = [ 2 6 9 ]
weights = [ 1 1/2 1/3; 1/2 1 1/2; 1/3 1/2 1 ]
results=data*weights= [ 8 11.5 12.666 ]
i.e.
8 = 2*1 + 6*1/2 + 9*1/3
11.5 = 2*1/2 + 6*1 + 9*1/2
12.666 = 2*1/3 + 6*1/2 + 9*1
So each point in the new vector is the weighted average of the entire first vector, weighting by 1/(distance from that position+1).
I could just remake the weight vector for each point, then calculate the results vector element by element, but this requires 1.3 million iterations of a for loop, each of which contains 1.3million multiplications. I would rather use straight matrix multiplication, multiplying a 1x1.3mil by a 1.3milx1.3mil, which works in theory, but I can't load a matrix that large.
I am then trying to make the matrix using a shell script and index it in matlab so only the relevant column of the matrix is called at a time, but that is also taking a very long time.
I don't have to do this in matlab, so any advice people have about utilizing such large numbers and getting averages would be appreciated. Since I am using a weight of ^-0.1, and not ^-1, it does not drop off that fast - the millionth point is still weighted at 0.25 compared to the original points weighting of 1, so I can't just cut it off as it gets big either.
Hope this was clear enough?
Here is the code for the answer below (so it can be formatted?):
data = load('/Users/mmanary/Documents/test/insertion.txt');
data=data.';
total=length(data);
x=1:total;
datapad=[zeros(1,total) data];
weights = ([(total+1):-1:2 1:total]).^(-.4);
weights = weights/sum(weights);
Fdata = fft(datapad);
Fweights = fft(weights);
Fresults = Fdata .* Fweights;
results = ifft(Fresults);
results = results(1:total);
plot(x,results)
The only sensible way to do this is with FFT convolution, which underpins the filter function and similar. It is very easy to do manually:
% Simulate some data
n = 10^6;
x = randi(10,1,n);
xpad = [zeros(1,n) x];
% Setup smoothing kernel
k = 1 ./ [(n+1):-1:2 1:n];
% FFT convolution
Fx = fft(xpad);
Fk = fft(k);
Fxk = Fx .* Fk;
xk = ifft(Fxk);
xk = xk(1:n);
Takes less than half a second for n=10^6!
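The same trick is a few lines in NumPy; here it is run on the 3-element example from the question, reproducing [8, 11.5, 12.666...] (a sketch using the 1/(distance+1) weights from the example, not the ^-0.1 ones):

```python
import numpy as np

x = np.array([2.0, 6.0, 9.0])
n = len(x)

# zero-pad so the circular FFT convolution yields the full linear result
xpad = np.concatenate([np.zeros(n), x])

# kernel: weight 1/(distance+1), laid out to line up with the padding
k = 1.0 / np.concatenate([np.arange(n + 1, 1, -1), np.arange(1, n + 1)])

res = np.fft.ifft(np.fft.fft(xpad) * np.fft.fft(k)).real[:n]
print(res)   # approximately [8, 11.5, 12.667]
```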
This is probably not the best way to do it, but with lots of memory you could definitely parallelize the process.
You can construct sparse matrices consisting of entries of your original matrix which have value i^(-1) (where i = 1 .. 1.3 million), multiply them with your original vector, and sum all the results together.
So for your example the product would be essentially:
a = rand(3,1);
b1 = [1 0 0;
0 1 0;
0 0 1];
b2 = [0 1 0;
1 0 1;
0 1 0] / 2;
b3 = [0 0 1;
0 0 0;
1 0 0] / 3;
c = sparse(b1) * a + sparse(b2) * a + sparse(b3) * a;
Of course, you wouldn't construct the sparse matrices this way. If you wanted to have less iterations of the inside loop, you could have more than one of the i's in each matrix.
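With SciPy the banded matrices above can be built as sparse diagonals in one call; here's a sketch on the question's 3-element example (scipy.sparse.diags stacks all the per-distance bands, equivalent to b1 + b2 + b3):

```python
import numpy as np
from scipy.sparse import diags

a = np.array([2.0, 6.0, 9.0])
n = len(a)

# one diagonal per offset d, with weight 1/(|d|+1)
offsets = list(range(-(n - 1), n))
bands = [np.full(n - abs(d), 1.0 / (abs(d) + 1)) for d in offsets]
W = diags(bands, offsets)

print(W @ a)   # approximately [8, 11.5, 12.667]
```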
Look into the parfor loop in MATLAB: http://www.mathworks.com/help/toolbox/distcomp/parfor.html
I can't use matlab's 'filter' function, because it can only average
things before the current point, right?
That is not correct. You can always pad samples (i.e., add or remove zeros) in your data or in the filtered data. Since filtering with filter (you can also use conv, by the way) is a linear operation, padding doesn't change the result: adding zeros does nothing, and linearity lets you reorder the steps as add samples -> filter -> remove samples.
Anyway, in your example, you can take the averaging kernel to be:
weights = 1 ./ [3 2 1 2 3]; % this kernel introduces a delay of 2 samples
and then simply:
result = filter(weights, 1, [data, zeros(1,3)]); % or conv(data, weights)
% remove the delay introduced by the kernel
result = result(3:end-1);
You considered only 2 options:
Multiplying 1.3M*1.3M matrix with a vector once or multiplying 2 1.3M vectors 1.3M times.
But you can divide your weight matrix to as many sub-matrices as you wish and do a multiplication of n*1.3M matrix with the vector 1.3M/n times.
I assume the fastest option will have the smallest number of iterations, with n chosen to create the largest sub-matrix that fits in your memory without making your computer start swapping pages to your hard drive.
With your memory size you should start with n = 5000.
You can also make it faster by using parfor (with n divided by the number of processors).
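That chunked matrix-vector product can be sketched in NumPy (small sizes here for illustration; the weights are the (distance+1)^-0.1 from the question, and only one block of the big matrix exists in memory at a time):

```python
import numpy as np

n, chunk = 8, 3
x = np.arange(1.0, n + 1)
idx = np.arange(n)

res = np.empty(n)
for start in range(0, n, chunk):
    rows = idx[start:start + chunk]
    # build just this block of the full n x n weight matrix
    Wblock = (np.abs(rows[:, None] - idx[None, :]) + 1.0) ** -0.1
    res[start:start + chunk] = Wblock @ x

# same answer as the full (memory-hungry) product
Wfull = (np.abs(idx[:, None] - idx[None, :]) + 1.0) ** -0.1
print(np.allclose(res, Wfull @ x))   # True
```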
The brute force way will probably work for you, with one minor optimisation in the mix.
The ^-0.1 operations to create the weights will take a lot longer than the + and * operations to compute the weighted-means, but you re-use the weights across all the million weighted-mean operations. The algorithm becomes:
Create a weightings vector with all the weights any computation would need:
weights = (abs(-(n-1):(n-1)) + 1).^-0.1; % one weight per possible offset; abs and +1 keep the base positive
For each element in the vector:
Index the relevant portion of the weights vector to treat the current element as the 'centre'.
Perform the weighted-mean with the weights portion and the entire vector. This can be done with a fast vector dot-multiply followed by a scalar division.
The main loop does n^2 multiplications and n^2 additions. With n equal to 1.3 million that's 3.4 trillion operations. A single core of a modern 3 GHz CPU can do, say, 6 billion additions/multiplications a second, so that comes out to around 10 minutes. Add time for indexing the weights vector and overheads, and I still estimate you could come in under half an hour.
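The algorithm above sketched in NumPy (small n for illustration; note the precomputed weights use abs(distance)+1 so zero and negative offsets are well defined):

```python
import numpy as np

n = 5
x = np.arange(1.0, n + 1)

# all weights any element will ever need: index n-1 is distance 0
w = (np.abs(np.arange(-(n - 1), n)) + 1.0) ** -0.1

res = np.empty(n)
for i in range(n):
    wi = w[n - 1 - i : 2 * n - 1 - i]   # slide the window so i is the centre
    res[i] = (wi @ x) / wi.sum()        # fast dot product, then scalar division
```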

MATLAB: unwrap function

I'm in a discussion with someone from Mathworks re: the unwrap function which has a "bug" in it for jump tolerances other than π, and would like to get some other perspectives:
Description
Q = unwrap(P) corrects the radian phase angles in a vector P by adding multiples of ±2π when absolute jumps between consecutive elements of P are greater than or equal to the default jump tolerance of π radians. If P is a matrix, unwrap operates columnwise. If P is a multidimensional array, unwrap operates on the first nonsingleton dimension.
Q = unwrap(P,tol) uses a jump tolerance tol instead of the default value, π.
There are two possible interpretations of the documentation:
Q = unwrap(P,tol) corrects the radian phase angles in a vector P by adding multiples of ±2π when absolute jumps between consecutive elements of P are greater than or equal to tol radians. If P is a matrix, unwrap operates columnwise. If P is a multidimensional array, unwrap operates on the first nonsingleton dimension.
Example:
>> x = mod(0:20:200,100); unwrap(x, 50)
ans =
0 20.0000 40.0000 60.0000 80.0000 81.6814 101.6814 121.6814 141.6814 161.6814 163.3628
Q = unwrap(P,tol) corrects the elements in a vector P by adding multiples of ±2*tol when absolute jumps between consecutive elements of P are greater than or equal to tol. If P is a matrix, unwrap operates columnwise. If P is a multidimensional array, unwrap operates on the first nonsingleton dimension.
Example:
>> x = mod(0:20:200,100); unwrap(x, 50)
ans =
0 20 40 60 80 100 120 140 160 180 200
The actual behavior of unwrap() in MATLAB (at least up to R2010a) is #1. My interpretation of unwrap() is that it's supposed to be #2, and therefore there is a bug in the behavior. If unwrap()'s behavior matched #2, then unwrap could be used as an inverse for mod for slowly-varying inputs, i.e. unwrap(mod(x,T),T/2) = x for vectors x where successive elements vary by less than tol=T/2.
Note that this 2nd interpretation is more general than angles, and can unwrap anything with a wraparound period T. (whether a default of T=2π for radians, 360 for degrees, 256 for 8-bit numbers, 65536 for 16-bit numbers, etc.)
So my question is:
Are there possible uses for behavior #1? Which interpretation makes more sense?
Interpretation #1 is how I read the documentation, and I think it makes sense. I could imagine using it for reconstructing the driven distance from a wheel encoder. For slow speeds the tolerance doesn't matter, but for high speeds (high enough to violate the sampling theorem, i.e. you have fewer than two samples per wheel rotation), the tolerance helps you get the right reconstruction if you know the direction.
Another reason why #1 makes more sense is probably that the ordinary unwrap can be extended easily to a generic one and therefore there's no direct need for the period to be a parameter.
% example for 16-bit integers
>> x1 = [10 5 0 65535 65525];
T = 65536;
x2 = T * unwrap(x1 * 2 * pi / T) / (2 * pi)
x2 =
10.0000 5.0000 0 -1.0000 -11.0000
Or just make your own function:
function ret = generic_unwrap(x, T)
    ret = T * unwrap(x * 2 * pi / T) / (2 * pi);
end
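The same rescaling trick works outside MATLAB too; here is the 16-bit example as a NumPy sketch:

```python
import numpy as np

T = 65536
x1 = np.array([10, 5, 0, 65535, 65525], dtype=float)

# rescale to radians, unwrap with the default pi tolerance, rescale back
x2 = T * np.unwrap(x1 * 2 * np.pi / T) / (2 * np.pi)
print(x2)   # approximately [10, 5, 0, -1, -11]
```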
Behavior #1 makes sense, since the input is assumed to be in radians, not degrees. The adjustment adds multiples of ±2π when you're above the jump tolerance, so that's fine.
What would be nice is if unwrap had a feature that allowed it to work on any kind of series, not just radian angles.
The jump tolerance is not sufficient to tell whether you have a series in radian, or degree, or any other kind, so there would need to be an additional input.
I had always assumed that the second behavior was the actual one, but never tested it. A literal reading of the help file does indicate behavior #1. But that's not what one would ever want to do. As a simple example, consider doing an unwrapping in degrees:
x = mod(0:30:720, 360)
y = unwrap(x,180)
obviously you would want y = 0:30:720, but instead you get ...
y =
Columns 1 through 7
0 30.0000 60.0000 90.0000 120.0000 150.0000 180.0000
Columns 8 through 14
210.0000 240.0000 270.0000 300.0000 330.0000 333.0088 363.0088
Columns 15 through 21
393.0088 423.0088 453.0088 483.0088 513.0088 543.0088 573.0088
Columns 22 through 25
603.0088 633.0088 663.0088 666.0176
which is wrong (y no longer corresponds to the same angle as x, which is the point of unwrap)
Can anyone give an example of when you would want behavior #1 (the current behavior?)
x = mod(0:30*pi/180:4*pi, 2*pi);
y = unwrap(x)*180/pi;
It works in radians, but not in degrees.
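For what it's worth, interpretation #2 is exactly what numpy.unwrap implements in recent NumPy versions (1.21+ added a period argument), so the degrees example round-trips:

```python
import numpy as np

# the degrees example: wrapped input with period 360
x = np.mod(np.arange(0, 721, 30), 360)

# period=360 sets the wraparound; the jump tolerance defaults to period/2
y = np.unwrap(x.astype(float), period=360)
print(np.allclose(y, np.arange(0, 721, 30)))   # True
```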