Boolean matrix multiplication in Julia - boolean

I need to multiply two boolean matrices in Julia.
Doing simply A*A or A^2 returns an Int64 Matrix.
Is there an efficient way to multiply boolean matrices?

Following Oscar's comment, this wraps two for loops around your code. It does not use the LoopVectorization improvement, but it also avoids allocating the full array inside the any call (so that any stops at the first true value). It is decently fast (edit: replaced the standard AND & with the short-circuit &&):
function bool_mul2(A, B)
mA, nA = size(A)
mB, nB = size(B)
nA ≠ mB && error()
AB = BitArray(undef, mA, nB)
for i in 1:mA, j in 1:nB
AB[i,j] = any(A[i,k] && B[k,j] for k in 1:nA)
end
AB
end
(Note that I removed the [ and ] inside the any call to avoid allocating there.)
E.g., with A and B of size 1000×1000, I get
julia> @btime bool_mul2($A, $B) ;
16.128 ms (3 allocations: 122.25 KiB)
compared to
julia> @btime bool_mul($A, $B) ;
346.374 ms (12 allocations: 7.75 MiB)
EDIT: For squaring the matrix, maybe try
function bool_square(A)
m, n = size(A)
m ≠ n && error()
A² = BitArray(undef, n, n)
for i in 1:n, j in 1:n
A²[i,j] = any(A[i,k] && A[k,j] for k in 1:n)
end
A²
end
for which I get
julia> A = rand(Bool, 500, 500) ;
julia> @btime $A * $A .!= 0 ;
42.483 ms (12 allocations: 1.94 MiB)
julia> @btime bool_square($A) ;
4.653 ms (3 allocations: 30.69 KiB)
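The short-circuiting idea behind bool_mul2 is not specific to Julia. As a language-neutral illustration, here is a minimal Python sketch; the name bool_mul_sketch and the list-of-lists matrices are assumptions made for this example, not part of the answer above. Because any receives a generator rather than a list, the scan over k stops at the first true product and no temporary row is built.

```python
def bool_mul_sketch(A, B):
    """Boolean matrix product of two list-of-lists matrices (hypothetical helper)."""
    n_inner = len(B)
    if any(len(row) != n_inner for row in A):
        raise ValueError("inner dimensions must agree")
    cols = list(zip(*B))  # columns of B, built once
    # any() consumes the generator lazily, so it stops at the first True.
    return [[any(a and b for a, b in zip(row, col)) for col in cols]
            for row in A]


A = [[True, False], [False, False]]
B = [[False, True], [True, True]]
C = bool_mul_sketch(A, B)
```

This only shows the control flow. It says nothing about the timings quoted above, which depend on Julia's compiled loops and BitArray storage.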

One very simple solution is
function bool_mul(A,B)
return A*B .!= 0
end
This won't be the most efficient since it will allocate a matrix for A*B, but might end up being one of the fastest solutions available.

Related

Good Julia code for a bifurcation diagram

I am in the process of migrating all my Matlab code into Julia. I have an old Matlab script that produces the standard bifurcation diagram for the logistic map, quite fast: ≈0.09s for the loop, ≈0.11s for the plot to come out (Matlab 2019a). The logistic map is well known and I will be brief here: x(t+1) = r*x(t)*(1-x(t)), 0 ≤ r ≤ 4. For accuracy, I choose maxiter = 1000 and r = LinRange(0.0,4.0, 6001).
I have tried to rewrite my Matlab code into Julia, but I am still a clumsy Julia programmer. The best I could come up with was to get 1.341015 seconds (44.79 M allocations: 820.892 MiB, 4.60% gc time) for the loop to run, and Plots.jl takes 2.266 s (6503764 allocations: 283.60 MiB) to save a pdf file of the plot (not bad), while it takes around 17s to get the plot to be seen in the Atom plots pane (that's OK with Plots). This was done with Julia 1.5.3 (both in Atom and VS Code).
I would be grateful if someone could provide some help with my Julia code below. It runs, but it looks a bit primitive and slow. I tried to change the style and looked for performance tips (@inbounds, @simd, @avx), but always got stuck in one problem or another. It simply does not make sense to have the same loop 15 times faster in Matlab than in Julia, and I know that. Actually, there is a piece of Matlab code that I particularly like (by Steve Brunton), which is extremely simple and elegant, and (apparently) easy to rewrite in Julia; but I stumbled here as well. Brunton's loops run in just ≈0.04s, and can be found below.
Help will be appreciated. Thanks. The plot is the usual one:
[Image: bifurcation diagram of the logistic map]
My Matlab code:
tic
hold on % required for plotting all points
maxiter = 1000;
r1 = 0;
r4 = 4;
Tot = 6001;
r = linspace(r1, r4, Tot); % Number of r values (6001 points)
np = length(r);
y = zeros(maxiter+1, Tot); % Pre-allocation
y(1,1:np) = 0.5; % Generic initial condition
for n = 1 : maxiter
y(n+1,:) = r.*y(n,:) .* (1-y(n,:)); % Iterates all the r values at once
end
toc
tic
for n = maxiter-100 : maxiter+1 % eliminates transients
plot(r,y(n,:),'b.','Markersize',0.01)
grid on
yticks([0 0.2 0.4 0.6 0.8 1])
end
hold off
toc
My Julia code:
using Plots
using BenchmarkTools
#using LoopVectorization
#using SIMD
@time begin
rs = LinRange(0.0,4.0, 6001)
#rs = collect(rs)
x1 = 0.5
maxiter = 1000 # maximum iterations
x = zeros(length(rs), maxiter) # for each starting condition (across rows)
#for k = 1:length(rs)
for k in eachindex(rs)
x[k,1] = x1 # initial condition
for j = 1 : maxiter-1
x[k, j+1] = rs[k] * x[k, j] * (1 - x[k,j])
end
end
end
@btime begin
plot(rs, x[:,end-50:end], #avoiding transients
seriestype = :scatter,
markercolor=:blue,
markerstrokecolor=:match,
markersize = 1,
markerstrokewidth = 0,
legend = false,
markeralpha = 0.3)
#xticks! = 0:1:4
xlims!(0.01,4)
end
Steve Brunton's Matlab code:
tic
xvals=[];
for beta = linspace(0,4,6001)
beta;
xold = 0.5;
%transient
for i = 1:500
xnew = (beta*(xold-xold^2));
xold = xnew;
end
%xnew = xold;
xss = xnew;
for i = 1:1000;
xnew = ((xold-xold^2)*beta);
xold = xnew;
xvals(1,length(xvals)+1) = beta; % saving beta values
xvals(2,length(xvals)) = xnew; % saving xnew values
if (abs(xnew-xss) < .001)
break
end
end
end
toc
tic
plot (xvals(1,:),xvals(2,:),'b.', 'Linewidth', 0.1, 'Markersize', 1)
grid on
%xlim([2.5 4])
toc
@time begin
rs = LinRange(0.0,4.0, 6001)
x1 = 0.5
maxiter = 1000 # maximum iterations
x = zeros(length(rs), maxiter)
for k in eachindex(rs)
x[k,1] = x1 # initial condition
for j = 1 : maxiter-1
x[k, j+1] = rs[k] * x[k, j] * (1 - x[k,j])
end
end
end
shows:
1.490238 seconds (44.79 M allocations: 820.892 MiB, 5.81% gc time)
whereas
@time begin
let
rs = LinRange(0.0,4.0, 6001)
x1 = 0.5
maxiter = 1000 # maximum iterations
x = zeros(length(rs), maxiter)
for k in eachindex(rs)
x[k,1] = x1 # initial condition
for j = 1 : maxiter-1
x[k, j+1] = rs[k] * x[k, j] * (1 - x[k,j])
end
end
end
end
shows
0.044452 seconds (2 allocations: 45.784 MiB, 29.09% gc time)
let introduces a new scope, so your problem is that you're running in global scope.
The compiler finds it difficult to optimise code in global scope, because any variables can be accessed from any location in your (possibly 1000s of lines of) source code. As described in the manual, this is the number 1 reason why code can run slower than it should.
I do not know why you were flagged; maybe this way of communicating is not very suitable for StackOverflow. I can recommend https://discourse.julialang.org, which is better suited to long discussions.
Regarding your code, there are a couple of things that can be improved.
using Plots
using BenchmarkTools
using LoopVectorization
function bifur(rs, xs, maxiter)
x = zeros(length(rs), maxiter) # for each starting condition (across rows)
bifur!(x, rs, xs, maxiter)
end
function bifur!(x, rs, xs, maxiter)
# @avx - LoopVectorization is broken on julia nightly, so I had to switch to other options
@inbounds @simd for k = 1 : length(rs) # macro to vectorize the loop
x[k,1] = xs # initial condition
for j = 1 : maxiter-1
x[k, j+1] = rs[k] * x[k, j] * (1 - x[k,j])
end
end
return x
end
As you can see, I split bifur into two functions: a mutating one, denoted with an exclamation mark, and a non-mutating one, named just bifur. This pattern is common in Julia and makes the code easier to benchmark and reuse. If you want the faster version that does not allocate, use the mutating version. If you want the slower version with a guaranteed result (i.e. one that does not change between different runs), use the non-mutating version.
Here you can see benchmark results
julia> @benchmark bifur($rs, $xs, $maxiter)
BenchmarkTools.Trial:
memory estimate: 45.78 MiB
allocs estimate: 2
--------------
minimum time: 38.556 ms (0.00% GC)
median time: 41.440 ms (0.00% GC)
mean time: 45.371 ms (10.08% GC)
maximum time: 170.765 ms (77.25% GC)
--------------
samples: 111
evals/sample: 1
This looks reasonable: the best runtime is 38 ms, and the 2 allocations apparently come from allocating x. Note also that xs and the other variables were interpolated with the $ symbol. This helps to benchmark properly; more can be read in the BenchmarkTools.jl manual.
We can compare it with mutation version
julia> @benchmark bifur!(x, $rs, $xs, $maxiter) setup=(x = zeros(length($rs), $maxiter)) evals=1
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 25.574 ms (0.00% GC)
median time: 27.556 ms (0.00% GC)
mean time: 27.717 ms (0.00% GC)
maximum time: 35.223 ms (0.00% GC)
--------------
samples: 98
evals/sample: 1
I had to add a setup phase, and now you can see that there are no allocations, as expected, and it runs slightly faster; the 13 ms difference is the initialization of x.
A note: your benchmark results are wrong, since you do not reset x between runs (no setup phase), so x was properly initialized on the first run, but on all other runs no additional calculations were performed. So you get something like 100 runs, with the first run taking 30 ms and all other runs taking 4 ms, so on average you get 4 ms.
Now, you can use this function straightforwardly
x = bifur(rs, xs, maxiter)
@btime begin
plot($rs, @view($x[:, end-50:end]), #avoid transients
seriestype = :scatter,
markercolor=:blue,
markerstrokecolor=:match,
markersize = 1.1,
markerstrokewidth = 0,
legend = false,
markeralpha = 0.3)
#xticks! = 0:1:4
xlims!(2.75,4)
#savefig("zee_bifurcation.pdf")
end
Note that here I use interpolation again, and also @view for proper array slicing. The thing is, expressions like x[:, end-50:end] create a new copy of the array, which can sometimes slow down calculations. Of course, this is of little importance in the case of Plots.jl, but it can be useful in other calculations.
I was frustrated by the Julia code's poor style and performance for the bifurcation diagram of the logistic map I presented above. I am a kind of a newbie to Julia and expected some help for a problem that is not that difficult for an experienced Julia programmer. Well, looking for help, I just got knocked on my head with flags, telling me that I was deviating from my question or something else. I was not.
I feel better now. The code below runs two loops (inside a function), filling in a 6001×1000 Array{Float64,2} in just 3.276 ms (0 allocations: 0 bytes), and the plot comes out in 17.660 ms (83656 allocations: 11.32 MiB) right on the Atom's plots pane. I think it can still be improved, but (for me) it's OK as it stands. Julia rocks.
using Plots
using BenchmarkTools
using LoopVectorization
function bifur(rs, x, xs, maxiter)
@avx for k = 1 : length(rs) # macro to vectorize the loop
x[k,1] = xs # initial condition
for j = 1 : maxiter-1
x[k, j+1] = rs[k] * x[k, j] * (1 - x[k,j])
end
end
#return x
end
rs = LinRange(0.0, 4.0, 6001) #rs = collect(rs)
xs = 0.5
maxiter = 1000 # maximum iterations
x = zeros(length(rs), maxiter) # for each starting condition (across rows)
@btime begin
bifur(rs, x, xs, maxiter)
end
@benchmark bifur(rs, x, xs, maxiter)
@btime begin
plot(rs, x[:, end-50:end], #avoid transients
seriestype = :scatter,
markercolor=:blue,
markerstrokecolor=:match,
markersize = 1.1,
markerstrokewidth = 0,
legend = false,
markeralpha = 0.3)
#xticks! = 0:1:4
xlims!(2.75,4)
#savefig("zee_bifurcation.pdf")
end
The plot should look like this:
Does anyone have an idea of how to adapt Brunton's above Matlab code to Julia? I kind of like his style. It looks beautiful for teaching.
In the meantime, I learned that the @avx macro from LoopVectorization.jl should not be applied to two loops that are not independent (which is the case with these two loops), and that if I change the order of the loops, performance improves 5x. So, a better piece of code to produce a bifurcation diagram for the logistic map would be something like this:
using Plots
using BenchmarkTools
#using LoopVectorization
using SIMD
function bifur!(rs, x, xs, maxiter)
x[:,1] .= xs
for j in 1:maxiter-1 #the j loop needs to be sequential, so should be the outer loop
@inbounds @simd for k in 1:length(rs) # the k loop should be the innermost since it is first on x.
x[k, j+1] = rs[k] * x[k, j] * (1 - x[k,j])
end
end
return x
end
rs = LinRange(0.0, 4.0, 6001) #rs = collect(rs)
xs = 0.5
maxiter = 1000 # maximum iterations
x = zeros(length(rs), maxiter)
bifur!(rs, x, xs, maxiter)
@benchmark bifur!($rs, x, $xs, $maxiter) setup=(x = zeros(length($rs), $maxiter)) evals=1
@btime begin
plot(rs, x[:, end-50:end], #avoid transients
#plot($rs, @view($x[:, end-50:end]),
seriestype = :scatter,
markercolor=:blue,
markerstrokecolor=:match,
markersize = 1.1,
markerstrokewidth = 0,
legend = false,
markeralpha = 0.3)
#xticks! = 0:1:4
xlims!(2.75,4)
#savefig("logistic_bifur.pdf")
end
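The loop-ordering point above (the iteration index j must advance sequentially, while the sweep over r values is independent and can sit in the inner loop) can be sketched in a language-neutral way. The following is a minimal Python illustration under stated assumptions: the function name logistic_tail and its keep parameter are hypothetical, and unlike the Julia code it stores only the last keep iterates instead of the full matrix.

```python
def logistic_tail(rs, x0=0.5, maxiter=1000, keep=50):
    """Iterate x -> r*x*(1-x) for every r in rs and return the last `keep` iterates.

    The outer loop is over iterations (inherently sequential); the inner
    comprehension runs over the independent r values.
    """
    xs = [x0] * len(rs)
    tail = []
    for j in range(maxiter - 1):
        xs = [r * x * (1.0 - x) for r, x in zip(rs, xs)]
        if j >= maxiter - 1 - keep:  # past the transient: record this row
            tail.append(xs)
    return tail


rs = [0.0, 2.0, 2.5]
tail = logistic_tail(rs, maxiter=200, keep=10)
```

Each row of tail holds one iterate for all r values, so plotting every row against rs gives the points of the diagram.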

MATLAB - Secant method produces NaN

I'm writing a secant method in MATLAB, which I want to iterate through exactly n times.
function y = secantmeth(f,xn_2,xn_1,n)
xn = (xn_2*f(xn_1) - xn_1*f(xn_2))/(f(xn_1) - f(xn_2));
k = 0;
while (k < n)
k = k + 1;
xn_2 = xn_1;
xn_1 = xn;
xn = (xn_2*f(xn_1) - xn_1*f(xn_2))/(f(xn_1) - f(xn_2));
end
y = xn;
end
I believe the method works for small values of n, but even something like n = 9 produces NaN. My guess is that the quantity f(xn_1) - f(xn_2) is approximately zero, which causes this error. How can I prevent this?
Examples:
Input 1
eqn = @(x)(x^2 + x - 9)
secantmeth(eqn,2,3,5)
Input 2
eqn = @(x)(x^2 + x - 9)
secantmeth(eqn, 2, 3, 9)
Output 1
2.7321
Output 2
NaN
The value for xn will be NaN when xn_2 and xn_1 are exactly equal, which results in a 0/0 condition. You need an additional check in your while loop condition to see if xn_1 and xn are equal (or, better yet, within some small tolerance of one another), which suggests that the loop has converged on a solution and can't iterate any further:
...
while (k < n) && (xn_1 ~= xn)
k = k + 1;
xn_2 = xn_1;
xn_1 = xn;
xn = (xn_2*f(xn_1) - xn_1*f(xn_2))/(f(xn_1) - f(xn_2));
end
...
As Ander mentions in a comment, you could then continue with a different method after your while loop if you want to try and get a more accurate approximation:
...
if (xn_1 == xn) % Previous loop couldn't iterate any further
% Try some new method
end
...
And again, I would suggest reading through this question to understand some of the pitfalls of floating-point comparison (i.e. == and ~= aren't usually the best operators to use for floating-point numbers).
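The tolerance-based variant suggested above can be sketched as follows. This is a minimal Python illustration, not the asker's MATLAB function: the name secant, the relative tolerance tol, and its default value are assumptions made for the example. The loop stops early when the last two iterates agree to within the tolerance or when the denominator is exactly zero, so the 0/0 division is never evaluated.

```python
def secant(f, x0, x1, n, tol=1e-12):
    """Run at most n secant steps from x0, x1; stop early once converged."""
    for _ in range(n):
        f0, f1 = f(x0), f(x1)
        denom = f1 - f0
        # Converged (iterates agree to tolerance) or division would be 0/0.
        if abs(x1 - x0) <= tol * max(1.0, abs(x1)) or denom == 0.0:
            break
        # Same update as in the question: (x0*f(x1) - x1*f(x0)) / (f(x1) - f(x0))
        x0, x1 = x1, (x0 * f1 - x1 * f0) / denom
    return x1


root = secant(lambda x: x * x + x - 9, 2.0, 3.0, 9)
```

With the inputs from the question, the result stays finite for n = 9 and approximates the positive root (-1 + sqrt(37))/2 of x^2 + x - 9.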

matlab/octave - Generalized matrix multiplication

I would like to do a function to generalize matrix multiplication. Basically, it should be able to do the standard matrix multiplication, but it should allow to change the two binary operators product/sum by any other function.
The goal is to be as efficient as possible, both in terms of CPU and memory. Of course, it will always be less efficient than A*B, but the operators flexibility is the point here.
Here are a few commands I could come up after reading various interesting threads:
A = randi(10, 2, 3);
B = randi(10, 3, 4);
% 1st method
C = sum(bsxfun(@mtimes, permute(A,[1 3 2]),permute(B,[3 2 1])), 3)
% Alternative: C = bsxfun(@(a,b) mtimes(a',b), A', permute(B, [1 3 2]))
% 2nd method
C = sum(bsxfun(@(a,b) a*b, permute(A,[1 3 2]),permute(B,[3 2 1])), 3)
% 3rd method (Octave-only)
C = sum(permute(A, [1 3 2]) .* permute(B, [3 2 1]), 3)
% 4th method (Octave-only): multiply nxm A with nx1xd B to create a nxmxd array
C = bsxfun(@(a, b) sum(times(a,b)), A', permute(B, [1 3 2]));
C = C2 = squeeze(C(1,:,:)); % sum and turn into mxd
The problem with methods 1-3 is that they generate n matrices before collapsing them using sum(). Method 4 is better because it does the sum() inside the bsxfun, but bsxfun still generates n matrices (except that they are mostly empty, containing only a vector of non-zero values that are the sums; the rest is filled with 0 to match the dimension requirements).
What I would like is something like the 4th method but without the useless 0 to spare memory.
Any idea?
Here is a slightly more polished version of the solution you posted, with some small improvements.
We check if we have more rows than columns or the other way around, and then do the multiplication accordingly by choosing either to multiply rows with matrices or matrices with columns (thus doing the least amount of loop iterations).
Note: This may not always be the best strategy (going by rows instead of by columns) even if there are fewer rows than columns; the fact that MATLAB arrays are stored in column-major order in memory makes it more efficient to slice by columns, as the elements are stored consecutively, whereas accessing rows involves traversing elements by strides (which is not cache-friendly; think spatial locality).
Other than that, the code should handle double/single, real/complex, full/sparse (and errors where it is not a possible combination). It also respects empty matrices and zero-dimensions.
function C = my_mtimes(A, B, outFcn, inFcn)
% default arguments
if nargin < 4, inFcn = @times; end
if nargin < 3, outFcn = @sum; end
% check valid input
assert(ismatrix(A) && ismatrix(B), 'Inputs must be 2D matrices.');
assert(isequal(size(A,2),size(B,1)),'Inner matrix dimensions must agree.');
assert(isa(inFcn,'function_handle') && isa(outFcn,'function_handle'), ...
'Expecting function handles.')
% preallocate output matrix
M = size(A,1);
N = size(B,2);
if issparse(A)
args = {'like',A};
elseif issparse(B)
args = {'like',B};
else
args = {superiorfloat(A,B)};
end
C = zeros(M,N, args{:});
% compute matrix multiplication
% http://en.wikipedia.org/wiki/Matrix_multiplication#Inner_product
if M < N
% concatenation of products of row vectors with matrices
% A*B = [a_1*B ; a_2*B ; ... ; a_m*B]
for m=1:M
%C(m,:) = A(m,:) * B;
%C(m,:) = sum(bsxfun(@times, A(m,:)', B), 1);
C(m,:) = outFcn(bsxfun(inFcn, A(m,:)', B), 1);
end
else
% concatenation of products of matrices with column vectors
% A*B = [A*b_1 , A*b_2 , ... , A*b_n]
for n=1:N
%C(:,n) = A * B(:,n);
%C(:,n) = sum(bsxfun(@times, A, B(:,n)'), 2);
C(:,n) = outFcn(bsxfun(inFcn, A, B(:,n)'), 2);
end
end
end
Comparison
The function is no doubt slower throughout, but for larger sizes it is orders of magnitude worse than the built-in matrix-multiplication:
(tic/toc times in seconds)
(tested in R2014a on Windows 8)
size    mtimes       my_mtimes
____    _________    _________
 400    0.0026398    0.20282
 600    0.012039     0.68471
 800    0.014571     1.6922
1000    0.026645     3.5107
2000    0.20204      28.76
4000    1.5578       221.51
Here is the test code:
sz = [10:10:100 200:200:1000 2000 4000];
t = zeros(numel(sz),2);
for i=1:numel(sz)
n = sz(i); disp(n)
A = rand(n,n);
B = rand(n,n);
tic
C = A*B;
t(i,1) = toc;
tic
D = my_mtimes(A,B);
t(i,2) = toc;
assert(norm(C-D) < 1e-6)
clear A B C D
end
semilogy(sz, t*1000, '.-')
legend({'mtimes','my_mtimes'}, 'Interpreter','none', 'Location','NorthWest')
xlabel('Size N'), ylabel('Time [msec]'), title('Matrix Multiplication')
axis tight
Extra
For completeness, below are two more naive ways to implement the generalized matrix multiplication (if you want to compare the performance, replace the last part of the my_mtimes function with either of these). I'm not even gonna bother posting their elapsed times :)
C = zeros(M,N, args{:});
for m=1:M
for n=1:N
%C(m,n) = A(m,:) * B(:,n);
%C(m,n) = sum(bsxfun(@times, A(m,:)', B(:,n)));
C(m,n) = outFcn(bsxfun(inFcn, A(m,:)', B(:,n)));
end
end
And another way (with a triple-loop):
C = zeros(M,N, args{:});
P = size(A,2); % = size(B,1);
for m=1:M
for n=1:N
for p=1:P
%C(m,n) = C(m,n) + A(m,p)*B(p,n);
%C(m,n) = plus(C(m,n), times(A(m,p),B(p,n)));
C(m,n) = outFcn([C(m,n) inFcn(A(m,p),B(p,n))]);
end
end
end
What to try next?
If you want to squeeze out more performance, you're gonna have to move to a C/C++ MEX-file to cut down on the overhead of interpreted MATLAB code. You can still take advantage of optimized BLAS/LAPACK routines by calling them from MEX-files (see the second part of this post for an example). MATLAB ships with Intel MKL library which frankly you cannot beat when it comes to linear algebra computations on Intel processors.
Others have already mentioned a couple of submissions on the File Exchange that implement general-purpose matrix routines as MEX-files (see @natan's answer). Those are especially effective if you link them against an optimized BLAS library.
Why not just exploit bsxfun's ability to accept an arbitrary function?
C = shiftdim(feval(f, (bsxfun(g, A.', permute(B,[1 3 2])))), 1);
Here
f is the outer function (corresponding to sum in the matrix-multiplication case). It should accept a 3D array of arbitrary size mxnxp and operate along its columns to return a 1xmxp array.
g is the inner function (corresponding to product in the matrix-multiplication case). As per bsxfun, it should accept as input either two column vectors of the same size, or one column vector and one scalar, and return as output a column vector of the same size as the input(s).
This works in Matlab. I haven't tested in Octave.
Example 1: Matrix-multiplication:
>> f = @sum; %// outer function: sum
>> g = @times; %// inner function: product
>> A = [1 2 3; 4 5 6];
>> B = [10 11; -12 -13; 14 15];
>> C = shiftdim(feval(f, (bsxfun(g, A.', permute(B,[1 3 2])))), 1)
C =
28 30
64 69
Check:
>> A*B
ans =
28 30
64 69
Example 2: Consider the above two matrices with
>> f = @(x,y) sum(abs(x)); %// outer function: sum of absolute values
>> g = @(x,y) max(x./y, y./x); %// inner function: "symmetric" ratio
>> C = shiftdim(feval(f, (bsxfun(g, A.', permute(B,[1 3 2])))), 1)
C =
14.8333 16.1538
5.2500 5.6346
Check: manually compute C(1,2):
>> sum(abs( max( (A(1,:))./(B(:,2)).', (B(:,2)).'./(A(1,:)) ) ))
ans =
16.1538
Without diving into the details, there are tools such as mtimesx and MMX that are fast general purpose matrix and scalar operations routines. You can look into their code and adapt them to your needs.
It would most likely be faster than matlab's bsxfun.
After examining several processing functions like bsxfun, it seems it won't be possible to do a direct matrix multiplication using these (by direct I mean that the temporary products are not stored in memory but summed ASAP before the other sum-products are processed), because they have a fixed-size output (either the same size as the input or, with bsxfun singleton expansion, the cartesian product of the dimensions of the two inputs). It is, however, possible to trick Octave a bit (this does not work in MATLAB, which checks the output dimensions):
C = bsxfun(@(a,b) sum(bsxfun(@times, a, B))', A', sparse(1, size(A,1)))
C = bsxfun(@(a,b) sum(bsxfun(@times, a, B))', A', zeros(1, size(A,1), 2))(:,:,2)
However do not use them because the outputted values are not reliable (Octave can mangle or even delete them and return 0!).
So for now I am just implementing a semi-vectorized version; here's my function:
function C = genmtimes(A, B, outop, inop)
% C = genmtimes(A, B, outop, inop)
% Generalized matrix multiplication between A and B. By default, standard sum-of-products matrix multiplication is operated, but you can change the two operators (inop being the element-wise product and outop the sum).
% Speed note: about 100-200x slower than A*A' and about 3x slower when A is sparse, so use this function only if you want to use a different set of inop/outop than the standard matrix multiplication.
if ~exist('inop', 'var')
inop = @times;
end
if ~exist('outop', 'var')
outop = @sum;
end
[n, m] = size(A);
[m2, o] = size(B);
if m2 ~= m
error('nonconformant arguments (op1 is %ix%i, op2 is %ix%i)\n', n, m, m2, o);
end
C = [];
if issparse(A) || issparse(B)
C = sparse(o,n);
else
C = zeros(o,n);
end
A = A';
for i=1:n
C(:,i) = outop(bsxfun(inop, A(:,i), B))';
end
C = C';
end
Tested with both sparse and normal matrices: the performance gap is a lot less with sparse matrices (3x slower) than with normal matrices (~100x slower).
I think this is slower than bsxfun implementations, but at least it doesn't overflow memory:
A = randi(10, 1000);
C = genmtimes(A, A');
If anyone has any better to offer, I'm still looking for a better alternative!
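The "direct" evaluation described above, where each inner result is folded into the accumulator as soon as it is produced instead of being stored in a temporary array, can be sketched outside MATLAB/Octave. The following Python sketch is an illustration under stated assumptions: gen_matmul is a hypothetical name, matrices are lists of lists, and, unlike outop above, the outer operator here is a binary function (such as addition or min) rather than a reduction over a whole vector.

```python
from functools import reduce
import operator


def gen_matmul(A, B, inner=operator.mul, outer=operator.add):
    """Generalized matrix product: C[i][j] = outer-fold over k of inner(A[i][k], B[k][j]).

    map() is lazy, so reduce() consumes each inner result immediately and
    no intermediate vector of products is materialized.
    Assumes the inner dimension is at least 1 (reduce needs one element).
    """
    n_inner = len(B)
    if any(len(row) != n_inner for row in A):
        raise ValueError("inner dimensions must agree")
    cols = list(zip(*B))  # columns of B
    return [[reduce(outer, map(inner, row, col)) for col in cols]
            for row in A]


A = [[1, 2, 3], [4, 5, 6]]
B = [[10, 11], [-12, -13], [14, 15]]
C = gen_matmul(A, B)                            # ordinary sum of products
D = gen_matmul([[0, 3], [2, 0]], [[0, 3], [2, 0]],
               inner=operator.add, outer=min)   # min-plus product
```

The default operators reproduce the ordinary product from Example 1 above. Swapping in addition and min gives a min-plus product, and and_/or_ from the operator module give a boolean product.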

Improving on the efficiency of randsample in MATLAB for a Markov chain simulation.

I am using matlab to simulate an accumulation process with several random walks that accumulate towards threshold in parallel. To select which random walk will increase at time t, randsample is used. If the vector V represents the active random walks and vector P represents the probability with which each random walk should be selected then the call to randsample looks like this:
randsample(V, 1, true, P);
The problem is that the simulations are slow, and randsample is the bottleneck. Approximately 80% of the runtime is dedicated to resolving the randsample call.
Is there a relatively straightforward way to improve upon the efficiency of randsample? Are there other alternatives that might improve the speed?
As I mentioned in the comments, the bottleneck is probably caused by the fact that you are sampling one value at a time; it would be faster to vectorize the randsample call (of course, I am assuming that the probability vector is constant).
Here is a quick benchmark:
function testRandSample()
v = 1:5;
w = rand(numel(v),1); w = w ./ sum(w);
n = 50000;
% timeit
t(1) = timeit(@() func1(v, w, n));
t(2) = timeit(@() func2(v, w, n));
t(3) = timeit(@() func3(v, w, n));
disp(t)
% check distribution of samples (should be close to w)
tabulate(func1(v, w, n))
tabulate(func2(v, w, n))
tabulate(func3(v, w, n))
disp(w*100)
end
function s = func1(v, w, n)
s = randsample(v, n, true, w);
end
function s = func2(v, w, n)
[~,idx] = histc(rand(n,1), [0;cumsum(w(:))./sum(w)]);
s = v(idx);
end
function s = func3(v, w, n)
cw = cumsum(w) / sum(w);
s = zeros(n,1);
for i=1:n
s(i) = find(rand() <= cw, 1, 'first');
end
s = v(s);
%s = v(arrayfun(@(~)find(rand() <= cw, 1, 'first'), 1:n));
end
The output (annotated):
% measured elapsed times for func1/2/3 respectively
0.0016 0.0015 0.0790
% distribution of random sample from func1
Value Count Percent
1 4939 9.88%
2 15049 30.10%
3 7450 14.90%
4 11824 23.65%
5 10738 21.48%
% distribution of random sample from func2
Value Count Percent
1 4814 9.63%
2 15263 30.53%
3 7479 14.96%
4 11743 23.49%
5 10701 21.40%
% distribution of random sample from func3
Value Count Percent
1 4985 9.97%
2 15132 30.26%
3 7275 14.55%
4 11905 23.81%
5 10703 21.41%
% true population distribution
9.7959
30.4149
14.7414
23.4949
21.5529
As you can see, randsample is pretty well optimized. The bottleneck you observed in your code is probably due to a lack of vectorization, as I explained.
To see how slow it can get, replace func1 with a looped version sampling one value at-a-time:
function s = func1(v, w, n)
s = zeros(n,1);
for i=1:n
s(i) = randsample(v, 1, true, w);
end
end
Maybe this will be faster:
find(rand <= cumsum(P), 1) %// gives the same as randsample(V, 1, true, P)
I'm assuming P are probabilities, i.e. their sum is 1. Otherwise normalize P:
find(rand <= cumsum(P)/sum(P), 1) %// gives the same as randsample(V, 1, true, P)
If P is always the same, precompute cumsum(P)/sum(P) to save time:
cP = cumsum(P)/sum(P); %// precompute (just once)
find(rand <= cP, 1) %// gives the same as randsample(V, 1, true, P)
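The precompute-once idea can be sketched as a small closure. This is a minimal Python illustration under stated assumptions: make_sampler is a hypothetical name, and the linear find over the cumulative vector is swapped for a binary search with bisect, which is a different lookup than the one in the answer above but draws from the same distribution.

```python
import bisect
import itertools
import random


def make_sampler(values, weights, rng=random):
    """Return a zero-argument function that draws one value per call.

    The normalized cumulative weights are computed once, up front.
    """
    cum = list(itertools.accumulate(weights))
    total = cum[-1]
    cdf = [c / total for c in cum]  # last entry is exactly 1.0

    def sample():
        # rng.random() lies in [0, 1), so the index is always in range.
        return values[bisect.bisect_right(cdf, rng.random())]

    return sample


draw = make_sampler("abc", [0.0, 1.0, 0.0])
picks = [draw() for _ in range(1000)]
```

Note that the lookup yields an index into the value vector; indexing values with it is what makes the draw correspond to randsample(V, 1, true, P) when V is not simply 1:n.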

Factorization of an integer

While answering another question, I stumbled over the question of how I could actually find all factors of an integer without the Symbolic Math Toolbox.
For example:
factor(60)
returns:
2 2 3 5
unique(factor(60))
would therefore return all prime-factors, "1" missing.
2 3 5
And I'm looking for a function which would return all factors (1 and the number itself are not important, but they would be nice)
Intended output for x = 60:
1 2 3 4 5 6 10 12 15 20 30 60
I came up with this rather bulky solution. Apart from the fact that it could probably be vectorized, isn't there a more elegant solution?
x = 60;
P = perms(factor(x));
[n,m] = size(P);
Q = zeros(n,m);
for ii = 1:n
for jj = 1:m
Q(ii,jj) = prod(P(ii,1:jj));
end
end
factors = unique(Q(:))'
Also I think, this solution will fail for certain big numbers, because perms requires a vector length < 11.
You can find all factors of a number n by dividing it by a vector containing the integers 1 through n, then finding where the remainder after division by 1 is exactly zero (i.e., the integer results):
>> n = 60;
>> find(rem(n./(1:n), 1) == 0)
ans =
1 2 3 4 5 6 10 12 15 20 30 60
Here is a comparison of six different implementations for finding factors of an integer:
function [t,v] = testFactors()
% integer to factor
%{45, 60, 2059, 3135, 223092870, 3491888400};
n = 2*2*2*2*3*3*3*5*5*7*11*13*17*19;
% functions to compare
fcns = {
@() factors1(n);
@() factors2(n);
@() factors3(n);
@() factors4(n);
%@() factors5(n);
@() factors6(n);
};
% timeit
t = cellfun(@timeit, fcns);
% check results
v = cellfun(@feval, fcns, 'UniformOutput',false);
assert(isequal(v{:}));
end
function f = factors1(n)
% vectorized implementation of factors2()
f = find(rem(n, 1:floor(sqrt(n))) == 0);
f = unique([1, n, f, fix(n./f)]);
end
function f = factors2(n)
% factors come in pairs, the smaller of which is no bigger than sqrt(n)
f = [1, n];
for k=2:floor(sqrt(n))
if rem(n,k) == 0
f(end+1) = k;
f(end+1) = fix(n/k);
end
end
f = unique(f);
end
function f = factors3(n)
% Get prime factors, and compute products of all possible subsets of size>1
pf = factor(n);
f = arrayfun(@(k) prod(nchoosek(pf,k),2), 2:numel(pf), ...
'UniformOutput',false);
f = unique([1; pf(:); vertcat(f{:})])'; %'
end
function f = factors4(n)
% http://rosettacode.org/wiki/Factors_of_an_integer#MATLAB_.2F_Octave
pf = factor(n); % prime decomposition
K = dec2bin(0:2^length(pf)-1)-'0'; % all possible permutations
f = ones(1,2^length(pf));
for k=1:size(K,1)
f(k) = prod(pf(~K(k,:))); % compute products
end
f = unique(f); % eliminate duplicates
end
function f = factors5(n)
% @LuisMendo: brute-force implementation
f = find(rem(n, 1:n) == 0);
end
function f = factors6(n)
% Symbolic Math Toolbox
f = double(evalin(symengine, sprintf('numlib::divisors(%d)',n)));
end
The results:
>> [t,v] = testFactors();
>> t
t =
0.0019 % factors1()
0.0055 % factors2()
0.0102 % factors3()
0.0756 % factors4()
0.1314 % factors6()
>> numel(v{1})
ans =
1920
Although the first vectorized version is the fastest, the equivalent loop-based implementation (factors2) is not far behind, thanks to automatic JIT optimization.
Note that I had to disable the brute-force implementation (factors5()) because it throws an out-of-memory error (storing the vector 1:3491888400 in double-precision requires over 26GB of memory!). This method is obviously not feasible for large integers, neither space- nor time-wise.
Conclusion: use the following vectorized implementation :)
n = 3491888400;
f = find(rem(n, 1:floor(sqrt(n))) == 0);
f = unique([1, n, f, fix(n./f)]);
An improvement over @gnovice's answer is to skip the division operation: rem alone is enough:
n = 60;
find(rem(n, 1:n)==0)
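The pairing idea used by factors1 and factors2 above (every divisor k up to sqrt(n) has a partner n/k) can be sketched in Python as well. The function name divisors is an assumption for this illustration, and collecting the large partners in a second list is simply one way to get a sorted result without calling unique.

```python
import math


def divisors(n):
    """Return all positive divisors of n in ascending order."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    small, large = [], []
    for k in range(1, math.isqrt(n) + 1):
        if n % k == 0:
            small.append(k)
            if k != n // k:  # avoid listing sqrt(n) twice for perfect squares
                large.append(n // k)
    return small + large[::-1]


factors_60 = divisors(60)
```

Only about sqrt(n) remainders are computed, so the integer from the benchmark above (3491888400) needs roughly 59,000 trial divisions rather than a vector with billions of entries.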