Why does the matlab bootstrap procedure evaluate N+1 times? - matlab

I wanted to bootstrap with matlab using the built in command "bootstrp". What I noticed was that the procedure makes N+1 iterations when I am only asking for N iterations. Why is that? When I build a manual loop to do the bootstrapping, so that it really just runs N times, then it is faster. Here is a minimal example of the problem:
clear all
global iterationcounter
tic
iterationcounter=0;
data=unifrnd(0,1,1,1000); %draw vector of 1000 random numbers
bootstat = bootstrp(100,#testmean,data); %evaluate function for 100 bootstrap samples
toc
which uses the function
function [ m ] = testmean( data )
global iterationcounter
m=mean(data);
iterationcounter=iterationcounter+1
end
The function should evaluate 100 samples, yet when I run the script, it will evaluate the function 101 times:
...
iterationcounter =
101
Elapsed time is 0.102291 seconds.
So why should one use this build-in Matlab function that appears to waste time?

bootstrp makes a call to bootfun (the function argument) for sanity checks (from the source code, in MATLAB 2015b, bootstrp.m, l.167 ff) :
% Sanity check bootfun call and determine dimension and type of result
try
% Get result of bootfun on actual data, force to a row.
bootstat = feval(bootfun,bootargs{:});
bootstat = bootstat(:)';
catch ME
m = message('stats:bootstrp:BadBootFun');
MEboot = MException(m.Identifier,'%s',getString(m));
ME = addCause(ME,MEboot);
rethrow(ME);
end
I would think that in a realistic application, N>>100, so the extra overhead is (much) less than a percent of he total runtime (not taking into account speed gains from possible parallelisation), so that should not matter that much?

Related

why does a*b*a take longer than (a'*(a*b)')' when using gpuArray in Matlab scripts?

The code below performs the operation the same operation on gpuArrays a and b in two different ways. The first part computes (a'*(a*b)')' , while the second part computes a*b*a. The results are then verified to be the same.
%function test
clear
rng('default');rng(1);
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c1=gather(transpose(transpose(a)*transpose(a*b)));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c2=gather(a*b*a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%end
However, computing (a'*(a*b)')' is roughly 4 times faster than computing a*b*a. Here is the output of the above script in R2018a on an Nvidia K20 (I've tried different versions and different GPUs with the similar behaviour).
>> test
time for (a'*(a*b)')': 0.43234s
time for a*b*a: 1.7175s
error = 2.0009e-11
Even more strangely, if the first and last lines of the above script are uncommented (to turn it into a function), then both take the longer amount of time (~1.7s instead of ~0.4s). Below is the output for this case:
>> test
time for (a'*(a*b)')': 1.717s
time for a*b*a: 1.7153s
error = 1.0914e-11
I'd like to know what is causing this behaviour, and how to perform a*b*a or (a'*(a*b)')' or both in the shorter amount of time (i.e. ~0.4s rather than ~1.7s) inside a matlab function rather than inside a script.
There seem to be an issue with multiplication of two sparse matrices on GPU. time for sparse by full matrix is more than 1000 times faster than sparse by sparse. A simple example:
str={'sparse*sparse','sparse*full'};
for ii=1:2
rng(1);
a=sprand(3000,3000,0.1);
b=sprand(3000,3000,0.1);
if ii==2
b=full(b);
end
a=gpuArray(a);
b=gpuArray(b);
tic
c=a*b;
disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
In your context, it is the last multiplication which does it. to demonstrate I replace a with a duplicate c, and multiply by it twice, once as sparse and once as full matrix.
str={'a*b*a','a*b*full(a)'};
for ii=1:2
%rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
rng(1)
c=sprand(3000,3000,0.1);
if ii==2
c=full(c);
end
a=gpuArray(a);
b=gpuArray(b);
c=gpuArray(c);
tic;
c1{ii}=a*b*c;
disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
disp(['error = ',num2str(max(max(abs(c1{1}-c1{2}))))])
I may be wrong, but my conclusion is that a * b * a involves multiplication of two sparse matrices (a and a again) and is not treated well, while using transpose() approach divides the process to two stage multiplication, in none of which there are two sparse matrices.
I got in touch with Mathworks tech support and Rylan finally shed some light on this issue. (Thanks Rylan!) His full response is below. The function vs script issue appears to be related to certain optimizations matlab applies automatically to functions (but not scripts) not working as expected.
Rylan's response:
Thank you for your patience on this issue. I have consulted with the MATLAB GPU computing developers to understand this better.
This issue is caused by internal optimizations done by MATLAB when encountering some specific operations like matrix-matrix multiplication and transpose. Some of these optimizations may be enabled specifically when executing a MATLAB function (or anonymous function) rather than a script.
When your initial code was being executed from a script, a particular matrix transpose optimization is not performed, which results in the 'res2' expression being faster than the 'res1' expression:
n = 2000;
a=gpuArray(sprand(n,n,0.01));
b=gpuArray(rand(n));
tic;res1=a*b*a;wait(gpuDevice);toc % Elapsed time is 0.884099 seconds.
tic;res2=transpose(transpose(a)*transpose(a*b));wait(gpuDevice);toc % Elapsed time is 0.068855 seconds.
However when the above code is placed in a MATLAB function file, an additional matrix transpose-times optimization is done which causes the 'res2' expression to go through a different code path (and different CUDA library function call) compared to the same line being called from a script. Therefore this optimization generates slower results for the 'res2' line when called from a function file.
To avoid this issue from occurring in a function file, the transpose and multiply operations would need to be split in a manner that stops MATLAB from applying this optimization. Separating each clause within the 'res2' statement seems to be sufficient for this:
tic;i1=transpose(a);i2=transpose(a*b);res3=transpose(i1*i2);wait(gpuDevice);toc % Elapsed time is 0.066446 seconds.
In the above line, 'res3' is being generated from two intermediate matrices: 'i1' and 'i2'. The performance (on my system) seems to be on par with that of the 'res2' expression when executed from a script; in addition the 'res3' expression also shows similar performance when executed from a MATLAB function file. Note however that additional memory may be used to store the transposed copy of the initial array. Please let me know if you see different performance behavior on your system, and I can investigate this further.
Additionally, the 'res3' operation shows faster performance when measured with the 'gputimeit' function too. Please refer to the attached 'testscript2.m' file for more information on this. I have also attached 'test_v2.m' which is a modification of the 'test.m' function in your Stack Overflow post.
Thank you for reporting this issue to me. I would like to apologize for any inconvenience caused by this issue. I have created an internal bug report to notify the MATLAB developers about this behavior. They may provide a fix for this in a future release of MATLAB.
Since you had an additional question about comparing the performance of GPU code using 'gputimeit' vs. using 'tic' and 'toc', I just wanted to provide one suggestion which the MATLAB GPU computing developers had mentioned earlier. It is generally good to also call 'wait(gpuDevice)' before the 'tic' statements to ensure that GPU operations from the previous lines don't overlap in the measurement for the next line. For example, in the following lines:
b=gpuArray(rand(n));
tic; res1=a*b*a; wait(gpuDevice); toc
if the 'wait(gpuDevice)' is not called before the 'tic', some of the time taken to construct the 'b' array from the previous line may overlap and get counted in the time taken to execute the 'res1' expression. This would be preferred instead:
b=gpuArray(rand(n));
wait(gpuDevice); tic; res1=a*b*a; wait(gpuDevice); toc
Apart from this, I am not seeing any specific issues in the way that you are using the 'tic' and 'toc' functions. However note that using 'gputimeit' is generally recommended over using 'tic' and 'toc' directly for GPU-related profiling.
I will go ahead and close this case for now, but please let me know if you have any further questions about this.
%testscript2.m
n = 2000;
a = gpuArray(sprand(n, n, 0.01));
b = gpuArray(rand(n));
gputimeit(#()transpose_mult_fun(a, b))
gputimeit(#()transpose_mult_fun_2(a, b))
function out = transpose_mult_fun(in1, in2)
i1 = transpose(in1);
i2 = transpose(in1*in2);
out = transpose(i1*i2);
end
function out = transpose_mult_fun_2(in1, in2)
out = transpose(transpose(in1)*transpose(in1*in2));
end
.
function test_v2
clear
%% transposed expression
n = 2000;
rng('default');rng(1);
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c1 = gather(transpose( transpose(a) * transpose(a * b) ));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
%% non-transposed expression
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c2 = gather(a * b * a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%% sliced equivalent
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
intermediate1 = transpose(a);
intermediate2 = transpose(a * b);
c3 = gather(transpose( intermediate1 * intermediate2 ));
disp(['time for split equivalent: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c3))))])
end
EDIT 2 I might have been right, see this other answer
EDIT: They use MAGMA, which is column major. My answer does not hold, however I will leave it here for a while in case it can help crack this strange behavior.
The below answer is wrong
This is my guess, I can not 100% tell you without knowing the code under MATLAB's hood.
Hypothesis: MATLABs parallel computing code uses CUDA libraries, not their own.
Important information
MATLAB is column major and CUDA is row major.
There is no such things as 2D matrices, only 1D matrices with 2 indices
Why does this matter? Well because CUDA is highly optimized code that uses memory structure to maximize cache hits per kernel (the slowest operation on GPUs is reading memory). This means a standard CUDA matrix multiplication code will exploit the order of memory reads to make sure they are adjacent. However, what is adjacent memory in row-major is not in column-major.
So, there are 2 solutions to this as someone writing software
Write your own column-major algebra libraries in CUDA
Take every input/output from MATLAB and transpose it (i.e. convert from column-major to row major)
They have done point 2, and assuming that there is a smart JIT compiler for MATLAB parallel processing toolbox (reasonable assumption), for the second case, it takes a and b, transposes them, does the maths, and transposes the output when you gather.
In the first case however, you already do not need to transpose the output, as it is internally already transposed and the JIT catches this, so instead of calling gather(transpose( XX )) it just skips the output transposition is side. The same with transpose(a*b). Note that transpose(a*b)=transpose(b)*transpose(a), so suddenly no transposes are needed (they are all internally skipped). A transposition is a costly operation.
Indeed there is a weird thing here: making the code a function suddenly makes it slow. My best guess is that because the JIT behaves differently in different situations, it doesn't catch all this transpose stuff inside and just does all the operations anyway, losing the speed up.
Interesting observation: It takes the same time in CPU than GPU to do a*b*a in my PC.

Matlab break iteration if line execution requires too much time

I run a Matlab code with an iteration loop, and during each iteration it sample random numbers and uses the function intlinprog. My issue is that, due to the large amount of data I provide to the intlinprog function and to the stochastic values I assign to part of its variables, some of the iterations take a really long time.
My code is more or less like this:
rounds = 1E3;
Total_PF = zeros(rounds,4893);
for i=1:rounds
i
cT = zeros (4894,1);
cT(4894,1) = 1;
xint = linspace(1,4893,4893);
xint = xint';
AT = rand(4,4894);
beT = ones(4,1);
lb = zeros(4894,1);
ub = ones (4894,1);
ub(4894,1) = Inf;
[x] = intlinprog(cT,xint,AT,beT,[],[],lb,ub);
Total_PF(i,:)= (x(1:length(x)-1)');
end
Now in the minimal working example I provided, all the iterations are quite fast, but in my real code, sometimes intlinprog takes really long time ( I mean hours) to do a single iteration.
Therefore, I was wondering: is there a way to break the intlinprog while the intlinprog line is being executed? I was thinking that it may be done by modifying the matlab function but first of all I do not know if I am allowed to do it, secondly I am afraid that may be very dangerous.
This is very difficult to do effectively.
You could try using a timer object to watch the value. However, since you are using inherent Matlab functions and not executing external functions, you could set a time limit value and utilize tic and toc in a while loop while you execute the intlinprog and checking the value of toc against your time limit and break your code if toc exceeds the limit.

Improve code / remove for-loop when using accumarray MATLAB

I have the following piece of code that is quite slow to compute the percentiles from a data set ("DATA"), because the input matrices are large ("Data" is approx. 500.000 long with 10080 unique values assigned from "Indices").
Is there a possibility/suggestions to make this piece of code more efficient? For example, could I somehow omit the for-loop?
k = 1;
for i = 0:0.5:100; % in 0.5 fractile-steps
FRACTILE(:,k) = accumarray(Indices,Data,[], #(x) prctile(x,i));
k = k+1;
end
Calling prctile again and again with the same data is causing your performance issues. Call it once for each data set:
FRACTILE=cell2mat(accumarray(Indices,Data,[], #(x) {prctile(x,[0:0.5:100])}));
Letting prctile evaluate your 201 percentiles in one call costs roughly as much computation time as two iterations of your original code. First because prctile is faster this way and secondly because accumarray is called only once now.

How do I run the Initial value x0 also in parallel

This code works fine but the plot is not correct because the optimization function fmincon will depend on the initial condition x0 and the number of iterations. For each value of alpha (a) and beta (b), I should run the optimization many times with different initial conditions x0 to verify that I am getting the right answer. More iterations might be required to get an accurate answer.
I want to be able to run the optimization with different initial conditions for x0, a and b.
Function file
function f = threestate2(x,a,b)
c1 = cos(x(1))*(cos(x(5))*(cos(x(9))+cos(x(11)))+cos(x(7))*(cos(x(9))-cos(x(11))))...
+cos(x(3))*(cos(x(5))*(cos(x(9))-cos(x(11)))-cos(x(7))*(cos(x(9))+cos(x(11))));
c2=sin(x(1))*(sin(x(5))*(sin(x(9))*cos(x(2)+x(6)+x(10))+sin(x(11))*cos(x(2)+x(6)+x(12)))...
+sin(x(7))*(sin(x(9))*cos(x(2)+x(8)+x(10))-sin(x(11))*cos(x(2)+x(8)+x(12))))...
+sin(x(3))*(sin(x(5))*(sin(x(9))*cos(x(4)+x(6)+x(10))-sin(x(11))*cos(x(4)+x(6)+x(12)))...
-sin(x(7))*(sin(x(9))*cos(x(4)+x(8)+x(10))+sin(x(11))*cos(x(4)+x(8)+x(12))));
f=(a*a-b*b)*c1+2*a*b*c2;
Main file
%x=[x(1),x(2),x(3),x(4),x(5),x(6),x(7),x(8),x(9),x(10),x(11),x(12)]; % angles;
lb=[0,0,0,0,0,0,0,0,0,0,0,0];
ub=[pi,2*pi,pi,2*pi,pi,2*pi,pi,2*pi,pi,2*pi,pi,2*pi];
x0=[pi/8;0;pi/3;0;0.7*pi;.6;0;pi/2;.5;0;pi/4;0];
xout=[];
fout=[];
options = optimoptions(#fmincon,'Algorithm','interior-point','TolX',10^-10,'MaxIter',1500);
a=0:0.01:1;
w=NaN(length(a));
for i=1:length(a)
bhelp=(1-a(i)*a(i));
if bhelp>0
b=sqrt(bhelp);
[x,fval]=fmincon(#(x)threestate2(x,a(i),b),x0,[],[],[],[],lb,ub,[],options);
w(i)=fval;
w(i)=-w(i);
B(i)=b;
else
w(i)=NaN;
B(i)=b;
end
end
%surface(b,a,w)
%view(3)
%meshc(b,a,w)
x=a.^2;
plot(x,w)
grid on
ylabel('\fontname{Times New Roman} S_{max}(\Psi_{gs})')
xlabel('\fontname{Times New Roman}\alpha^2')
%ylabel('\fontname{Times New Roman}\beta')
title('\fontname{Times New Roman} Maximum of the Svetlichny operator(\alpha|000>+\beta|111>)')
If you have the Matlab Parallel Toolbox, you can use the parfor which is like a regular loop but runs in parallel.
To use it you should make your all big messy script in a function. Assuming you store your initial conditions in A(i) and you store the results in B(i) you can use something like that:
parfor i=1:length(B)
B(i)=optimise(A(i));
end
If you don't have the toolbox there are some other ways (for example MEX files) but you basically have to manage the threads yourself so I wouldn't recommend it.

Iteration for convergence in Matlab without using a while loop

I have to iterate a process where I have an initial guess for the Mach number (M0). This initial guess will give me another guess for the Mach number by using two equations (Mn). Eventually, i want to iterate this process untill the error between M0 and Mn is small. I have the following piece of code and it actually works well with a while loop.
However, I am afraid that the while loop will take many iterations and computational time for certain inputs since this will be part of a bigger code which most likely will give unfeasible inputs for the while loop.
Therefore my question is the following. How can I iterate this process within Matlab without consulting a while loop? The code that I am implementing now is the following:
%% Input
gamma = 1.4;
theta = atan(0.315);
cpi = -0.732;
%% Loop
M0 = 0.2; %initial guess
Err = 100;
iterations = 0;
while Err > 0.5E-3
B = (1-(M0^2)*(1-M0*cpi))^0.5;
Mn = (((gamma+1)/2) * ((B+((1-cpi)^0.5)*sec(theta)-1)^2/(B^2 + (tan(theta))^2)) - ((gamma-1)/2) )^-0.5;
Err = abs(M0 - Mn);
M0 = Mn;
iterations=iterations+1;
end
disp(iterations) disp(Mn)
Many thanks
Since M0 is calculated in each iteration and you have trigonometric functions, you cannot use another way than iteration structures (i.e. while).
If you had a specific increase or decrease at M0, then you could initialize a vector of M0 and do vector calculations for B and Err.
But, with sec and tan this is not possible.
Another wat would be to use the parallel processing. But, since you change the M0 at each iteration then you cannot use the parfor loop.
As for a for loop, in MATLAB you need an array for for "command" argument (e.g. 1:10 or 1:length(x) or i = A, where A = 1:10 or A = [1:10;11:20]). Since you evaluate a condition and depending on the result of the evaluation you judge if you continue the execution or not, it seems that the while loop (or do while in another language) is the only way to go.
I think you need to clarify the issue. If it the issue you want to solve is that some inputs take a long time to calculate, it is not the while loop that takes the time, it is the execution of the code multiple times that causes it. Any method that loops through will be restricted by the time the block of code takes to execute multiplied by the number of iterations required to converge.
You can introduce something to stop at a certain number of iterationtions, conceptually:
While ((err > tolerance) && (numIterations < limit))
If you want an answer which does not require iterating over the code, this is akin to finding a closed form solution, and I suspect this does not exist.
Edit to add: by not exist I mean in a practical form which can be implemented in a more efficient way then iterating to a solution.