Best way to delete first row from matrix in matlab - matlab

I plan on deleting the first row of a matrix multiple times and was wondering what the best/most efficient way of doing this would be.
I know I can do something like this
M(1,:)=[]
or
M = M(2:end)
but I am not sure which is the best way or if there is another better way.

Hey just tested those 2 methods with tic and toc
This is the code I used:
A=rand(100,100000);
tic
a=A(2:end,:);
t1=toc
tic
A(1,:)=[];
t2=toc
and this is the result:
t1 =
0.0603
t2 =
0.0744
If you use longer columns it gets even more obvious:
A=rand(10000,100);
t1 =
0.0083
t2 =
0.0124
So saving the columns you want to keep seems to be faster.
Edit
It was commented that tic and toc are not "trustworthy" in the millisecond domain so it was recommended to use loops to run the code multiple times. But the result doesn't change.
A=rand(100,100000);
size_A=size(A);
tic
for k=1:1:100
A1=A;
A1=A1(2:end,:);
end
t1=toc
tic
for k=1:1:100
A1=A;
A1(1,:)=[];
end
t2=toc
this results in:
t1 =
7.5237
t2 =
15.2234

Generally it might be faster to keep what you want. Depending on the dimensions of your matrix however results may vary. Consider the following test case where two matrices are generated, A1, and B1, of dimension 100x100000 and 100000x100. The results are obtained from the profile viewer but tic toc measurements confirmed these results.
A1=rand(100,100000);
for ii=1:100
A=A1;
A=A(2:end,:);
end
for ii=1:100
A=A1;
A(1,:)=[];
end
B1=rand(100000,100);
for ii=1:100
B=B1;
B=B(2:end,:);
end
for ii=1:100
B=B1;
B(1,:)=[];
end
The results clearly show that the first case (keeping what you want on a matrix with lots of columns) is very slow actually.
There is no clear this or that is faster though. You should try to time for your situation!

Related

why does a*b*a take longer than (a'*(a*b)')' when using gpuArray in Matlab scripts?

The code below performs the operation the same operation on gpuArrays a and b in two different ways. The first part computes (a'*(a*b)')' , while the second part computes a*b*a. The results are then verified to be the same.
%function test
clear
rng('default');rng(1);
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c1=gather(transpose(transpose(a)*transpose(a*b)));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c2=gather(a*b*a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%end
However, computing (a'*(a*b)')' is roughly 4 times faster than computing a*b*a. Here is the output of the above script in R2018a on an Nvidia K20 (I've tried different versions and different GPUs with the similar behaviour).
>> test
time for (a'*(a*b)')': 0.43234s
time for a*b*a: 1.7175s
error = 2.0009e-11
Even more strangely, if the first and last lines of the above script are uncommented (to turn it into a function), then both take the longer amount of time (~1.7s instead of ~0.4s). Below is the output for this case:
>> test
time for (a'*(a*b)')': 1.717s
time for a*b*a: 1.7153s
error = 1.0914e-11
I'd like to know what is causing this behaviour, and how to perform a*b*a or (a'*(a*b)')' or both in the shorter amount of time (i.e. ~0.4s rather than ~1.7s) inside a matlab function rather than inside a script.
There seem to be an issue with multiplication of two sparse matrices on GPU. time for sparse by full matrix is more than 1000 times faster than sparse by sparse. A simple example:
str={'sparse*sparse','sparse*full'};
for ii=1:2
rng(1);
a=sprand(3000,3000,0.1);
b=sprand(3000,3000,0.1);
if ii==2
b=full(b);
end
a=gpuArray(a);
b=gpuArray(b);
tic
c=a*b;
disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
In your context, it is the last multiplication which does it. to demonstrate I replace a with a duplicate c, and multiply by it twice, once as sparse and once as full matrix.
str={'a*b*a','a*b*full(a)'};
for ii=1:2
%rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
rng(1)
c=sprand(3000,3000,0.1);
if ii==2
c=full(c);
end
a=gpuArray(a);
b=gpuArray(b);
c=gpuArray(c);
tic;
c1{ii}=a*b*c;
disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
disp(['error = ',num2str(max(max(abs(c1{1}-c1{2}))))])
I may be wrong, but my conclusion is that a * b * a involves multiplication of two sparse matrices (a and a again) and is not treated well, while using transpose() approach divides the process to two stage multiplication, in none of which there are two sparse matrices.
I got in touch with Mathworks tech support and Rylan finally shed some light on this issue. (Thanks Rylan!) His full response is below. The function vs script issue appears to be related to certain optimizations matlab applies automatically to functions (but not scripts) not working as expected.
Rylan's response:
Thank you for your patience on this issue. I have consulted with the MATLAB GPU computing developers to understand this better.
This issue is caused by internal optimizations done by MATLAB when encountering some specific operations like matrix-matrix multiplication and transpose. Some of these optimizations may be enabled specifically when executing a MATLAB function (or anonymous function) rather than a script.
When your initial code was being executed from a script, a particular matrix transpose optimization is not performed, which results in the 'res2' expression being faster than the 'res1' expression:
n = 2000;
a=gpuArray(sprand(n,n,0.01));
b=gpuArray(rand(n));
tic;res1=a*b*a;wait(gpuDevice);toc % Elapsed time is 0.884099 seconds.
tic;res2=transpose(transpose(a)*transpose(a*b));wait(gpuDevice);toc % Elapsed time is 0.068855 seconds.
However when the above code is placed in a MATLAB function file, an additional matrix transpose-times optimization is done which causes the 'res2' expression to go through a different code path (and different CUDA library function call) compared to the same line being called from a script. Therefore this optimization generates slower results for the 'res2' line when called from a function file.
To avoid this issue from occurring in a function file, the transpose and multiply operations would need to be split in a manner that stops MATLAB from applying this optimization. Separating each clause within the 'res2' statement seems to be sufficient for this:
tic;i1=transpose(a);i2=transpose(a*b);res3=transpose(i1*i2);wait(gpuDevice);toc % Elapsed time is 0.066446 seconds.
In the above line, 'res3' is being generated from two intermediate matrices: 'i1' and 'i2'. The performance (on my system) seems to be on par with that of the 'res2' expression when executed from a script; in addition the 'res3' expression also shows similar performance when executed from a MATLAB function file. Note however that additional memory may be used to store the transposed copy of the initial array. Please let me know if you see different performance behavior on your system, and I can investigate this further.
Additionally, the 'res3' operation shows faster performance when measured with the 'gputimeit' function too. Please refer to the attached 'testscript2.m' file for more information on this. I have also attached 'test_v2.m' which is a modification of the 'test.m' function in your Stack Overflow post.
Thank you for reporting this issue to me. I would like to apologize for any inconvenience caused by this issue. I have created an internal bug report to notify the MATLAB developers about this behavior. They may provide a fix for this in a future release of MATLAB.
Since you had an additional question about comparing the performance of GPU code using 'gputimeit' vs. using 'tic' and 'toc', I just wanted to provide one suggestion which the MATLAB GPU computing developers had mentioned earlier. It is generally good to also call 'wait(gpuDevice)' before the 'tic' statements to ensure that GPU operations from the previous lines don't overlap in the measurement for the next line. For example, in the following lines:
b=gpuArray(rand(n));
tic; res1=a*b*a; wait(gpuDevice); toc
if the 'wait(gpuDevice)' is not called before the 'tic', some of the time taken to construct the 'b' array from the previous line may overlap and get counted in the time taken to execute the 'res1' expression. This would be preferred instead:
b=gpuArray(rand(n));
wait(gpuDevice); tic; res1=a*b*a; wait(gpuDevice); toc
Apart from this, I am not seeing any specific issues in the way that you are using the 'tic' and 'toc' functions. However note that using 'gputimeit' is generally recommended over using 'tic' and 'toc' directly for GPU-related profiling.
I will go ahead and close this case for now, but please let me know if you have any further questions about this.
%testscript2.m
n = 2000;
a = gpuArray(sprand(n, n, 0.01));
b = gpuArray(rand(n));
gputimeit(#()transpose_mult_fun(a, b))
gputimeit(#()transpose_mult_fun_2(a, b))
function out = transpose_mult_fun(in1, in2)
i1 = transpose(in1);
i2 = transpose(in1*in2);
out = transpose(i1*i2);
end
function out = transpose_mult_fun_2(in1, in2)
out = transpose(transpose(in1)*transpose(in1*in2));
end
.
function test_v2
clear
%% transposed expression
n = 2000;
rng('default');rng(1);
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c1 = gather(transpose( transpose(a) * transpose(a * b) ));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
%% non-transposed expression
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c2 = gather(a * b * a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%% sliced equivalent
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
intermediate1 = transpose(a);
intermediate2 = transpose(a * b);
c3 = gather(transpose( intermediate1 * intermediate2 ));
disp(['time for split equivalent: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c3))))])
end
EDIT 2 I might have been right, see this other answer
EDIT: They use MAGMA, which is column major. My answer does not hold, however I will leave it here for a while in case it can help crack this strange behavior.
The below answer is wrong
This is my guess, I can not 100% tell you without knowing the code under MATLAB's hood.
Hypothesis: MATLABs parallel computing code uses CUDA libraries, not their own.
Important information
MATLAB is column major and CUDA is row major.
There is no such things as 2D matrices, only 1D matrices with 2 indices
Why does this matter? Well because CUDA is highly optimized code that uses memory structure to maximize cache hits per kernel (the slowest operation on GPUs is reading memory). This means a standard CUDA matrix multiplication code will exploit the order of memory reads to make sure they are adjacent. However, what is adjacent memory in row-major is not in column-major.
So, there are 2 solutions to this as someone writing software
Write your own column-major algebra libraries in CUDA
Take every input/output from MATLAB and transpose it (i.e. convert from column-major to row major)
They have done point 2, and assuming that there is a smart JIT compiler for MATLAB parallel processing toolbox (reasonable assumption), for the second case, it takes a and b, transposes them, does the maths, and transposes the output when you gather.
In the first case however, you already do not need to transpose the output, as it is internally already transposed and the JIT catches this, so instead of calling gather(transpose( XX )) it just skips the output transposition is side. The same with transpose(a*b). Note that transpose(a*b)=transpose(b)*transpose(a), so suddenly no transposes are needed (they are all internally skipped). A transposition is a costly operation.
Indeed there is a weird thing here: making the code a function suddenly makes it slow. My best guess is that because the JIT behaves differently in different situations, it doesn't catch all this transpose stuff inside and just does all the operations anyway, losing the speed up.
Interesting observation: It takes the same time in CPU than GPU to do a*b*a in my PC.

Improve code / remove for-loop when using accumarray MATLAB

I have the following piece of code that is quite slow to compute the percentiles from a data set ("DATA"), because the input matrices are large ("Data" is approx. 500.000 long with 10080 unique values assigned from "Indices").
Is there a possibility/suggestions to make this piece of code more efficient? For example, could I somehow omit the for-loop?
k = 1;
for i = 0:0.5:100; % in 0.5 fractile-steps
FRACTILE(:,k) = accumarray(Indices,Data,[], #(x) prctile(x,i));
k = k+1;
end
Calling prctile again and again with the same data is causing your performance issues. Call it once for each data set:
FRACTILE=cell2mat(accumarray(Indices,Data,[], #(x) {prctile(x,[0:0.5:100])}));
Letting prctile evaluate your 201 percentiles in one call costs roughly as much computation time as two iterations of your original code. First because prctile is faster this way and secondly because accumarray is called only once now.

Permuting a vector efficiently in matlab

I want to make 1000 random permutations of a vector in matlab. I do it like this
% vector is A
num_A = length(A);
for i=1:1000
n = randperm(num_A);
A = A(n); % This is one permutation
end
This takes like 73 seconds. Is there any way to do it more efficiently?
Problem 1 - Overwriting the original vector inside loop
Each time A = A(n); will overwrite A, the input vector, with a new permutation. This might be reasonable since anyway you don't need the order but all the elements in A. However, it's extremely inefficient because you have to re-write a million-element array in every iteration.
Solution: Store the permutation into a new variable -
B(ii, :) = A(n);
Problem 2 - Using i as iterator
We at Stackoverflow are always telling serious Matlab users that using i and j as interators in loops is absolutely a bad idea. Check this answer to see why it makes your code slow, and check other answers in that page for why it's bad.
Solution - use ii instead of i.
Problem 3 - Using unneccessary for loop
Actually you can avoid this for loop at all since the iterations are not related to each other, and it will be faster if you allow Matlab do parallel computing.
Solution - use arrayfun to generate 1000 results at once.
Final solution
Use arrayfun to generate 1000 x num_A indices. I think (didn't confirm) it's faster than directly accessing A.
n = cell2mat(arrayfun(#(x) randperm(num_A), 1:1000', 'UniformOutput', false)');
Then store all 1000 permutations at once, into a new variable.
B = A(n);
I found this code pretty attractive. You can replace randperm with Shuffle. Example code -
B = Shuffle(repmat(A, 1000, 1), 2);
A = perms(num_A)
A = A(1:1000)
Perms returns all the different permutations, just take the first 1000 permutations.

check if matrixs elements are all not equal or deferent elements

I have the following vector v = [r1 r2 r3 r4 r5 .... rn], with r integer numbers.
I want to check:
if r1 not equal to r2 not equal to r3 ... not equal to rn (all different of each other):
print v
else (some elements are equal and other not equal):
print the index of the equal elements.
I recommend using the unique command. It will return the values of all unique values.
If you want to check that all values in your matrix v are unique, I would use the following command:
everything_is_unique = length(unique(v))==length(v);
You can also return the indices of the equal elements.
See the documentation on unique for more information.
Alternate to using unique you can also sort the elements and check whether they are all different:
all(diff(sort(v)))
By using sort with more input arguements you could get the indices you are looking for.
I'll give one more solution, and a comparison of all methods tried so far.
My solution is based on the observation that when using builtin functions like sort() or unique(), you lose opportunities for early escapes. That is, the sort() will have to sort the vector completely before you can continue with your algorithm, even though this is not required if two equal values are already detected inside the sort.
Therefore, I simply iterate through the array, and compare the current value to all following values using any(). This works around some of these issues, and works well enough for a lot of cases.
However, the worst case complexity is O(N²), which is a hell of a lot worse than sort(), which has only O(N·log(N)). So as usual, it all depends on context :)
Trying this:
clc
N = 1e4;
% Zigzag's solution
tic
for ii = 1:1e2
v = randi(N, N,1);
length(unique(v))==length(v);
end
toc
% Dennis Jaheruddin's solution
tic
for ii = 1:1e4
v = randi(N, N,1);
all(diff(sort(v)));
end
toc
% My solution
tic
for ii = 1:1e4
v = randi(N, N,1);
cond = true;
for jj = 1:numel(v)
if any(v(jj) == v(jj+1:end))
cond = false;
break;
end
end
end
toc
The random numbers are generated inside the loops to ensure a variety of different cases will come by. Results on my PC:
Elapsed time is 16.787976 seconds. % unique
Elapsed time is 14.284696 seconds. % sort + diff
Elapsed time is 5.376655 seconds. % loop + any
So explicit looping (provided feature accel is on) with early exit is actually almost three times faster than the standard "vectorized" approach :)
PS - I also tried to nest another loop to try and improve having to compare all value before detecting equal values (first v(jj)==v(jj+1:end) is evaluated completely, before any() can start doing its job), but here, the overheads really start to get in the way (or the JIT is not coping well enough with this sort of thing, I don't know). In theory, this should be even faster of course, but unfortunately, not in MATLAB :)
However, change the random number generation
v = randi(N, N,1);
into
v = randi(N*N, N,1);
and the results are quite different:
Elapsed time is 0.162625 seconds. % unique
Elapsed time is 0.147369 seconds. % sort + diff
Elapsed time is 30.767247 seconds. % loop + any
Here I used only 100 iterations instead of 10.000, for obvious reasons :)

MATLAB matrix multiplication vs for loop for each column

When multiplying two matrices, I tried the following two options:
1)
res = X*A;
2)
for i = 1:size(A,2)
res(:,i) = X*A(:,i);
end
I preallocated memory for res in both. And surprisingly, I found option 2 to be faster.
Can someone explain how this is so?
edit:
I tried
K=10000;
clear t1 t2
t1=zeros(K,1);
t2=zeros(K,1);
for k=1:K
clear res
x = rand(100,100);
a = rand(100,100);
tic
res = x*a;
t1(k) = toc;
end
for k=1:K
clear res2
res2 = zeros(100,100);
x = rand(100,100);
a = rand(100,100);
tic
for i = 1:100
res2(:,i) = x*a(:,i);
end
t2(k) = toc;
end
I run both codes in a loop 1000 times. In average (but not always) the first vectorized code was 3-4 times faster. I cleared the result variables and preallocated before starting timer.
x = rand(100,100);
a = rand(100,100);
K=1000;
clear t1 t2
t1=zeros(K,1);
t2=zeros(K,1);
for k=1:K
clear res
tic
res = x*a;
t1(k) = toc;
end
for k=1:K
clear res2
res2 = zeros(100,100);
tic
for i = 1:100
res2(:,i) = x*a(:,i);
end
t2(k) = toc;
end
So, never make a timing conclusion based on a single run.
I believe I can chime in on the variation in timings between the two methods, as well as why people are getting different relative speeds.
Before Matlab version 2008a (or a version near that release), for loops took a major hit in any Matlab code because the interpreter (a layer between the very readable script and a lower level implementation of the code) would have to re-interpret the code each time through the for loop.
Since that release, the interpreter has gotten progressively better so, when running a modern version of Matlab, the interpreter can look at your code and say "Ah ha! I know what he is doing, let me optimize it just a bit" and avoid the hit it would otherwise take by reinterpreting the code.
I would expect the two ways of performing matrix multiplies to evaluate in the same amount of time, why the for loop implementation runs faster is because of some detail in the optimizations of the interpreter that us mere mortals are not privy to know.
One broad lesson we should take from this, is not all versions are equal. I do work on a couple of bleeding edge cases using two Matlab add ons, the SimBiology and the Parallel Computing Toolboxes, both of which (especially if you want them to work together) are version dependent in speed of execution, and from time to time other stability issues. As such, I keep the three most recent releases of Matlab, will test that I get the same answers out of each version, and I'll occasionally roll back to an earlier version if I find issues with some features. This is probably overkill for most people, but gives you an idea of version differences.
Hope this helps.
Edits:
To clarify, code vectorization is still important. But given a script like:
x_slow = zeros(1,1e5);
x_fast = zeros(1,1e5);
tic;
for i=1:1e5
x_slow(i) = log(i);
end
time_slow = toc; % evaluates for me in .0132 seconds
tic;
x_fast = log(1:1e5);
time_fast = toc; % evaluates for me in .0055 seconds
The disparity between time_slow and time_fast has reduced in the past several versions based on improvements in the interpreter. The example I saw I believe was on 2000a vs. 2008b, but that's subject to my recollection.
There is something else that might be going on that was addressed by Oli and Yuk. There is often a difference between the time_1 and time_2 in:
tic; x = log(1:1e5); time_1 = toc
tic; x = log(1:1e5); time_2 = toc
So the test of one million evaluations vs. one evaluation is valuable, depending on where in memory x is (in cache or no).
Hope this helps again.
This may well be an effect of caching. a is already in the cache by the time you do the second version, so it has an advantage. Try creating an independent set of inputs to make it fair. Also, it's probably better to measure the time of e.g. 1 million iterations of this, in order to eliminate typical variations due to outside effects.
It looks to me that you are not multiplying matrix properly, you need to sum all the products from ith row of X matrix and jth column of A matrix, that might be a reason.
Look here to see how it's done.