Related
People have asked similar questions before but none has a satisfactory answer. I'm trying to solve Lindblad Master Equation and the matrix size I'm trying to simulate are of order 10000 x 10000. But the problem is with exponentiation of the matrix, which is consuming a lot of RAM.
The MATLAB and Python expm() function take around 20s and 80s for a matrix of size 1000 x 1000 respectively. The code I've shown below.
pd = makedist('Normal');
N = 1000;
r = random(pd ,[N, N]);
t0 = tic;
r = expm(r);
t_total = toc(t0);
The problem comes when I try to do the same for a matrix of size 10000 x 10000. Whenever I apply expm(), the RAM usage grows and it take all the RAM and SWAP memory on my PC (I've 128 GB RAM and 64 Core CPU) and it's same in case of both MATLAB and Scipy. I don't understand what is taking so much RAM and how can I efficiently rum expm() or if it is not possible at all? Even if I could do it on any other language efficiently it would be really helpful!
I am having Out of Memory error while trying to solve a certain linear equation (I will put the code below). Since I am used to coding in C where you have every control over the objects you create I am wondering if I am using matlab inefficiently. Here is the relevant part of the code
myData(n).AMatrix = sparse(fscanf(fid2, '%f', [2*M, 2*M]));
myData(n).AMatrix = transpose(myData(n).AMatrix);
%Read the covariance^2 matrix
myData(n).CovMatrix = sparse(fscanf(fid2, '%f', [2*M,2*M]));
myData(n).CovMatrix = reshape(myData(n).CovMatrix, [4*M*M,1]);
%Kronecker sum of A with itself
I=sparse(eye(2*M));
myData(n).AA=kron( I, myData(n).AMatrix)+kron( myData(n).AMatrix,I);
myData(n).AMatrix=[];
I=[];
%Solve (A+A)x = Vec(CovMatrix)
x=myData(n).CovMatrix\myData(n).AA;
Trying to use this code I get the error
Error using \
Out of memory. Type HELP MEMORY for your options.
Error in COV (line 62)
x=myData(n).CovMatrix\myData(n).AA;
Before this piece of code I only open some files (which contain two 100x100 array of floats) so I dont think they contribute to this error. The element AMatrix is a 100 x 100 array. So the linear equation in question has dimensions 10000 x 10000. Also AA has one dimensional kernel, I dont know if this affects the numerical computations. Later I project the obtained solution to the orthogonal complement of the kernel to get the "good" solution but it comes after the error. For people who are familiar with it this is just a solution to the Lyapunov equation AX + XA = Cov. The matrix A is sparse, it has 4 50x50 sublocks one of which is all zeros, the other is identity, the other is diagonal and the other has less than 1000 non-zero elements. The matrix CovMatrix is diagonal with 50 non-zero elements in the diagonal.
The problem is at the moment I can only do the calculations on a small personal computer with 2GB RAM with 2.5-6GB of virtual memmory. When I run memmory on matlab it gives
>> memory
Maximum possible array: 311 MB (3.256e+08 bytes) *
Memory available for all arrays: 930 MB (9.749e+08 bytes) **
Memory used by MATLAB: 677 MB (7.102e+08 bytes)
Physical Memory (RAM): 1931 MB (2.025e+09 bytes)
I am not very knowledgable when it comes to memory so I am open to even simple advices. Thanks.
Complex functions usually allocate temp memory during computation. 10000x10000 looks quite large if a temp dense matrix of such size is allocated during the computation. You could try a few smaller problem sizes and find out the upper limit of your current computer.
I have a code that works as non linear system equation solver.
I have so much trouble with a command that goes like this:
newt[0]:=[-2.,20]:
I don't know what does that dot works there!
I thought it may be for showing that it is -2.0, but there is no reason to use that when by default -2 = -2.0.
Can anyone help me with this?
The dot forces float calculations
It is not correct that by default -2 = -2.0. There is a very big difference for Maple in how it calculates: if you use -2 it calculates exacts (arithmetic expressions) while -2.0 tells Maple to calculate with floats (numerical expressions).
The two expressions -2.*sqrt(5) and -2*sqrt(5.) are quite different in how Maple handles them, if you notice the float position! For the first example, the square root is calculated arithmetically, while in the second example it is calculated numerically!
This can be a very big deal for some calculations; both with regards to speed and precision, and should be considered carefully when one wants to do complicated computations.
Speed example: Calculate exp(x) for x = 1,2,...,50000. (Arithmetic > numerical)
CodeTools:-Usage(seq(exp(x),x=1..50000)): # Arithmetic
memory used=19.84MiB, alloc change=0 bytes, cpu time=875.00ms,
real time=812.00ms, gc time=265.62ms
CodeTools:-Usage(seq(exp(1.*x),x=1..50000)): # Numerical
memory used=292.62MiB, alloc change=0 bytes, cpu time=9.67s,
real time=9.45s, gc time=1.09s
Notice especially the huge difference in memory used.
This is an example of when using floats gives worse performance. On the contrary, if we are just approximating anyways, numerical approximation is much faster.
Approximate exp(1) (numerical > arithmetic)
CodeTools:-Usage(seq((1+1/x)^x,x=1..20000)): # Arithmetic
memory used=0.64GiB, alloc change=0 bytes, cpu time=39.05s,
real time=40.92s, gc time=593.75ms
CodeTools:-Usage(seq((1+1./x)^x,x=1..20000)): # Numerical
memory used=56.17MiB, alloc change=0 bytes, cpu time=1.06s,
real time=1.13s, gc time=0ns
Precision example: For precision, things can go very wrong if one is not careful.
f:=x->(Pi-x)/sin(x);
limit(f(x),x=Pi); # Arithmetic returns 1 (true value)
limit(f(x),x=Pi*1.); # Numerical returns 0 (wrong!!!)
After a little working with that I finally found what it does!
short answer: it calculate the result of expression where those 2 integers are inputs.
extended answer:(example)
given 2 functions, we want to calculate Jacobin matrix for this equation system
with(linalg);
with(plots);
f := proc (x, y) -> (1/64)*(x-11)^2-(1/100)*(y-7)^2-1;
g := proc (x, y) -> (x-3)^2+(y-1)^2-400;
then we put functions in vector:
F:=(x, y) -> vector([f(x,y),g(x,y)]);
F(-2 ,20)
F(-2.,20)
result will be this:
[-79/1600 -14]
[-0.049375000 -14]
I'm testing svd in Matlab R2014a and it seems that there is no CPU vs GPU speedup. I'm using a GTX 460 card and a Core 2 duo E8500.
Here is my code:
%test SVD
n=10000;
%host
Mh= rand(n,1000);
tic
%[Uh,Sh,Vh]= svd(Mh);
svd(Mh);
toc
%device
Md = gpuArray.rand(n,1000);
tic
%[Ud,Sd,Vd]= svd(Md);
svd(Md);
toc
Also, the run times are different from run to run, but the CPU and GPU versions are about the same. Why there is no speedup?
Here are some tests
for i=1:10
clear;
m= 10000;
n= 100;
%host
Mh= rand(m,n);
tic
[Uh,Sh,Vh]= svd(Mh);
toc
%device
Md = gpuArray.rand(m,n);
tic
[Ud,Sd,Vd]= svd(Md);
toc
end
>> test_gpu_svd
Elapsed time is 43.124130 seconds.
Elapsed time is 43.842277 seconds.
Elapsed time is 42.993283 seconds.
Elapsed time is 44.293410 seconds.
Elapsed time is 42.924541 seconds.
Elapsed time is 43.730343 seconds.
Elapsed time is 43.125938 seconds.
Elapsed time is 43.645095 seconds.
Elapsed time is 43.492129 seconds.
Elapsed time is 43.459277 seconds.
Elapsed time is 43.327012 seconds.
Elapsed time is 44.040959 seconds.
Elapsed time is 43.242291 seconds.
Elapsed time is 43.390881 seconds.
Elapsed time is 43.275379 seconds.
Elapsed time is 43.408705 seconds.
Elapsed time is 43.320387 seconds.
Elapsed time is 44.232156 seconds.
Elapsed time is 42.984002 seconds.
Elapsed time is 43.702430 seconds.
for i=1:10
clear;
m= 10000;
n= 100;
%host
Mh= rand(m,n,'single');
tic
[Uh,Sh,Vh]= svd(Mh);
toc
%device
Md = gpuArray.rand(m,n,'single');
tic
[Ud,Sd,Vd]= svd(Md);
toc
end
>> test_gpu_svd
Elapsed time is 21.140301 seconds.
Elapsed time is 21.334361 seconds.
Elapsed time is 21.275991 seconds.
Elapsed time is 21.582602 seconds.
Elapsed time is 21.093408 seconds.
Elapsed time is 21.305413 seconds.
Elapsed time is 21.482931 seconds.
Elapsed time is 21.327842 seconds.
Elapsed time is 21.120969 seconds.
Elapsed time is 21.701752 seconds.
Elapsed time is 21.117268 seconds.
Elapsed time is 21.384318 seconds.
Elapsed time is 21.359225 seconds.
Elapsed time is 21.911570 seconds.
Elapsed time is 21.086259 seconds.
Elapsed time is 21.263040 seconds.
Elapsed time is 21.472175 seconds.
Elapsed time is 21.561370 seconds.
Elapsed time is 21.330314 seconds.
Elapsed time is 21.546260 seconds.
Generally SVD is a difficult to paralellize routine. You can check here that with a high end Tesla card, the speedup is not very impressive.
You have a GTX460 card - Fermi architecture. The card is optimized for gaming (single precision computations), not HPC (double precision computation). The Single Precision / Double Precision throughput ratio is 12. So the card has 873 GFLOPS SP / 72 GFLOPS DP. Check here.
So if the Md array uses double precision elements, then the computation on it would be rather slow. Also there's a high chance that when calling the CPU routine, all CPU cores will get utilized, reducing the possible gain of running the routine on the GPU. Plus, in the GPU run you pay time for transferring the buffer to the device.
Per Divakar's suggestion, you could use Md = single(Md) to convert your array to single precision and run the benchmark again. You can try and go with a bigger dataset size to see if something changes. I don't expect to much gain for this routine on your GPU.
Update 1:
After you posted the results, I saw that the DP/SP time ratio is 2. On the CPU side this is normal, because you can fit 2 times less double values in SSE registers. However, a ratio of only 2 on the GPU side means that the gpu code does not make best use of the SM cores - because the theoretical ratio is 12. In other words, I would have expected much better SP performance for an optimized code, compared to DP. It seems that this is not the case.
As VAndrei has already stated, the SVD is an algorithm which is difficult to parallelize.
Your main problem is the size of your matrix. The performance of the SVD drops rapidly with a growing matrix size. So your main goal should be to reduce the size of the matrix.
This can be accomplished using Gaussian normal equations (which is basically a reduction of an overdetermined linear system in the least-squares sense).
This can be done by simply multiplying the transpose onto the matrix:
MhReduced = Mh' * Mh;
This reduces your matrix to the size of cols*cols (if cols is the number of columns of Mh). Then you just call [U,S,V] = svd(MhReduced);
Note: Using this method may yield singular vectors with opposite sign (just important if you're comparing these methods).
If your matix is well-conditioned this should work without problems. However, in case of an ill-conditioned matrix, this method may fail to produce a usable result, whereas applying SVD directly could still yield a usable result due to SVD's robustness.
This should increase your performance immensly, at least with matrices big enough. Another advantage is that you can use much larger matrices. You'll probably won't have to use the GPU at all (since either matrices are so big that copying to GPU costs too much or after reduction the matrix is so small that the speedup of the GPU won't be big enough).
Also note that a large chunk of performance is lost, if you use return values. If you're only interested in the performance of the SVD caluclation, don't take any return values. If you are only interested in the "solution vector", just get V (and access the last column): [~,~, V] = svd(Mh);.
EDIT:
I've looked at your sample code, but I'm not sure what it is, you are calculating. Also I realized that it's rather hard to understand what I did with A'*A, so I will explain in detail.
Given a linear system with A*x=b, A denoting the coefficient matrix
with m rows and n cols, x the solution vector and b the constant vector (both with m rows), a solution can be calculated as follows:
if A is square (m=n): x = A^-1 * b,
if A is not square (m!=n, m > n):
A * x = b
A'* A * x = A' * b
x = (A' * A)^-1 * A'*b
A" = (A'*A)^-1 * A' is typically called pseudo-inverse. However this calculation does influence the condition number of the matrix negatively. A solution to this problem is using a singular value decomposition (SVD).
If USV = svd(A) denotes the results of the SVD, the pseudo-inverse is given by VS"U', with S" is formed by taking the inverse of the non-zero elements of S.
So A" = VS"U'.
x = A"*b
However since a SVD is rather costly, especially with large matrices. If matrix A is well-conditioned and very precicse results are not necessarily required (we're talking 1e-13 or 1e-14), the much faster approach by calculating the peseudo-inverse via (A'*A)^-1 * A can be used.
If your case actually is A*x=0, just use a SVD and read the last column vector from V, it is the solution.
If you use the SVD not to solve a linear system but for the results of U and S (as your example suggests), I'm not sure what I've posted will help you.
Sources:
1, 2, 3
Here is some sample code for you to test. Test it with large matrices, you will see that using (A'*A)^-1 * A' is much faster than the alternatives.
clear all
nbRows = 30000;
nbCols = 100;
% Matrix A
A = rand(nbRows,nbCols);
% Vector b
b = rand(nbRows,1);
% A*x=b
% Solve for x, using SVD
% [U,S,V]=svd(A,0);
% x= V*((U'*b)./diag(S))
tic
[U1,S1,V1]=svd(A,0);
x1= V1*((U1'*b)./diag(S1));
toc
tic
[U1,S1,V1]=svd(A,0);
x2 = V1*inv(S1)*U1'*b;
toc
% Solve for x, using manual pseudo-inverse
% A*x=b
% A'*A*x = A'*b
% x = (A'*A)^-1 * A'*b
tic
x3 = inv(A'*A) * A'*b;
toc
% Solve for x, let Matlab decide how (most likely SVD)
tic
x4 = A\b;
toc
The issue
First of all, I have replicated your issue in Matlab2016b using the following code:
clear all
close all
clc
Nrows = 2500;
Ncols = 2500;
NumTests = 10;
h_A = rand(Nrows, Ncols);
d_A = gpuArray.rand(Nrows, Ncols);
timingCPU = 0;
timingGPU = 0;
for k = 1 : NumTests
% --- Host
tic
[h_U, h_S, h_V] = svd(h_A);
% h_S = svd(h_A);
timingCPU = timingCPU + toc;
% --- Device
tic
[d_U, d_S, d_V] = svd(d_A);
% d_S = svd(d_A);
timingGPU = timingGPU + toc;
end
fprintf('Timing CPU = %f; Timing GPU = %f\n', timingCPU / NumTests, timingGPU / NumTests);
By the above code, it is possible to either compute the singular values only or compute the full SVD including the singular vectors. It is possible also to compare the different behavior of the CPU and GPU versions of the SVD code.
The timing is reported in the following table (timing in s; Intel Core i7-6700K CPU # 4.00GHz, 16288 MB, Max threads(8), GTX 960):
Sing. values only | Full SVD | Sing. val. only | Full
| | |
Matrix size CPU GPU | CPU GPU | |
| | |
200 x 200 0.0021 0.043 | 0.0051 0.024 | 0.098 | 0.15
1000 x 1000 0.0915 0.3 | 0.169 0.458 | 0.5 | 2.3
2500 x 2500 3.35 2.13 | 4.62 3.97 | 2.9 | 23
5000 x 5000 5.2 13.1 | 26.6 73.8 | 16.1 | 161
The first 4 columns refer to a comparison between the CPU and GPU Matlab versions of the svd routine when it is used to calculate the singular values only or the full SVD. As it can be seen, the GPU version can be significantly slower than the GPU one. The motivation has been already pointed out in some answers above: there is an inherent difficulty to parallelize the SVD computation.
Using cuSOLVER?
At this point, the obvious question is: can we get some speedup with cuSOLVER? Indeed, we could use mexFiles to make the cuSOLVER routines run under Matlab. Unfortunately, the situation with cuSOLVER is even worse, as it can be deduced from the last two columns of the above table. Such columns report the timing of the codes at Singular values calculation only with CUDA and Parallel implementation for multiple SVDs using CUDA using cusolverDnSgesvd for the singular values only calculation and full SVD calculation, respectively. As it can be seen, cuSOLVER's cusolverDnSgesvd performs even worser than Matlab, if one takes into account that it deals with single precision, while Matlab with double precision.
The motivation for this behavior is further explained at cusolverDnCgesvd performance vs MKL where Joe Eaton, manager of cuSOLVER library, says
I understand the confusion here. We do provide a decent speedup for
LU, QR and LDL^t factorizations, which is what we would like to say
for SVD as well. Our purpose with cuSOLVER is to provide dense and
sparse direct solvers as part of the CUDA toolkit for the first time;
we have to start somewhere. Since CULA is no longer supported, we felt
it was urgent to get some functionality into the hands of developers
in CUDA 7.0. Since CUDA runs on more that x86 host CPUs these days,
cuSOLVER fills a need where there is no MKL. That being said, we can
do better with SVD, but it will have to wait for the next CUDA
release, priorities and timelines being tight already.
Using other libraries
At this point, other possibilities are using other libraries like
CULA;
MAGMA;
ArrayFire.
CULA is not offered for free, so I have not tried it.
I had some installation issues with MAGMA dependencies, so I have not investigated this point further (disclaimer: I expect that, with some more time, I could be able to solve such issues).
I then finally ended up with using ArrayFire.
Using ArrayFire, I had the following timing for the full SVD computation:
200 x 200 0.036
1000 x 1000 0.2
2500 x 2500 4.5
5000 x 5000 29
As it can be seen, the timing is slightly higher, but now comparable, to the CPU case.
Here is the ArrayFire code:
#include <arrayfire.h>
#include <cstdio>
#include <cstdlib>
#include <fstream>
using namespace af;
int main(int argc, char *argv[])
{
const int N = 1000;
try {
// --- Select a device and display arrayfire info
int device = argc > 1 ? atoi(argv[1]) : 0;
af::setDevice(device);
af::info();
array A = randu(N, N, f64);
af::array U, S, Vt;
// --- Warning up
timer time_last = timer::start();
af::svd(U, S, Vt, A);
S.eval();
af::sync();
double elapsed = timer::stop(time_last);
printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);
time_last = timer::start();
af::svd(U, S, Vt, A);
S.eval();
af::sync();
elapsed = timer::stop(time_last);
printf("elapsed time using start and stop = %g ms \n", 1000.*elapsed);
}
catch (af::exception& e) {
fprintf(stderr, "%s\n", e.what());
throw;
}
return 0;
}
I have tried to parallelize SVD on my laptop equipped with GTX 460 for over one months, which was also a part of my undergraduate thesis, I did so many experiments that I later discovered that MATLAB is extremely fast and outperforms my code, by the way, I used one side Jacobi, and I have not yet seen any paper that reveals an algorithm faster than svd of MATLAB. On GPU, the time cost of memory copy can be very high if you are not using an elegant model, I refer you to read more about CUDA.
If you need any help, please contact me.
Is there a vectorised way to do the following? (shown by an example):
input_lengths = [ 1 1 1 4 3 2 1 ]
result = [ 1 2 3 4 4 4 4 5 5 5 6 6 7 ]
I have spaced out the input_lengths so it is easy to understand how the result is obtained
The resultant vector is of length: sum(lengths). I currently calculate result using the following loop:
result = ones(1, sum(input_lengths ));
counter = 1;
for i = 1:length(input_lengths)
start_index = counter;
end_index = counter + input_lengths (i) - 1;
result(start_index:end_index) = i;
counter = end_index + 1;
end
EDIT:
I can also do this using arrayfun (although that is not exactly a vectorised function)
cell_result = arrayfun(#(x) repmat(x, 1, input_lengths(x)), 1:length(input_lengths), 'UniformOutput', false);
cell_result : {[1], [2], [3], [4 4 4 4], [5 5 5], [6 6], [7]}
result = [cell_result{:}];
result : [ 1 2 3 4 4 4 4 5 5 5 6 6 7 ]
A fully vectorized version:
selector=bsxfun(#le,[1:max(input_lengths)]',input_lengths);
V=repmat([1:size(selector,2)],size(selector,1),1);
result=V(selector);
Downside is, the memory usage is O(numel(input_lengths)*max(input_lengths))
Benchmark of all solutions
Following the previous benchmark, I group all solutions given here in a script and run it a few hours for a benchmark. I've done this because I think it's good to see what is the performance of each proposed solution with the input lenght as parameter - my intention is not here to put down the quality of the previous one, which gives additional information about the effect of JIT. Moreover, and every participant seems to agree with that, quite a good work was done in all answers, so this great post deserves a conclusion post.
I won't post the code of the script here, this is quite long and very uninteresting. The procedure of the benchmark is to run each solution for a set of different lengths of input vectors: 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, 1000000. For each input length, I've generated a random input vector based on Poisson law with parameter 0.8 (to avoid big values):
input_lengths = round(-log(1-rand(1,ILen(i)))/poisson_alpha)+1;
Finally, I average the computation times over 100 runs per input length.
I've run the script on my laptop computer (core I7) with Matlab R2013b; JIT is activated.
And here are the plotted results (sorry, color lines), in a log-log scale (x-axis: input length; y-axis: computation time in seconds):
So Luis Mendo is the clear winner, congrats!
For anyone who wants the numerical results and/or wants to replot them, here they are (cut the table into 2 parts and approximated to 3 digits, for a better display):
N 10 20 50 100 200 500 1e+03 2e+03
-------------------------------------------------------------------------------------------------------------
OP's for-loop 8.02e-05 0.000133 0.00029 0.00036 0.000581 0.00137 0.00248 0.00542
OP's arrayfun 0.00072 0.00117 0.00255 0.00326 0.00514 0.0124 0.0222 0.047
Daniel 0.000132 0.000132 0.000148 0.000118 0.000126 0.000325 0.000397 0.000651
Divakar 0.00012 0.000114 0.000132 0.000106 0.000115 0.000292 0.000367 0.000641
David's for-loop 9.15e-05 0.000149 0.000322 0.00041 0.000654 0.00157 0.00275 0.00622
David's arrayfun 0.00052 0.000761 0.00152 0.00188 0.0029 0.00689 0.0122 0.0272
Luis Mendo 4.15e-05 4.37e-05 4.66e-05 3.49e-05 3.36e-05 4.37e-05 5.87e-05 0.000108
Bentoy13's cumsum 0.000104 0.000107 0.000111 7.9e-05 7.19e-05 8.69e-05 0.000102 0.000165
Bentoy13's sparse 8.9e-05 8.82e-05 9.23e-05 6.78e-05 6.44e-05 8.61e-05 0.000114 0.0002
Luis Mendo's optim. 3.99e-05 3.96e-05 4.08e-05 4.3e-05 4.61e-05 5.86e-05 7.66e-05 0.000111
N 5e+03 1e+04 2e+04 5e+04 1e+05 2e+05 5e+05 1e+06
-------------------------------------------------------------------------------------------------------------
OP's for-loop 0.0138 0.0278 0.0588 0.16 0.264 0.525 1.35 2.73
OP's arrayfun 0.118 0.239 0.533 1.46 2.42 4.83 12.2 24.8
Daniel 0.00105 0.0021 0.00461 0.0138 0.0242 0.0504 0.126 0.264
Divakar 0.00127 0.00284 0.00655 0.0203 0.0335 0.0684 0.185 0.396
David's for-loop 0.015 0.0286 0.065 0.175 0.3 0.605 1.56 3.16
David's arrayfun 0.0668 0.129 0.299 0.803 1.33 2.64 6.76 13.6
Luis Mendo 0.000236 0.000446 0.000863 0.00221 0.0049 0.0118 0.0299 0.0637
Bentoy13's cumsum 0.000318 0.000638 0.00107 0.00261 0.00498 0.0114 0.0283 0.0526
Bentoy13's sparse 0.000414 0.000774 0.00148 0.00451 0.00814 0.0191 0.0441 0.0877
Luis Mendo's optim. 0.000224 0.000413 0.000754 0.00207 0.00353 0.00832 0.0216 0.0441
Ok, I've added another solution to the list ... I could not prevent myself to optimize the best-so-far solution of Luis Mendo. No credit for that, it's just a variant from Luis Mendo's, I'll explain it later.
Clearly, the solutions using arrayfun are very time-consuming. The solutions using an explicit for loop are faster, yet still slow compared with others solutions. So yes, vectorizing is still a major option for optimizing a Matlab script.
Since I've seen a big dispersion on the computing times of the fastest solutions, especially with input lengths between 100 and 10000, I decide to benchmark more precisely. So I've put the slowest apart (sorry), and redo the benchmark over the 6 other solutions which run much faster. The second benchmark over this reduced list of solutions is identical except that I've average over 1000 runs.
(No table here, unless you really want to, it's quite the same numbers as before)
As it was remarked, the solution by Daniel is a little faster than the one by Divakar because it seems that the use of bsxfun with #times is slower than using repmat. Still, they are 10 times faster than for-loop solutions: clearly, vectorizing in Matlab is a good thing.
The solutions of Bentoy13 and Luis Mendo are very close; the first one uses more instructions, but the second one uses an extra allocation when concatenating 1 to cumsum(input_lengths(1:end-1)). And that's why we see that Bentoy13's solution tends to be a bit faster with big input lengths (above 5.10^5), because there is no extra allocation. From this consideration, I've made an optimized solution where there is no extra allocation; here is the code (Luis Mendo can put this one in his answer if he wants to :) ):
result = zeros(1,sum(input_lengths));
result(1) = 1;
result(1+cumsum(input_lengths(1:end-1))) = 1;
result = cumsum(result);
Any comment for improvement is welcome.
More of a comment than anything, but I did some tests. I tried a for loop, and an arrayfun, and I tested your for loop and arrayfun version. Your for loop was the fastest. I think this is because it is simple, and allows the JIT compilation to do the most optimisation. I am using Matlab, octave might be different.
And the timing:
Solution: With JIT Without JIT
Sam for 0.74 1.22
Sam arrayfun 2.85 2.85
My for 0.62 2.57
My arrayfun 1.27 3.81
Divakar 0.26 0.28
Bentoy 0.07 0.06
Daniel 0.15 0.16
Luis Mendo 0.07 0.06
So Bentoy's code is really fast, and Luis Mendo's is almost exactly the same speed. And I rely on JIT way too much!
And the code for my attempts
clc,clear
input_lengths = randi(20,[1 10000]);
% My for loop
tic()
C=cumsum(input_lengths);
D=diff(C);
results=zeros(1,C(end));
results(1,1:C(1))=1;
for i=2:length(input_lengths)
results(1,C(i-1)+1:C(i))=i*ones(1,D(i-1));
end
toc()
tic()
A=arrayfun(#(i) i*ones(1,input_lengths(i)),1:length(input_lengths),'UniformOutput',false);
R=[A{:}];
toc()
result = zeros(1,sum(input_lengths));
result(cumsum([1 input_lengths(1:end-1)])) = 1;
result = cumsum(result);
This should be pretty fast. And memory usage is the minimum possible.
An optimized version of the above code, due to Bentoy13 (see his very detailed benchmarking):
result = zeros(1,sum(input_lengths));
result(1) = 1;
result(1+cumsum(input_lengths(1:end-1))) = 1;
result = cumsum(result);
This is a slight variant of #Daniel's answer. The crux of this solution is based on that solution. Now this one avoids repmat, so in that way it's little-more "vectorized" maybe. Here's the code -
selector=bsxfun(#le,[1:max(input_lengths)]',input_lengths); %//'
V = bsxfun(#times,selector,1:numel(input_lengths));
result = V(V~=0)
For all the desperate one-liner searching people -
result = nonzeros(bsxfun(#times,bsxfun(#le,[1:max(input_lengths)]',input_lengths),1:numel(input_lengths)))
I search an elegant solution, and I think David's solution is a good start. What I have in mind is that one can generate the indexes where to add one from previous element.
For that, if we compute the cumsum of the input vector, we get:
cumsum(input_lengths)
ans = 1 2 3 7 10 12 13
This is the indexes of the ends of sequences of identical numbers. That is not what we want, so we flip the vector twice to get the beginnings:
fliplr(sum(input_lengths)+1-cumsum(fliplr(input_lengths)))
ans = 1 2 3 4 8 11 13
Here is the trick. You flip the vector, cumsum it to get the ends of the flipped vector, and then flip back; but you must substract the vector from the total length of the output vector (+1 because index starts at 1) because cumsum applies on the flipped vector.
Once you have done this, it's very straightforward, you just have to put 1 at computed indexes and 0 elsewhere, and cumsum it:
idx_begs = fliplr(sum(input_lengths)+1-cumsum(fliplr(input_lengths)));
result = zeros(1,sum(input_lengths));
result(idx_begs) = 1;
result = cumsum(result);
EDIT
First, please have a look at Luis Mendo's solution, it is very close to mine but is more simpler and a bit faster (I won't edit mine even it is very close). I think at this date this is the fastest solution from all.
Second, while looking at others solutions, I've made up another one-liner, a little different from my initial solution and from the other one-liner. Ok, this won't be very readable, so take a breath:
result = cumsum( full(sparse(cumsum([1,input_lengths(1:end-1)]), ...
ones(1,length(input_lengths)), 1, sum(input_lengths),1)) );
I cut it on two lines. Ok now let's explain it.
The similar part is to build the array of the indexes where to increment the value of the current element. I use the solution of Luis Mendo's for that. To build in one line the solution vector, I use here the fact that it is in fact a sparse representation of the binary vector, the one we will cumsum at the very end. This sparse vector is build using our computed index vector as x positions, a vector of 1 as y positions, and 1 as the value to put at these locations. A fourth argument is given to precise the total size of the vector (important if the last element of input_lengths is not 1). Then we get the full representation of this sparse vector (else the result is a sparse vector with no empty element) and we can cumsum.
There is no use of this solution other than to give another solution to this problem. A benchmark can show that it is slower than my original solution, because of a heavier memory load.