Calculating Hamming weight efficiently in matlab - matlab

Given a MATLAB uint32 to be interpreted as a bit string, what is an efficient and concise way of counting how many nonzero bits are in the string?
I have a working, naive approach which loops over the bits, but that's too slow for my needs. (A C++ implementation using std::bitset count() runs almost instantly).
I've found a pretty nice page listing various bit counting techniques, but I'm hoping there is an easy MATLAB-esque way.
http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
Update #1
Just implemented the Brian Kernighan algorithm as follows:
w = 0;
while ( bits > 0 )
bits = bitand( bits, bits-1 );
w = w + 1;
end
Performance is still crappy, over 10 seconds to compute just 4096^2 weight calculations. My C++ code using count() from std::bitset does this in subsecond time.
Update #2
Here is a table of run times for the techniques I've tried so far. I will update it as I get additional ideas/suggestions.
Vectorized Scheiner algorithm => 2.243511 sec
Vectorized Naive bitget loop => 7.553345 sec
Kernighan algorithm => 17.154692 sec
length( find( bitget( val, 1:32 ) ) ) => 67.368278 sec
nnz( bitget( val, 1:32 ) ) => 349.620259 sec
Justin Scheiner's algorithm, unrolled loops => 370.846031 sec
Justin Scheiner's algorithm => 398.786320 sec
Naive bitget loop => 456.016731 sec
sum(dec2bin(val) == '1') => 1069.851993 sec
Comment: The dec2bin() function in MATLAB seems to be very poorly implemented. It runs extremely slow.
Comment: The "Naive bitget loop" algorithm is implemented as follows:
w=0;
for i=1:32
if bitget( val, i ) == 1
w = w + 1;
end
end
Comment:
The loop unrolled version of Scheiner's algorithm looks as follows:
function w=computeWeight( val )
w = val;
w = bitand(bitshift(w, -1), uint32(1431655765)) + ...
bitand(w, uint32(1431655765));
w = bitand(bitshift(w, -2), uint32(858993459)) + ...
bitand(w, uint32(858993459));
w = bitand(bitshift(w, -4), uint32(252645135)) + ...
bitand(w, uint32(252645135));
w = bitand(bitshift(w, -8), uint32(16711935)) + ...
bitand(w, uint32(16711935));
w = bitand(bitshift(w, -16), uint32(65535)) + ...
bitand(w, uint32(65535));

I'd be interested to see how fast this solution is:
function r = count_bits(n)
shifts = [-1, -2, -4, -8, -16];
masks = [1431655765, 858993459, 252645135, 16711935, 65535];
r = n;
for i=1:5
r = bitand(bitshift(r, shifts(i)), masks(i)) + ...
bitand(r, masks(i));
end
Going back, I see that this is the 'parallel' solution given on the bithacks page.

Unless this is a MATLAB implementation exercise, you might want to just take your fast C++ implementation and compile it as a mex function, once per target platform.

EDIT: NEW SOLUTION
It appears that you want to repeat the calculation for every element in a 4096-by-4096 array of UINT32 values. If this is what you are doing, I think the fastest way to do it in MATLAB is to use the fact that BITGET is designed to operate on matrices of values. The code would look like this:
numArray = ...your 4096-by-4096 matrix of uint32 values...
w = zeros(4096,4096,'uint32');
for iBit = 1:32,
w = w+bitget(numArray,iBit);
end
If you want to make vectorized versions of some of the other algorithms, I believe BITAND is also designed to operate on matrices.
The old solution...
The easiest way I can think of is to use the DEC2BIN function, which gives you the binary representation (as a string) of a non-negative integer:
w = sum(dec2bin(num) == '1'); % Sums up the ones in the string
It's slow, but it's easy. =)

Implemented the "Best 32 bit Algorithm" from the Stanford link at the top.
The improved algorithm reduced processing time by 6%.
Also optimized the segment size and found that 32K is stable and improves time by 15% over 4K.
Expect 4Kx4K time to be 40% of Vectorized Scheiner Algorithm.
function w = Ham(w)
% Input uint32
% Output vector of Ham wts
for i=1:32768:length(w)
w(i:i+32767)=Ham_seg(w(i:i+32767));
end
end
% Segmentation gave reduced time by 50%
function w=Ham_seg(w)
%speed
b1=uint32(1431655765);
b2=uint32(858993459);
b3=uint32(252645135);
b7=uint32(63); % working orig binary mask
w = bitand(bitshift(w, -1), b1) + bitand(w, b1);
w = bitand(bitshift(w, -2), b2) + bitand(w, b2);
w =bitand(w+bitshift(w, -4),b3);
w =bitand(bitshift(w,-24)+bitshift(w,-16)+bitshift(w,-8)+w,b7);
end

Did some timing comparisons on Matlab Cody.
Determined a Segmented Modified Vectorized Scheiner gives optimimum performance.
Have >50% time reduction based on Cody 1.30 sec to 0.60 sec change for an L=4096*4096 vector.
function w = Ham(w)
% Input uint32
% Output vector of Ham wts
b1=uint32(1431655765); % evaluating saves 15% of time 1.30 to 1.1 sec
b2=uint32(858993459);
b3=uint32(252645135);
b4=uint32(16711935);
b5=uint32(65535);
for i=1:4096:length(w)
w(i:i+4095)=Ham_seg(w(i:i+4095),b1,b2,b3,b4,b5);
end
end
% Segmentation reduced time by 50%
function w=Ham_seg(w,b1,b2,b3,b4,b5)
% Passing variables or could evaluate b1:b5 here
w = bitand(bitshift(w, -1), b1) + bitand(w, b1);
w = bitand(bitshift(w, -2), b2) + bitand(w, b2);
w = bitand(bitshift(w, -4), b3) + bitand(w, b3);
w = bitand(bitshift(w, -8), b4) + bitand(w, b4);
w = bitand(bitshift(w, -16), b5) + bitand(w, b5);
end
vt=randi(2^32,[4096*4096,1])-1;
% for vt being uint32 the floor function gives unexpected values
tic
v=num_ones(mod(vt,65536)+1)+num_ones(floor(vt/65536)+1); % 0.85 sec
toc
% a corrected method is
v=num_ones(mod(vt,65536)+1)+num_ones(floor(double(vt)/65536)+1);
toc

A fast approach is counting the bits in each byte using a lookup table, then summing these values; indeed, it's one of the approaches suggested on the web page given in the question. The nice thing about this approach is that both lookup and sum are vectorizable operations in MATLAB, so you can vectorize this approach and compute the hamming weight / number of set bits of a large number of bit strings simultaneously, very quickly. This approach is implemented in the bitcount submission on the MATLAB File Exchange.

Try splitting the job into smaller parts. My guess is that if you want to process all data at once, matlab is trying to do each operation on all integers before taking successive steps and the processor's cache is invalidated with each step.
for i=1:4096,
«process bits(i,:)»
end

I'm reviving an old thread here, but I ran across this problem and I wrote this little bit of code for it:
distance = sum(bitget(bits, 1:32));
Looks pretty concise, but I'm scared that bitget is implemented in O(n) bitshift operations. The code works for what I'm going, but my problem set doesn't rely on hamming weight.

num_ones=uint8(zeros(intmax('uint32')/2^6,1));
% one time load of array not implemented here
tic
for i=1:4096*4096
%v=num_ones(rem(i,64)+1)+num_ones(floor(i/64)+1); % 1.24 sec
v=num_ones(mod(i,64)+1)+num_ones(floor(i/64)+1); % 1.20 sec
end
toc
tic
num_ones=uint8(zeros(65536,1));
for i=0:65535
num_ones(i+1)=length( find( bitget( i, 1:32 ) ) ) ;
end
toc
% 0.43 sec to load
% smaller array to initialize
% one time load of array
tic
for i=1:4096*4096
v=num_ones(mod(i,65536)+1)+num_ones(floor(i/65536)+1); % 0.95 sec
%v=num_ones(mod(i,65536)+1)+num_ones(bitshift(i,-16)+1); % 16 sec for 4K*1K
end
toc
%vectorized
tic
num_ones=uint8(zeros(65536,1));
for i=0:65535
num_ones(i+1)=length( find( bitget( i, 1:32 ) ) ) ;
end % 0.43 sec
toc
vt=randi(2^32,[4096*4096,1])-1;
tic
v=num_ones(mod(vt,65536)+1)+num_ones(floor(vt/65536)+1); % 0.85 sec
toc

Related

Why adding numbers on a single CPU core is 6 times faster than on a 32-core GPU? [duplicate]

I'm trying to speed up my computing by using gpuArray. However, that's not the case for my code below.
for i=1:10
calltest;
end
function [t1,t2]=calltest
N=10;
tic
u=gpuArray(rand(1,N).^(1./[N:-1:1]));
t1=toc
tic
u2=rand(1,N).^(1./[N:-1:1]);
t2=toc
end
where I get
t1 =
4.8445e-05
t2 =
1.4369e-05
I have an Nvidia GTX850M graphic card. Am I using gpuArray incorrectly? This code is wrapped inside a function, and the function is called by a loop thousands of times.
Why?Because there is botha) a small problem-scale &b) not a "mathematically-dense" GPU-kernel
The method of comparison is blurring the root-cause of the problem
Step 0: separate data-set ( a vector ) creation from section-under-test:
N = 10;
R = rand( 1, N );
tic; < a-gpu-based-computing-section>; GPU_t = toc
tic; c = R.^( 1. / [N:-1:1] ); CPU_t = toc
Step 1: test the scaling:
trying just 10 elements, will not make the observation clear, as an overhead-naive formulation of Amdahl Law does not explicitly emphasise the added time, spent on CPU-based GPU-kernel assembly & transport + ( CPU-to-GPU + GPU-to-CPU ) data-handling phases. These add-on phases may get negligibly small, if compared to
a) an indeed large-scale vector / matrix GPU-kernel processing, which N ~10 obviously is not
or
b) an indeed "mathematically-dense" GPU-kernel processing, which R.^() obviously is not
so,
do not blame the GPU-computing for having acquired a must-do part ( the overheads ) as it cannot get working without this prior add-ons in time ( and CPU may, during the same amount of time, produce the final result - Q.E.D. )
Fine-grain measurement, per each of the CPU-GPU-CPU-workflow sections:
N = 10; %% 100, 1000, 10000, 100000, ..
tic; CPU_hosted = rand( N, 'single' ); %% 'double'
CPU_gen_RAND = toc
tic; GPU_hosted_IN1 = gpuArray( CPU_hosted );
GPU_xfer_h2d = toc
tic; GPU_hosted_IN2 = rand( N, 'gpuArray' );
GPU_gen__h2d = toc
tic; <kernel-generation-with-might-be-lazy-eval-deferred-xfer-setup>;
GPU_kernel_AssyExec = toc
tic; CPU_hosted_RES = gather( GPU_hosted_RES );
GPU_xfer_d2h = toc

Vectorizing a matlab loop with internal functions

I have a 3D Mesh grid, X, Y, Z. I want to create a new 3D array that is a function of X, Y, & Z. That function comprises the sum of several 3D Gaussians located at different points. Currently, I have a for loop that runs over the different points where I have my gaussians, and I have an array of center locations r0(nGauss, 1:3)
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
where my 3D gaussian function is
function output=Gauss3D(X,Y,Z,r0)
output=exp(-(X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2);
end
I'm happy to redesign the function, which is the slowest part of my code and has to happen many many time, but I can't figure out how to vectorize this so that it will run faster. Any suggestions would be appreciated
*****NB the Original function had a square root in it, and has been modified to make it an actual gaussian***
NOTE! I've modified your code to create a Gaussian, which was:
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
That does not make a Gaussian. I changed this to:
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
(note no sqrt). This is a Gaussian with sigma = sqrt(1/2).
If this is not what you want, then this answer might not be very useful to you, because your function does not go to 0 as fast as the Gaussian, and therefore is harder to truncate, and it is not separable.
Vectorizing this code is pointless, as the other answers attest. MATLAB's JIT is perfectly capable of running this as fast as it'll go. But you can reduce the amount of computation significantly by noting that the Gaussian goes to almost zero very quickly, and is separable:
Most of the exp evaluations you're doing here yield a very tiny number. You don't need to compute those, just fill in 0.
exp(-x.^2-y.^2) is the same as exp(-x.^2).*exp(-y.^2), which is much cheaper to compute.
Let's put these two things to the test. Here is the test code:
function gaussian_test
N = 100;
r0 = rand(N,3)*20 - 10;
% Original
tic
[X,Y,Z] = meshgrid(-10:.1:10);
Psi1 = zeros(size(X));
for index = 1:N
Psi1 = Psi1 + Gauss3D(X,Y,Z,r0(index,:));
end
t = toc;
fprintf('original, time = %f\n',t)
% Fast, large truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi2 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi2 = Gauss3D_fast(Psi2,X,Y,Z,r0(index,:),5);
end
t = toc;
fprintf('tuncation = 5, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi2-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi2-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi2-Psi1),[],1)))
% Fast, smaller truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi3 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi3 = Gauss3D_fast(Psi3,X,Y,Z,r0(index,:),3);
end
t = toc;
fprintf('tuncation = 3, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi3-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi3-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi3-Psi1),[],1)))
% DIPimage, same smaller truncation
tic
Psi4 = newim(201,201,201);
coords = (r0+10) * 10;
Psi4 = gaussianblob(Psi4,coords,10*sqrt(1/2),(pi*100).^(3/2));
t = toc;
fprintf('DIPimage, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi4-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi4-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi4-Psi1),[],1)))
end % of function gaussian_test
function output = Gauss3D(X,Y,Z,r0)
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function Psi = Gauss3D_fast(Psi,X,Y,Z,r0,trunc)
% sigma = sqrt(1/2)
x = X-r0(1);
y = Y-r0(2);
z = Z-r0(3);
mx = abs(x) < trunc*sqrt(1/2);
my = abs(y) < trunc*sqrt(1/2);
mz = abs(z) < trunc*sqrt(1/2);
Psi(my,mx,mz) = Psi(my,mx,mz) + exp(-x(mx).^2) .* reshape(exp(-y(my).^2),[],1) .* reshape(exp(-z(mz).^2),1,1,[]);
% Note! the line above uses implicit singleton expansion. For older MATLABs use bsxfun
end
This is the output on my machine, reordered for readability (I'm still on MATLAB R2017a):
| time(s) | mean abs | mean sq. | max abs
--------------+----------+----------+----------+----------
original | 5.035762 | | |
tuncation = 5 | 0.169807 | 0.000000 | 0.000000 | 0.000005
tuncation = 3 | 0.054737 | 0.000452 | 0.000002 | 0.024378
DIPimage | 0.044099 | 0.000452 | 0.000002 | 0.024378
As you can see, using these two properties of the Gaussian we can reduce time from 5.0 s to 0.17 s, a 30x speedup, with hardly noticeable differences (truncating at 5*sigma). A further 3x speedup can be gained by allowing a small error. The smallest the truncation value, the faster this will go, but the larger the error will be.
I added that last method, the gaussianblob function from DIPimage (I'm an author), just to show that option in case you need to squeeze that bit of extra time from your code. That function is implemented in C++. This version that I used you will need to compile yourself. Our current official release implements this function still in M-file code, and is not as fast.
Further chance of improvement is if the fractional part of the coordinates is always the same (w.r.t. the pixel grid). In this case, you can draw the Gaussian once, and shift it over to each of the centroids.
Another alternative involves computing the Gaussian once, at a somewhat larger scale, and interpolating into it to generate each of the 1D Gaussians needed to generate the output. I did not implement this, I have no idea if it will be faster or if the time difference will be significant. In the old days, exp was expensive, I'm not sure this is still the case.
So, I am building off of the answer above me #Durkee. I enjoy these kinds of problems, so I thought a little about how to make each of the expansions implicit, and I have the one-line function below. Using this function I shaved .11 seconds off of the call, which is completely negligible. It looks like yours is pretty decent. The only advantage of mine might be how the code scales on a finer mesh.
xLin = [-10:.1:10]';
tic
psi2 = sum(exp(-sqrt((permute(xLin-r0(:,1)',[3 1 4 2])).^2 ...
+ (permute(xLin-r0(:,2)',[1 3 4 2])).^2 ...
+ (permute(xLin-r0(:,3)',[3 4 1 2])).^2)),4);
toc
The relative run times on my computer were (all things kept the same):
Original - 1.234085
Other - 2.445375
Mine - 1.120701
So this is a bit of an unusual problem where on my computer the unvectorized code actually works better than the vectorized code, here is my script
clear
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
nGauss = 20; %Sample nGauss as you didn't specify
r0 = rand(nGauss,3); % Just make this up as it doesn't really matter in this case
% Your original code
tic
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
toc
% Vectorize these functions so we can use implicit broadcasting
X1 = X(:);
Y1 = Y(:);
Z1 = Z(:);
tic
val = [X1 Y1 Z1];
% Change the dimensions so that r0 operates on the right elements
r0_temp = permute(r0,[3 2 1]);
% Perform the gaussian combination
out = sum(exp(-sqrt(sum((val-r0_temp).^2,2))),3);
toc
% Check to make sure both functions match
sum(abs(vec(Psi)-vec(out)))
function output=Gauss3D(X,Y,Z,r0)
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function out = vec(in)
out = in(:);
end
As you can see, this is probably about as vectorized as you can get. The whole function is done using broadcasting and vectorized operations which normally improve performance ten-one hundredfold. However, in this case, this is not what we see
Elapsed time is 1.876460 seconds.
Elapsed time is 2.909152 seconds.
This actually shows the unvectorized version as being faster.
There could be a few reasons for this of which I am by no means an expert.
MATLAB uses a JIT compiler now which means that for loops are no longer inefficient.
Your code is already reasonably vectorized, you are operating at 8 million elements at once
Unless nGauss is 1000 or something, you're not looping through that much, and at that point, vectorization means you will run out of memory
I could be hitting some memory threshold where I am using too much memory and that is making my code inefficient, I noticed that when I lowered the resolution on the meshgrid the vectorized version worked better
As an aside, I tested this on my GTX 1060 GPU with single precision(single precision is 10x faster than double precision on most GPUs)
Elapsed time is 0.087405 seconds.
Elapsed time is 0.241456 seconds.
Once again the unvectorized version is faster, sorry I couldn't help you out but it seems that your code is about as good as you are going to get unless you lower the tolerances on your meshgrid.

Apply function to rolling window

Say I have a long list A of values (say of length 1000) for which I want to compute the std in pairs of 100, i.e. I want to compute std(A(1:100)), std(A(2:101)), std(A(3:102)), ..., std(A(901:1000)).
In Excel/VBA one can easily accomplish this by writing e.g. =STDEV(A1:A100) in one cell and then filling down in one go. Now my question is, how could one accomplish this efficiently in Matlab without having to use any expensive for-loops.
edit: Is it also possible to do this for a list of time series, e.g. when A has dimensions 1000 x 4 (i.e. 4 time series of length 1000)? The output matrix should then have dimensions 901 x 4.
Note: For the fastest solution see Luis Mendo's answer
So firstly using a for loop for this (especially if those are your actual dimensions) really isn't going to be expensive. Unless you're using a very old version of Matlab, the JIT compiler (together with pre-allocation of course) makes for loops inexpensive.
Secondly - have you tried for loops yet? Because you should really try out the naive implementation first before you start optimizing prematurely.
Thirdly - arrayfun can make this a one liner but it is basically just a for loop with extra overhead and very likely to be slower than a for loop if speed really is your concern.
Finally some code:
n = 1000;
A = rand(n,1);
l = 100;
for loop (hardly bulky, likely to be efficient):
S = zeros(n-l+1,1); %//Pre-allocation of memory like this is essential for efficiency!
for t = 1:(n-l+1)
S(t) = std(A(t:(t+l-1)));
end
A vectorized (memory in-efficient!) solution:
[X,Y] = meshgrid(1:l)
S = std(A(X+Y-1))
A probably better vectorized solution (and a one-liner) but still memory in-efficient:
S = std(A(bsxfun(#plus, 0:l-1, (1:l)')))
Note that with all these methods you can replace std with any function so long as it is applies itself to the columns of the matrix (which is the standard in Matlab)
Going 2D:
To go 2D we need to go 3D
n = 1000;
k = 4;
A = rand(n,k);
l = 100;
ind = bsxfun(#plus, permute(o:n:(k-1)*n, [3,1,2]), bsxfun(#plus, 0:l-1, (1:l)')); %'
S = squeeze(std(A(ind)));
M = squeeze(mean(A(ind)));
%// etc...
OR
[X,Y,Z] = meshgrid(1:l, 1:l, o:n:(k-1)*n);
ind = X+Y+Z-1;
S = squeeze(std(A(ind)))
M = squeeze(mean(A(ind)))
%// etc...
OR
ind = bsxfun(#plus, 0:l-1, (1:l)'); %'
for t = 1:k
S = std(A(ind));
M = mean(A(ind));
%// etc...
end
OR (taken from Luis Mendo's answer - note in his answer he shows a faster alternative to this simple loop)
S = zeros(n-l+1,k);
M = zeros(n-l+1,k);
for t = 1:(n-l+1)
S(t,:) = std(A(k:(k+l-1),:));
M(t,:) = mean(A(k:(k+l-1),:));
%// etc...
end
What you're doing is basically a filter operation.
If you have access to the image processing toolbox,
stdfilt(A,ones(101,1)) %# assumes that data series are in columns
will do the trick (no matter the dimensionality of A). Note that if you also have access to the parallel computing toolbox, you can let filter operations like these run on a GPU, although your problem might be too small to generate noticeable speedups.
To minimize number of operations, you can exploit the fact that the standard deviation can be computed as a difference involving second and first moments,
and moments over a rolling window are obtained efficiently with a cumulative sum (using cumsum):
A = randn(1000,4); %// random data
N = 100; %// window size
c = size(A,2);
A1 = [zeros(1,c); cumsum(A)];
A2 = [zeros(1,c); cumsum(A.^2)];
S = sqrt( (A2(1+N:end,:)-A2(1:end-N,:) ...
- (A1(1+N:end,:)-A1(1:end-N,:)).^2/N) / (N-1) ); %// result
Benchmarking
Here's a comparison against a loop based solution, using timeit. The loop approach is as in Dan's solution but adapted to the 2D case, exploting the fact that std works along each column in a vectorized manner.
%// File loop_approach.m
function S = loop_approach(A,N);
[n, p] = size(A);
S = zeros(n-N+1,p);
for k = 1:(n-N+1)
S(k,:) = std(A(k:(k+N-1),:));
end
%// File bsxfun_approach.m
function S = bsxfun_approach(A,N);
[n, p] = size(A);
ind = bsxfun(#plus, permute(0:n:(p-1)*n, [3,1,2]), bsxfun(#plus, 0:n-N, (1:N).')); %'
S = squeeze(std(A(ind)));
%// File cumsum_approach.m
function S = cumsum_approach(A,N);
c = size(A,2);
A1 = [zeros(1,c); cumsum(A)];
A2 = [zeros(1,c); cumsum(A.^2)];
S = sqrt( (A2(1+N:end,:)-A2(1:end-N,:) ...
- (A1(1+N:end,:)-A1(1:end-N,:)).^2/N) / (N-1) );
%// Benchmarking code
clear all
A = randn(1000,4); %// Or A = randn(1000,1);
N = 100;
t_loop = timeit(#() loop_approach(A,N));
t_bsxfun = timeit(#() bsxfun_approach(A,N));
t_cumsum = timeit(#() cumsum_approach(A,N));
disp(' ')
disp(['loop approach: ' num2str(t_loop)])
disp(['bsxfun approach: ' num2str(t_bsxfun)])
disp(['cumsum approach: ' num2str(t_cumsum)])
disp(' ')
disp(['bsxfun/loop gain factor: ' num2str(t_loop/t_bsxfun)])
disp(['cumsum/loop gain factor: ' num2str(t_loop/t_cumsum)])
Results
I'm using Matlab R2014b, Windows 7 64 bits, dual core processor, 4 GB RAM:
4-column case:
loop approach: 0.092035
bsxfun approach: 0.023535
cumsum approach: 0.0002338
bsxfun/loop gain factor: 3.9106
cumsum/loop gain factor: 393.6526
Single-column case:
loop approach: 0.085618
bsxfun approach: 0.0040495
cumsum approach: 8.3642e-05
bsxfun/loop gain factor: 21.1431
cumsum/loop gain factor: 1023.6236
So the cumsum-based approach seems to be the fastest: about 400 times faster than the loop in the 4-column case, and 1000 times faster in the single-column case.
Several functions can do the job efficiently in Matlab.
On one side, you can use functions such as colfilt or nlfilter, which performs computations on sliding blocks. colfilt is way more efficient than nlfilter, but can be used only if the order of the elements inside a block does not matter. Here is how to use it on your data:
S = colfilt(A, [100,1], 'sliding', #std);
or
S = nlfilter(A, [100,1], #std);
On your example, you can clearly see the difference of performance. But there is a trick : both functions pad the input array so that the output vector has the same size as the input array. To get only the relevant part of the output vector, you need to skip the first floor((100-1)/2) = 49 first elements, and take 1000-100+1 values.
S(50:end-50)
But there is also another solution, close to colfilt, more efficient. colfilt calls col2im to reshape the input vector into a matrix on which it applies the given function on each distinct column. This transforms your input vector of size [1000,1] into a matrix of size [100,901]. But colfilt pads the input array with 0 or 1, and you don't need it. So you can run colfilt without the padding step, then apply std on each column and this is easy because std applied on a matrix returns a row vector of the stds of the columns. Finally, transpose it to get a column vector if you want. In brief and in one line:
S = std(im2col(X,[100 1],'sliding')).';
Remark: if you want to apply a more complex function, see the code of colfilt, line 144 and 147 (for v2013b).
If your concern is speed of the for loop, you can greatly reduce the number of loop iteration by folding your vector into an array (using reshape) with the columns having the number of element you want to apply your function on.
This will let Matlab and the JIT perform the optimization (and in most case they do that way better than us) by calculating your function on each column of your array.
You then reshape an offseted version of your array and do the same. You will still need a loop but the number of iteration will only be l (so 100 in your example case), instead of n-l+1=901 in a classic for loop (one window at a time).
When you're done, you reshape the array of result in a vector, then you still need to calculate manually the last window, but overall it is still much faster.
Taking the same input notation than Dan:
n = 1000;
A = rand(n,1);
l = 100;
It will take this shape:
width = (n/l)-1 ; %// width of each line in the temporary result array
tmp = zeros( l , width ) ; %// preallocation never hurts
for k = 1:l
tmp(k,:) = std( reshape( A(k:end-l+k-1) , l , [] ) ) ; %// calculate your stat on the array (reshaped vector)
end
S2 = [tmp(:) ; std( A(end-l+1:end) ) ] ; %// "unfold" your results then add the last window calculation
If I tic ... toc the complete loop version and the folded one, I obtain this averaged results:
Elapsed time is 0.057190 seconds. %// windows by window FOR loop
Elapsed time is 0.016345 seconds. %// "Folded" FOR loop
I know tic/toc is not the way to go for perfect timing but I don't have the timeit function on my matlab version. Besides, the difference is significant enough to show that there is an improvement (albeit not precisely quantifiable by this method). I removed the first run of course and I checked that the results are consistent with different matrix sizes.
Now regarding your "one liner" request, I suggest your wrap this code into a function like so:
function out = foldfunction( func , vec , nPts )
n = length( vec ) ;
width = (n/nPts)-1 ;
tmp = zeros( nPts , width ) ;
for k = 1:nPts
tmp(k,:) = func( reshape( vec(k:end-nPts+k-1) , nPts , [] ) ) ;
end
out = [tmp(:) ; func( vec(end-nPts+1:end) ) ] ;
Which in your main code allows you to call it in one line:
S = foldfunction( #std , A , l ) ;
The other great benefit of this format, is that you can use the very same sub function for other statistical function. For example, if you want the "mean" of your windows, you call the same just changing the func argument:
S = foldfunction( #mean , A , l ) ;
Only restriction, as it is it only works for vector as input, but with a bit of rework it could be made to take arrays as input too.

tensile tests in matlab

The problem says:
Three tensile tests were carried out on an aluminum bar. In each test the strain was measured at the same values of stress. The results were
where the units of strain are mm/m.Use linear regression to estimate the modulus of elasticity of the bar (modulus of elasticity = stress/strain).
I used this program for this problem:
function coeff = polynFit(xData,yData,m)
% Returns the coefficients of the polynomial
% a(1)*x^(m-1) + a(2)*x^(m-2) + ... + a(m)
% that fits the data points in the least squares sense.
% USAGE: coeff = polynFit(xData,yData,m)
% xData = x-coordinates of data points.
% yData = y-coordinates of data points.
A = zeros(m); b = zeros(m,1); s = zeros(2*m-1,1);
for i = 1:length(xData)
temp = yData(i);
for j = 1:m
b(j) = b(j) + temp;
temp = temp*xData(i);
end
temp = 1;
for j = 1:2*m-1
s(j) = s(j) + temp;
temp = temp*xData(i);
end
end
for i = 1:m
for j = 1:m
A(i,j) = s(i+j-1);
end
end
% Rearrange coefficients so that coefficient
% of x^(m-1) is first
coeff = flipdim(gaussPiv(A,b),1);
The problem is solved without a program as follows
MY ATTEMPT
T=[34.5,69,103.5,138];
D1=[.46,.95,1.48,1.93];
D2=[.34,1.02,1.51,2.09];
D3=[.73,1.1,1.62,2.12];
Mod1=T./D1;
Mod2=T./D2;
Mod3=T./D3;
xData=T;
yData1=Mod1;
yData2=Mod2;
yData3=Mod3;
coeff1 = polynFit(xData,yData1,2);
coeff2 = polynFit(xData,yData2,2);
coeff3 = polynFit(xData,yData3,2);
x1=(0:.5:190);
y1=coeff1(2)+coeff1(1)*x1;
subplot(1,3,1);
plot(x1,y1,xData,yData1,'o');
y2=coeff2(2)+coeff2(1)*x1;
subplot(1,3,2);
plot(x1,y2,xData,yData2,'o');
y3=coeff3(2)+coeff3(1)*x1;
subplot(1,3,3);
plot(x1,y3,xData,yData3,'o');
What do I have to do to get this result?
As a general advice:
avoid for loops wherever possible.
avoid using i and j as variable names, as they are Matlab built-in names for the imaginary unit (I really hope that disappears in a future release...)
Due to m being an interpreted language, for-loops can be very slow compared to their compiled alternatives. Matlab is named MATtrix LABoratory, meaning it is highly optimized for matrix/array operations. Usually, when there is an operation that cannot be done without a loop, Matlab has a built-in function for it that runs way way faster than a for-loop in Matlab ever will. For example: computing the mean of elements in an array: mean(x). The sum of all elements in an array: sum(x). The standard deviation of elements in an array: std(x). etc. Matlab's power comes from these built-in functions.
So, your problem. You have a linear regression problem. The easiest way in Matlab to solve this problem is this:
%# your data
stress = [ %# in Pa
34.5 69 103.5 138] * 1e6;
strain = [ %# in m/m
0.46 0.95 1.48 1.93
0.34 1.02 1.51 2.09
0.73 1.10 1.62 2.12]' * 1e-3;
%# make linear array for the data
yy = strain(:);
xx = repmat(stress(:), size(strain,2),1);
%# re-formulate the problem into linear system Ax = b
A = [xx ones(size(xx))];
b = yy;
%# solve the linear system
x = A\b;
%# modulus of elasticity is coefficient
%# NOTE: y-offset is relatively small and can be ignored)
E = 1/x(1)
What you did in the function polynFit is done by A\b, but the \-operator is capable of doing it way faster, way more robust and way more flexible than what you tried to do yourself. I'm not saying you shouldn't try to make these thing yourself (please keep on doing that, you learn a lot from it!), I'm saying that for the "real" results, always use the \-operator (and check your own results against it as well).
The backslash operator (type help \ on the command prompt) is extremely useful in many situations, and I advise you learn it and learn it well.
I leave you with this: here's how I would write your polynFit function:
function coeff = polynFit(X,Y,m)
if numel(X) ~= numel(X)
error('polynFit:size_mismathc',...
'number of elements in matrices X and Y must be equal.');
end
%# bad condition number, rank errors, etc. taken care of by \
coeff = bsxfun(#power, X(:), m:-1:0) \ Y(:);
end
I leave it up to you to figure out how this works.

pdist2 equivalent in MATLAB version 7

I need to calculate the euclidean distance between 2 matrices in matlab. Currently I am using bsxfun and calculating the distance as below( i am attaching a snippet of the code ):
for i=1:4754
test_data=fea_test(i,:);
d=sqrt(sum(bsxfun(#minus, test_data, fea_train).^2, 2));
end
Size of fea_test is 4754x1024 and fea_train is 6800x1024 , using his for loop is causing the execution of the for to take approximately 12 minutes which I think is too high.
Is there a way to calculate the euclidean distance between both the matrices faster?
I was told that by removing unnecessary for loops I can reduce the execution time. I also know that pdist2 can help reduce the time for calculation but since I am using version 7. of matlab I do not have the pdist2 function. Upgrade is not an option.
Any help.
Regards,
Bhavya
Here is vectorized implementation for computing the euclidean distance that is much faster than what you have (even significantly faster than PDIST2 on my machine):
D = sqrt( bsxfun(#plus,sum(A.^2,2),sum(B.^2,2)') - 2*(A*B') );
It is based on the fact that: ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v
Consider below a crude comparison between the two methods:
A = rand(4754,1024);
B = rand(6800,1024);
tic
D = pdist2(A,B,'euclidean');
toc
tic
DD = sqrt( bsxfun(#plus,sum(A.^2,2),sum(B.^2,2)') - 2*(A*B') );
toc
On my WinXP laptop running R2011b, we can see a 10x times improvement in time:
Elapsed time is 70.939146 seconds. %# PDIST2
Elapsed time is 7.879438 seconds. %# vectorized solution
You should be aware that it does not give exactly the same results as PDIST2 down to the smallest precision.. By comparing the results, you will see small differences (usually close to eps the floating-point relative accuracy):
>> max( abs(D(:)-DD(:)) )
ans =
1.0658e-013
On a side note, I've collected around 10 different implementations (some are just small variations of each other) for this distance computation, and have been comparing them. You would be surprised how fast simple loops can be (thanks to the JIT), compared to other vectorized solutions...
You could fully vectorize the calculation by repeating the rows of fea_test 6800 times, and of fea_train 4754 times, like this:
rA = size(fea_test,1);
rB = size(fea_train,1);
[I,J]=ndgrid(1:rA,1:rB);
d = zeros(rA,rB);
d(:) = sqrt(sum(fea_test(J(:),:)-fea_train(I(:),:)).^2,2));
However, this would lead to intermediary arrays of size 6800x4754x1024 (*8 bytes for doubles), which will take up ~250GB of RAM. Thus, the full vectorization won't work.
You can, however, reduce the time of the distance calculation by preallocation, and by not calculating the square root before it's necessary:
rA = size(fea_test,1);
rB = size(fea_train,1);
d = zeros(rA,rB);
for i = 1:rA
test_data=fea_test(i,:);
d(i,:)=sum( (test_data(ones(nB,1),:) - fea_train).^2, 2))';
end
d = sqrt(d);
Try this vectorized version, it should be pretty efficient. Edit: just noticed that my answer is similar to #Amro's.
function K = calculateEuclideanDist(P,Q)
% Vectorized method to compute pairwise Euclidean distance
% Returns K(i,j) = sqrt((P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:)))
[nP, d] = size(P);
[nQ, d] = size(Q);
pmag = sum(P .* P, 2);
qmag = sum(Q .* Q, 2);
K = sqrt(ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q');
end