Fast technique for normalizing a matrix in MATLAB - matlab

I want to normalise each column of a matrix in Matlab. I have tried two implementations:
Option A:
mx=max(x);
mn=min(x);
mmd=mx-mn;
for i=1:size(x,1)
xn(i,:)=((x(i,:)-mn+(mmd==0))./(mmd+(mmd==0)*2))*2-1;
end
Option B:
mn=mean(x);
sdx=std(x);
for i=1:size(x,1)
xn(i,:)=(x(i,:)-mn)./(sdx+(sdx==0));
end
However, these options take too much time for my data, e.g. 3-4 seconds on a 5000x53 matrix. Thus, is there any better solution?

Use bsxfun instead of the loop. This may be a bit faster; however, it may also use more memory (which may be an issue in your case; if you're paging, everything'll be really slow).
To normalize with mean and std, you'd write
mn = mean(x);
sd = std(x);
sd(sd==0) = 1;
xn = bsxfun(@minus,x,mn);
xn = bsxfun(@rdivide,xn,sd);
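The same bsxfun pattern should carry over to the min-max scaling in Option A of the question. This is a minimal sketch, assuming constant columns are to be mapped to 0 (which is what the (mmd==0) terms in Option A appear to do). The sample matrix and the names span and isConst are illustrative only:

```matlab
% Vectorized min-max scaling of every column to the range [-1, 1]
x = [1 5 3; 2 5 9; 4 5 6];      % illustrative data; column 2 is constant

mn      = min(x);               % 1-by-n row of column minima
span    = max(x) - mn;          % 1-by-n row of column ranges
isConst = (span == 0);          % columns whose values are all equal
span(isConst) = 1;              % avoid division by zero

% (x - mn) ./ span maps each column to [0, 1]; 2*(...) - 1 maps it to [-1, 1]
xn = 2 * bsxfun(@rdivide, bsxfun(@minus, x, mn), span) - 1;
xn(:, isConst) = 0;             % assumption: constant columns become 0
```

Here the first column [1; 2; 4] becomes [-1; -1/3; 1] and the constant second column becomes all zeros.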

Remember, in MATLAB, vectorizing = speed.
If A is an M x N matrix,
A = rand(m,n);
minA = repmat(min(A), [size(A, 1), 1]);
normA = max(A) - min(A); % this is a vector
normA = repmat(normA, [size(A, 1) 1]); % this makes it a matrix
% of the same size as A
normalizedA = (A - minA)./normA; % your normalized matrix

Note: I am not providing a new answer; I am only comparing the proposed ones.
Option A: Using bsxfun()
function xn = normalizeBsxfun(x)
mn = mean(x);
sd = std(x);
sd(sd==0) = eps;
xn = bsxfun(@minus,x,mn);
xn = bsxfun(@rdivide,xn,sd);
end
Option B: Using a for-loop
function xn = normalizeLoop(x)
xn = zeros(size(x));
for ii=1:size(x,2)
xaux = x(:,ii);
xn(:,ii) = (xaux - mean(xaux))./std(xaux);
end
end
We compare both implementations for different matrix sizes:
expList = 2:0.5:5;
for ii=1:numel(expList)
expNum = round(10^expList(ii));
x = rand(expNum,expNum);
tic;
xn = normalizeBsxfun(x);
ts(ii) = toc;
tic;
xn = normalizeLoop(x);
tl(ii) = toc;
end
figure;
hold on;
plot(round(10.^expList),ts,'b');
plot(round(10.^expList),tl,'r');
legend('bsxfun','loop');
set(gca,'YScale','log')
The results show that for small matrices, bsxfun is faster. But the difference is negligible for larger matrices, as was also found in another post.
The x-axis is the square root of the number of matrix elements, while the y-axis is the computation time in seconds.

Let X be an m x n matrix that you want to normalize column-wise.
The following MATLAB code does it:
m = size(X,1);
XMean = repmat(mean(X),m,1);
XStd = repmat(std(X),m,1);
X_norm = (X - XMean)./(XStd);
The element wise ./ operator is explained here: http://www.mathworks.in/help/matlab/ref/arithmeticoperators.html
Note: As op mentioned, this is simply a faster solution and performs the same task as looping through the matrix. The underlying implementation of this inbuilt function makes it work faster

Note: This code works in Octave and MATLAB versions R2016b or higher.
function X_norm = normalizeMatrix(X)
mu = mean(X); %mean
sigma = std(X); %standard deviation
X_norm = (X - mu)./sigma;
end
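If some columns may be constant, sigma contains zeros and the division above yields NaN. This is a minimal sketch of the same computation with a guard, assuming constant columns should simply come out as zeros. The sample matrix is made up for illustration:

```matlab
% Z-score normalization with implicit expansion and a zero-variance guard
X = [1 2 7; 3 2 8; 5 2 9];      % illustrative data; column 2 is constant

mu    = mean(X);                % 1-by-n column means
sigma = std(X);                 % 1-by-n column standard deviations (N-1 normalization)
sigma(sigma == 0) = 1;          % constant columns: divide by 1 instead of 0

X_norm = (X - mu) ./ sigma;     % needs R2016b+ or Octave (implicit expansion)
```

For this input the first and third columns become [-1; 0; 1] and the constant column becomes all zeros.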

How about using
normc(X)
that would normalize the matrix X columnwise. You need to include the Neural Network Toolbox in your install though.

How about this?
A = [7, 2, 6; 3, 8, 4]; % a 2x3 matrix
Asum = sum(A); % sum the columns
Anorm = A./Asum(ones(size(A, 1), 1), :); % normalise the columns

Related

Vectorization and Nested Matrix Multiplication

Here is the original code:
K = zeros(N*N)
for a=1:N
for i=1:I
for j=1:J
M = kron(X(:,:,a).',Y(:,:,a,i,j));
%A function that essentially adds M to K.
end
end
end
The goal is to vectorize the Kronecker product (kron) calls. My intuition is to think of X and Y as containers of matrices (for reference, the slices of X and Y being fed to kron are square 7x7 matrices). Under this container scheme, X appears as a 1-D container and Y as a 3-D container. My next guess was to reshape Y into a 2-D container, or better yet a 1-D container, and then do element-wise multiplication of X and Y. The questions are: how would I do this reshaping in a way that preserves the trace of M, and can MATLAB even handle this container idea, or do the containers need to be reshaped further to expose the inner matrix elements?
Approach #1: Matrix multiplication with 6D permute
% Get sizes
[m1,m2,~] = size(X);
[n1,n2,N,n4,n5] = size(Y);
% Lose the third dim from X and Y with matrix-multiplication
parte1 = reshape(permute(Y,[1,2,4,5,3]),[],N)*reshape(X,[],N).';
% Rearrange the leftover dims to bring kron format
parte2 = reshape(parte1,[n1,n2,n4,n5,m1,m2]);
% Lose the dims corresponding to the last two dims coming in from Y, i.e.
% the iterative summation as suggested in the question
out = reshape(permute(sum(sum(parte2,3),4),[1,6,2,5,3,4]),m1*n1,m2*n2)
Approach #2: Simple 7D permute
% Get sizes
[m1,m2,~] = size(X);
[n1,n2,N,n4,n5] = size(Y);
% Perform kron format elementwise multiplication between the first two dims
% of X and Y, keeping the third dim aligned and "pushing out" leftover dims
% from Y to the back
mults = bsxfun(@times,permute(X,[4,2,5,1,3]),permute(Y,[1,6,2,7,3,4,5]));
% Lose the two dims with summation reduction for final output
out = sum(reshape(mults,m1*n1,m2*n2,[]),3);
Verification
Here's a setup for running the original and the proposed approaches -
% Setup inputs
X = rand(10,10,10);
Y = rand(10,10,10,10,10);
% Original approach
[n1,n2,N,I,J] = size(Y);
K = zeros(100);
for a=1:N
for i=1:I
for j=1:J
M = kron(X(:,:,a).',Y(:,:,a,i,j));
K = K + M;
end
end
end
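% Approach #2 (bsxfun with 7D permute), result stored in out1
[m1,m2,~] = size(X);
[n1,n2,N,n4,n5] = size(Y);
mults = bsxfun(@times,permute(X,[4,2,5,1,3]),permute(Y,[1,6,2,7,3,4,5]));
out1 = sum(reshape(mults,m1*n1,m2*n2,[]),3);
% Approach #1 (matrix multiplication with 6D permute), result stored in out2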
[m1,m2,~] = size(X);
[n1,n2,N,n4,n5] = size(Y);
parte1 = reshape(permute(Y,[1,2,4,5,3]),[],N)*reshape(X,[],N).';
parte2 = reshape(parte1,[n1,n2,I,J,m1,m2]);
out2 = reshape(permute(sum(sum(parte2,3),4),[1,6,2,5,3,4]),m1*n1,m2*n2);
After running, we see the max. absolute deviation with the proposed approaches against the original one -
>> error_app1 = max(abs(K(:)-out1(:)))
error_app1 =
1.1369e-12
>> error_app2 = max(abs(K(:)-out2(:)))
error_app2 =
1.1937e-12
Values look good to me!
Benchmarking
Timing these three approaches using the same big dataset as used for verification, we get something like this -
----------------------------- With Loop
Elapsed time is 1.541443 seconds.
----------------------------- With BSXFUN
Elapsed time is 1.283935 seconds.
----------------------------- With MATRIX-MULTIPLICATION
Elapsed time is 0.164312 seconds.
It seems that matrix multiplication does fairly well for datasets of this size!
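Independently of the two approaches above, kron is linear in its second argument, so the sums over i and j can be taken on Y before any Kronecker product is formed, leaving only the loop over a. This small sketch checks that against the triple loop. It assumes (as in the verification code) that the unspecified accumulation step is a plain K = K + M; the sizes and the names Ys and K2 are arbitrary:

```matlab
% Check: sum_{i,j} kron(A, B_ij) equals kron(A, sum_{i,j} B_ij)
X = rand(3,3,4);
Y = rand(3,3,4,2,5);
[n1,n2,N,I,J] = size(Y);
[m1,m2,~] = size(X);

% Reference: the triple loop from the question
K = zeros(m2*n1, m1*n2);
for a = 1:N
    for i = 1:I
        for j = 1:J
            K = K + kron(X(:,:,a).', Y(:,:,a,i,j));
        end
    end
end

% Reduced version: collapse dims 4 and 5 of Y first, then loop over a only
Ys = sum(sum(Y,4),5);               % n1-by-n2-by-N
K2 = zeros(size(K));
for a = 1:N
    K2 = K2 + kron(X(:,:,a).', Ys(:,:,a));
end

err = max(abs(K(:) - K2(:)));       % should be at round-off level
```

This cuts the number of kron calls from N*I*J to N, which may already be enough in cases where the fully vectorized forms use too much memory.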

Vectorizing a weighted sum of matrices in MATLAB

I'm trying to vectorize the following operation in MATLAB, but it's got me stumped. I've learned from experience that there usually is a way, so I'm not giving up just yet. Any help would be appreciated.
I have a collection of m row-vectors each of size n, arranged in an m x n matrix; call it X.
I also have an m-sized vector of weights, w.
I want to compute a weighted sum of the matrices formed by the self outer products of the vectors in X.
Here is a MWE using a for loop:
m = 100;
n = 5;
X = rand(m, n);
w = rand(1, m);
S = zeros(n, n);
for i = 1 : m
S = S + (w(i) * X(i, :)' * X(i, :));
end
S
This is probably the fastest approach:
S = X' * bsxfun(@times, X, w(:));
You could also do
S = squeeze(sum(bsxfun(@times, ...
bsxfun(@times, conj(X), permute(X, [1 3 2])), w(:)), 1));
(or remove the complex conjugate if not needed).
You can employ two approaches here that use one bsxfun call and few permutes and reshapes. The reshaping trick basically allows us to use the efficient matrix multiplication and thus avoid any extra bsxfun call we might have required otherwise.
Approach #1
[m1,n1] = size(X);
XXmult = bsxfun(@times,X,permute(X,[1 3 2])); %// For X(i, :)' * X(i, :) step
S = reshape(reshape(permute(XXmult,[2 3 1]),[],m1)*w(:),n1,[]) %// multiply weights w
Approach #2
[m1,n1] = size(X);
XXmult = bsxfun(@times,permute(X,[2 3 1]),permute(X,[3 2 1]));
S = reshape(reshape(XXmult,[],m1)*w(:),n1,[])
Shortest answer, and probably fastest:
S = X'*diag(w)*X
Been using it for an unscented Kalman filter, works great.
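A quick way to convince yourself that these one-liners agree with the loop is to compare them on random data. The sketch below does that for the bsxfun and diag variants; the names S_loop, S_bsx and S_diag are illustrative:

```matlab
% Compare the loop against two vectorized forms of sum_i w(i) * x_i' * x_i
m = 100;
n = 5;
X = rand(m, n);
w = rand(1, m);

% Reference loop from the question
S_loop = zeros(n, n);
for i = 1:m
    S_loop = S_loop + w(i) * (X(i, :)' * X(i, :));
end

S_bsx  = X' * bsxfun(@times, X, w(:));   % scale row i of X by w(i), then multiply
S_diag = X' * diag(w) * X;               % same product via an m-by-m diagonal matrix

err_bsx  = max(abs(S_loop(:) - S_bsx(:)));
err_diag = max(abs(S_loop(:) - S_diag(:)));
```

Note that diag(w) builds a full m x m matrix, so for large m the bsxfun form is likely the more memory-friendly of the two.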

Estimated mean and covariance calculation in matlab using maximum likelihood method

I am trying to calculate the estimated mean and covariance using the maximum likelihood method in MATLAB. I am new to MATLAB and have run into a problem that I would like to clear up here.
I am using following code:
clear all;
%Visualization of 2D Gaussian Distribution
% Mean of the distribution
mu = [1 -1];
% Covariance matrix (Must be symmetric)
sigma = [ 2 1 ; 1 3 ];
% Samples
X = mvnrnd(mu,sigma,1000);
analytical_mean = mean(X);
analytical_cov = cov(X);
N = size(X,1);
estimated_mean = sum(X)/N;
summation = 0;
for i=1:N,
row = X(i,:);
tmp1= (row - estimated_mean);
tmp2 = tmp1';
summation = summation + tmp2;
end
covar = summation/N;
Now analytical_mean and estimated_mean come out equal, but my calculated covariance covar does not come out as a matrix like analytical_cov. How do I calculate covar correctly?
I am using below equations:
you can try this instead
[m,n] = size(X);
estimated_mean = sum(X)/m;
tmp=zeros(m,n);
for i=1:n
tmp(:,i)= ((X(:,i) - estimated_mean(i)));
end
covar = (tmp.'*tmp)/m;
I think you want
tmp2 = tmp1'*tmp1;
instead of
tmp2 = tmp1'
That change makes covar pretty close for me:
covar =
1.9042 0.9534
0.9534 3.0195
The clue was the dimensions of covar from your code: it should have been 2-by-2, but yours was 2-by-1.
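The loop can also be dropped entirely by centering X once and forming a single matrix product. This is a minimal sketch, assuming the maximum likelihood estimate (normalization by N rather than N-1). The samples are drawn with randn and chol instead of mvnrnd so that no toolbox is needed, and the check relies on cov normalizing by N-1 by default:

```matlab
% Loop-free maximum likelihood estimates of mean and covariance
mu    = [1 -1];
sigma = [2 1; 1 3];
N     = 1000;
X = bsxfun(@plus, randn(N, 2) * chol(sigma), mu);   % N samples, one per row

estimated_mean = sum(X) / N;                  % ML estimate of the mean
Xc    = bsxfun(@minus, X, estimated_mean);    % centered samples
covar = (Xc' * Xc) / N;                       % ML estimate of the covariance, 2-by-2

% cov() divides by N-1, so rescale it before comparing
err = max(max(abs(covar - cov(X) * (N - 1) / N)));
```

The product Xc' * Xc is the sum over all rows of the outer products tmp1' * tmp1, which is what the corrected loop accumulates.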

Vectorize octave/matlab codes

The following is Octave code (part of k-means):
centroidSum = zeros(K);
valueSum = zeros(K, n);
for i = 1 : m
for j = 1 : K
if(idx(i) == j)
centroidSum(j) = centroidSum(j) + 1;
valueSum(j, :) = valueSum(j, :) + X(i, :);
end
end
end
The code works; is it possible to vectorize it?
It is easy to vectorize code without an if statement,
but how can we vectorize code with an if statement?
I assume the purpose of the code is to compute the centroids of subsets of a set of m data points in an n-dimensional space, where the points are stored in a matrix X (points x coordinates) and the vector idx specifies for each data point the subset (1 ... K) the point belongs to. Then a partial vectorization is:
centroid = zeros(K, n)
for j = 1 : K
centroid(j, :) = mean(X(idx == j, :));
end
The if is eliminated by indexing, in particular logical indexing: idx == j gives a boolean array which indicates which data points belong to subset j.
I think it might be possible to get rid of the second for-loop, too, but this would result in very convoluted, unintelligible code.
Brief introduction and solution code
This could be one fully vectorized approach based on -
accumarray: For accumulating summations as done for calculating valueSum. This also introduces a technique for using accumarray on a 2D matrix along a certain direction, which isn't possible in a straightforward manner with it.
bsxfun: For calculating linear indices across all columns for matching row indices from idx.
Here's the implementation -
%// Store no. of columns in X for frequent usage later on
ncols = size(X,2);
%// Find indices in idx that are within [1:k] range, call them as labels
%// Also, find their locations in that range array, call those as pos
[pos,id] = ismember(idx,1:K);
labels = id(pos);
%// OR with bsxfun: [pos,labels] = find(bsxfun(@eq,idx(:),1:K));
%// Find all labels, i.e. across all columns of X
all_labels = bsxfun(@plus,labels(:),[0:ncols-1]*K);
%// Get truncated X corresponding to all indices matches across all columns
X_cut = X(pos,:);
%// Accumulate summations within each column based on the labels.
%// Note that accumarray doesn't accept matrices, so we were required
%// to create all_labels that had same labels within each column and
%// offsetted at constant intervals from consecutive columns
acc1 = accumarray(all_labels(:),X_cut(:));
%// Regularise accumulated array and reshape back to a 2D array version
acc1_reg2D = [acc1 ; zeros(K*ncols - numel(acc1),1)];
valueSum = reshape(acc1_reg2D,[],ncols);
centroidSum = histc(labels,1:K); %// Get labels counts as centroid sums
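accumarray also accepts a two-column subscript matrix, which gives a somewhat shorter route to the same sums: each element of X is tagged with its cluster label (row) and its column. This is a minimal sketch on made-up data, assuming every entry of idx already lies in 1..K; the names rows and cols are illustrative:

```matlab
% Per-cluster column sums with a two-column subscript matrix
X   = [1 2; 3 4; 5 6; 7 8];      % 4 points in 2-D (made-up data)
idx = [1; 2; 1; 2];              % cluster label of each point
K   = 3;                         % cluster 3 is deliberately empty
[m, n] = size(X);

rows = repmat(idx(:), n, 1);         % cluster label for every element of X(:)
cols = kron((1:n)', ones(m, 1));     % column index for every element of X(:)

valueSum    = accumarray([rows cols], X(:), [K n]);   % K-by-n sums
centroidSum = accumarray(idx(:), 1, [K 1]);           % K-by-1 counts
```

Passing the output size [K n] explicitly means empty clusters show up as rows of zeros, so no padding step is needed.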
Benchmarking code
%// Datasize parameters
K = 5000;
n = 5000;
m = 5000;
idx = randi(9,1,m);
X = rand(m,n);
disp('----------------------------- With Original Approach')
tic
centroidSum1 = zeros(K,1);
valueSum1 = zeros(K, n);
for i = 1 : m
for j = 1 : K
if(idx(i) == j)
centroidSum1(j) = centroidSum1(j) + 1;
valueSum1(j, :) = valueSum1(j, :) + X(i, :);
end
end
end
toc, clear valueSum1 centroidSum1
disp('----------------------------- With Proposed Approach')
tic
%// ... Code from the earlier section
toc
Runtime results
----------------------------- With Original Approach
Elapsed time is 1.235412 seconds.
----------------------------- With Proposed Approach
Elapsed time is 0.379133 seconds.
Not sure about its runtime performance but here's a non-convoluted vectorized implementation:
b = idx == 1:K;
centroids = (b' * X) ./ sum(b)';
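To see what this does, here is the same idea on a tiny made-up data set, written with bsxfun so that it does not depend on implicit expansion. It assumes idx holds labels in 1..K; a cluster with no members would produce NaN from the 0/0 division:

```matlab
% Centroids via a membership (one-hot) matrix
X   = [1 2; 3 4; 5 6; 7 8];          % 4 points in 2-D (made-up data)
idx = [1; 2; 1; 2];                  % cluster label of each point
K   = 2;

b = bsxfun(@eq, idx(:), 1:K);        % m-by-K logical: b(i,j) is true if point i is in cluster j
counts   = sum(b, 1)';               % K-by-1 number of points per cluster
valueSum = double(b)' * X;           % K-by-n sum of the points in each cluster
centroids = bsxfun(@rdivide, valueSum, counts);   % K-by-n cluster means
```

Row j of double(b)' picks out exactly the points with label j, so the matrix product replaces both loops and the if.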
Vectorizing the calculation makes a huge difference in performance. Benchmarking
The original code,
The partial vectorization from A. Donda and
The full vectorization from Tom,
gave me the following results:
Original Code: Elapsed time is 1.327877 seconds.
Partial Vectorization: Elapsed time is 0.630767 seconds.
Full Vectorization: Elapsed time is 0.021129 seconds.
Benchmarking code here:
%// Datasize parameters
K = 5000;
n = 5000;
m = 5000;
idx = randi(9,1,m);
X = rand(m,n);
fprintf('\nOriginal Code: ')
tic
centroidSum1 = zeros(K,1);
valueSum1 = zeros(K, n);
for i = 1 : m
for j = 1 : K
if(idx(i) == j)
centroidSum1(j) = centroidSum1(j) + 1;
valueSum1(j, :) = valueSum1(j, :) + X(i, :);
end
end
end
centroids = valueSum1 ./ centroidSum1;
toc, clear valueSum1 centroidSum1 centroids
fprintf('\nPartial Vectorization: ')
tic
centroids = zeros(K,n);
for k = 1:K
centroids(k,:) = mean( X(idx == k, :) );
end
toc, clear centroids
fprintf('\nFull Vectorization: ')
tic
centroids = zeros(K,n);
b = idx(:) == 1:K; %// m-by-K membership matrix (idx forced to a column)
centroids = (b' * X) ./ sum(b)';
toc
Note, I added an extra line to the original code to element-wise divide valueSum1 by centroidSum1 to make the output of each type of code the same.
Finally, I know this isn't strictly an "answer", however I don't have enough reputation to add a comment, and I thought the benchmarking figures were useful to anyone who is learning MATLAB (like myself) and needs some extra motivation to master vectorization.

How to Build a Distance Matrix without a Loop (Vectorization)?

I have many points and I want to build a distance matrix, i.e. the distance from every point to all the other points, but I don't want to use a loop because it takes too much time.
Is there a better way to build this matrix?
This is my loop; for a set setl of size 10000x3 this method takes a lot of time:
for i=1:size(setl,1)
for j=1:size(setl,1)
dist = sqrt((xl(i)-xl(j))^2+(yl(i)-yl(j))^2+...
(zl(i)-zl(j))^2);
distanceMatrix(i,j) = dist;
end
end
How about using some linear algebra? The distance of two points can be computed from the inner product of their position vectors,
$D(x, y) = \lVert y - x \rVert = \sqrt{x^T x + y^T y - 2\,x^T y}$,
and the inner product for all pairs of points can be obtained through a simple matrix operation.
x = [xl(:)'; yl(:)'; zl(:)'];
IP = x' * x;
d = sqrt(bsxfun(@plus, diag(IP), diag(IP)') - 2 * IP);
For 10000 points, I get the following timing results:
ahmad's loop + shoelzer's preallocation: 7.8 seconds
Dan's vectorized indices: 5.3 seconds
Mohsen's bsxfun: 1.5 seconds
my solution: 1.3 seconds
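One caveat with this formula is that round-off can make some of the squared distances (typically on the diagonal) very slightly negative, which gives complex results under sqrt. This is a minimal sketch that clips them at zero first, using three made-up points with known distances 5, 12 and 13. Here the points are stored one per row in P, and the names G, sq and D2 are illustrative:

```matlab
% Distance matrix from the Gram matrix, with a guard against round-off
P = [0 0 0; 3 4 0; 0 0 12];          % three points, one per row

G  = P * P';                         % inner products of all pairs of points
sq = diag(G);                        % squared norms of the points (column vector)
D2 = bsxfun(@plus, sq, sq') - 2 * G; % squared distances |x|^2 + |y|^2 - 2*x'*y
D2 = max(D2, 0);                     % clip tiny negative values caused by round-off
D  = sqrt(D2);                       % 3-by-3 symmetric distance matrix
```

For these points D comes out as [0 5 12; 5 0 13; 12 13 0].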
You can use bsxfun which is generally a faster solution:
s = [xl(:) yl(:) zl(:)];
d = sqrt(sum(bsxfun(@minus, permute(s, [1 3 2]), permute(s, [3 1 2])).^2,3));
You can do this fully vectorized like so:
n = numel(xl);
[X, Y] = meshgrid(1:n,1:n);
Ix = X(:);
Iy = Y(:);
reshape(sqrt((xl(Ix)-xl(Iy)).^2+(yl(Ix)-yl(Iy)).^2+(zl(Ix)-zl(Iy)).^2), n, n);
If you look at Ix and Iy (try it for like a 3x3 dataset), they make every combination of linear indexes possible for each of your matrices. Now you can just do each subtraction in one shot!
However, mixing the suggestions of shoelzer and Jost will give you an almost identical performance boost:
n = 50;
xl = rand(n,1);
yl = rand(n,1);
zl = rand(n,1);
tic
for t = 1:100
distanceMatrix = zeros(n); %// Preallocation
for i=1:n
for j=min(i+1,n):n %// Taking advantage of symmetry
distanceMatrix(i,j) = sqrt((xl(i)-xl(j))^2+(yl(i)-yl(j))^2+(zl(i)-zl(j))^2);
end
end
d1 = distanceMatrix + distanceMatrix'; %'
end
toc
%// Vectorized solution that creates linear indices using meshgrid
tic
for t = 1:100
[X, Y] = meshgrid(1:n,1:n);
Ix = X(:);
Iy = Y(:);
d2 = reshape(sqrt((xl(Ix)-xl(Iy)).^2+(yl(Ix)-yl(Iy)).^2+(zl(Ix)-zl(Iy)).^2), n, n);
end
toc
Returns:
Elapsed time is 0.023332 seconds.
Elapsed time is 0.024454 seconds.
But if I change n to 500 then I get
Elapsed time is 1.227956 seconds.
Elapsed time is 2.030925 seconds.
Which just goes to show that you should always benchmark solutions in MATLAB before writing off loops as slow! In this case, depending on the scale of your problem, loops could be significantly faster.
Be sure to preallocate distanceMatrix. Your loops will run much, much faster and vectorization probably isn't needed. Even if you do it, there may not be any further speed increase.
Since R2016b, MATLAB supports implicit expansion (broadcasting); see also the note in the bsxfun() documentation.
Hence the fastest way to compute the distance matrix is:
function [ mDistMat ] = CalcDistanceMatrix( mA, mB )
mDistMat = sum(mA .^ 2).' - (2 * mA.' * mB) + sum(mB .^ 2);
end
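Here the points are stored along the columns of each set. Note that, as written, the expression yields squared Euclidean distances; apply sqrt to the result to get the distances themselves.
In your case mA = mB.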
Have a look at my Calculate Distance Matrix Project.