I want to implement component-wise matrix multiplication in MATLAB, which can be done using numpy.einsum in Python as shown below:
import numpy as np
M = 2
N = 4
I = 2000
J = 300
A = np.random.randn(M, M, I)
B = np.random.randn(M, M, N, J, I)
C = np.random.randn(M, J, I)
# using einsum
D = np.einsum('mki, klnji, lji -> mnji', A, B, C)
# naive for-loop
E = np.zeros((M, N, J, I))
for i in range(I):
    for j in range(J):
        for n in range(N):
            E[:, n, j, i] = A[:, :, i] @ B[:, :, n, j, i] @ C[:, j, i]
print(np.sum(np.abs(D - E)))  # expected to be small
So far I use for-loops over i, j, and n, but I would like to avoid them, at least the loop over n.

Option 1: Calling numpy from MATLAB
Assuming your system is set up according to the documentation, and you have the numpy package installed, you could do (in MATLAB):
np = py.importlib.import_module('numpy');
M = 2;
N = 4;
I = 2000;
J = 300;
A = matpy.mat2nparray( randn(M, M, I) );
B = matpy.mat2nparray( randn(M, M, N, J, I) );
C = matpy.mat2nparray( randn(M, J, I) );
D = matpy.nparray2mat( np.einsum('mki, klnji, lji -> mnji', A, B, C) );
Where matpy can be found here.
Option 2: Native MATLAB
Here the most important part is to get the permutations right, so we need to keep track of our dimensions. We'll be using the following order:
I(1) J(2) K(3) L(4) M(5) N(6)
Now, I'll explain how I got the correct permute order (let's take the example of A): einsum expects the dimension order to be mki, which according to our numbering is 5 3 1. This tells us that the 1st dimension of A needs to be the 5th, the 2nd needs to be 3rd and the 3rd needs to be 1st (in short 1->5, 2->3, 3->1). This also means that the "sourceless dimensions" (meaning those that have no original dimensions becoming them; in this case 2 4 6) should be singleton. Using ipermute this is really simple to write:
pA = ipermute(A, [5,3,1,2,4,6]);
In the above example, 1->5 means we write 5 first, and the same goes for the other two dimensions (yielding [5,3,1]). Then we just add the singletons (2,4,6) at the end to get [5,3,1,2,4,6]. Finally:
A = randn(M, M, I);
B = randn(M, M, N, J, I);
C = randn(M, J, I);
% Reference dim order: I(1) J(2) K(3) L(4) M(5) N(6)
pA = ipermute(A, [5,3,1,2,4,6]); % 1->5, 2->3, 3->1; 2nd, 4th & 6th are singletons
pB = ipermute(B, [3,4,6,2,1,5]); % 1->3, 2->4, 3->6, 4->2, 5->1; 5th is singleton
pC = ipermute(C, [4,2,1,3,5,6]); % 1->4, 2->2, 3->1; 3rd, 5th & 6th are singletons
pD = sum( ...
  permute(pA .* pB .* pC, [5,6,2,1,3,4]), ... 1->5, 2->6, 3->2, 4->1; 3rd & 4th are singletons
  [5,6]);
(see note regarding sum at the bottom of the post.)
Another way to do it in MATLAB, as mentioned by @AndrasDeak, is the following:
rD = squeeze(sum(reshape(A, [M, M, 1, 1, 1, I]) .* ...
reshape(B, [1, M, M, N, J, I]) .* ...
... % same as: reshape(B, [1, size(B)]) .* ...
... % same as: shiftdim(B,-1) .* ...
reshape(C, [1, 1, M, 1, J, I]), [2, 3]));
See also: squeeze, reshape, permute, ipermute, shiftdim.
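For readers who want to sanity-check the reshape-and-broadcast idea outside MATLAB, here is a minimal NumPy sketch. The sizes are small illustrative values (an assumption, not the sizes from the question). It aligns the three arrays on a common axis order (m, k, l, n, j, i), multiplies them with broadcasting, and sums over k and l; the result should agree with np.einsum up to rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, I, J = 2, 4, 5, 3                      # small illustrative sizes
A = rng.standard_normal((M, M, I))           # axes: m, k, i
B = rng.standard_normal((M, M, N, J, I))     # axes: k, l, n, j, i
C = rng.standard_normal((M, J, I))           # axes: l, j, i

# Common axis order for broadcasting: (m, k, l, n, j, i)
pA = A[:, :, None, None, None, :]            # (m, k, 1, 1, 1, i)
pB = B[None, :, :, :, :, :]                  # (1, k, l, n, j, i)
pC = C[None, None, :, None, :, :]            # (1, 1, l, 1, j, i)

# Multiply with broadcasting, then sum over the contracted axes k and l
D_broadcast = (pA * pB * pC).sum(axis=(1, 2))    # -> (m, n, j, i)

D_einsum = np.einsum('mki,klnji,lji->mnji', A, B, C)
max_err = np.max(np.abs(D_broadcast - D_einsum))
print(max_err)
```

As in the MATLAB versions, the broadcast product materializes a temporary with all six axes, so memory use grows with the product of every dimension.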
Here's a full example that tests whether these methods are equivalent:
function q55913093
M = 2;
N = 4;
I = 2000;
J = 300;
mA = randn(M, M, I);
mB = randn(M, M, N, J, I);
mC = randn(M, J, I);
%% Option 1 - using numpy:
np = py.importlib.import_module('numpy');
A = matpy.mat2nparray( mA );
B = matpy.mat2nparray( mB );
C = matpy.mat2nparray( mC );
D = matpy.nparray2mat( np.einsum('mki, klnji, lji -> mnji', A, B, C) );
%% Option 2 - native MATLAB:
%%% Reference dim order: I(1) J(2) K(3) L(4) M(5) N(6)
pA = ipermute(mA, [5,3,1,2,4,6]); % 1->5, 2->3, 3->1; 2nd, 4th & 6th are singletons
pB = ipermute(mB, [3,4,6,2,1,5]); % 1->3, 2->4, 3->6, 4->2, 5->1; 5th is singleton
pC = ipermute(mC, [4,2,1,3,5,6]); % 1->4, 2->2, 3->1; 3rd, 5th & 6th are singletons
pD = sum( permute( ...
  pA .* pB .* pC, [5,6,2,1,3,4]), ... % 1->5, 2->6, 3->2, 4->1; 3rd & 4th are singletons
  [5,6]);
rD = squeeze(sum(reshape(mA, [M, M, 1, 1, 1, I]) .* ...
reshape(mB, [1, M, M, N, J, I]) .* ...
reshape(mC, [1, 1, M, 1, J, I]), [2, 3]));
%% Comparisons:
sum(abs(pD-D), 'all')
Running the above we get that the results are indeed equivalent:
>> q55913093
ans =
ans =
Note that these two methods of calling sum were introduced in recent releases, so you might need to replace them if your MATLAB is relatively old:
S = sum(A,'all') % can be replaced by ` sum(A(:)) `
S = sum(A,vecdim) % can be replaced by ` sum( sum(A, dim1), dim2) `
As requested in the comments, here's a benchmark comparing the methods:
function t = q55913093_benchmark(M,N,I,J)
if nargin == 0
  M = 2;
  N = 4;
  I = 2000;
  J = 300;
end
% Define the arrays in MATLAB
mA = randn(M, M, I);
mB = randn(M, M, N, J, I);
mC = randn(M, J, I);
% Define the arrays in numpy
np = py.importlib.import_module('numpy');
pA = matpy.mat2nparray( mA );
pB = matpy.mat2nparray( mB );
pC = matpy.mat2nparray( mC );
% Test for equivalence
D = cat(5, M1(), M2(), M3());
assert( sum(abs(D(:,:,:,:,1) - D(:,:,:,:,2)), 'all') < 1E-8 );
assert( isequal (D(:,:,:,:,2), D(:,:,:,:,3)));
% Time
t = [ timeit(@M1,1), timeit(@M2,1), timeit(@M3,1)];
% Nested functions (they share the workspace of the parent function)
function out = M1()
  out = matpy.nparray2mat( np.einsum('mki, klnji, lji -> mnji', pA, pB, pC) );
end
function out = M2()
  out = permute( ...
          sum( ...
            ipermute(mA, [5,3,1,2,4,6]) .* ...
            ipermute(mB, [3,4,6,2,1,5]) .* ...
            ipermute(mC, [4,2,1,3,5,6]), [3,4]...
          ), [5,6,2,1,3,4]...
        );
end
function out = M3()
  out = squeeze(sum(reshape(mA, [M, M, 1, 1, 1, I]) .* ...
                    reshape(mB, [1, M, M, N, J, I]) .* ...
                    reshape(mC, [1, 1, M, 1, J, I]), [2, 3]));
end
end
On my system this results in:
>> q55913093_benchmark
ans =
1.3964 0.1864 0.2428
Which means that the 2nd method is preferable (at least for the default input sizes).


I am looking for a way to find common eigenvectors of two given matrices, so that I can jointly diagonalize them. For this, I found and tried to use qndiag (from https://github.com/pierreablin/qndiag.git ), which is the following function:
function [D, B, infos] = qndiag(C, varargin)
% Joint diagonalization of matrices using the quasi-Newton method
% The algorithm is detailed in:
% P. Ablin, J.F. Cardoso and A. Gramfort. Beyond Pham’s algorithm
% for joint diagonalization. Proc. ESANN 2019.
% https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2019-119.pdf
% https://hal.archives-ouvertes.fr/hal-01936887v1
% https://arxiv.org/abs/1811.11433
% The function takes as input a set of matrices of size `(p, p)`, stored as
% a `(n, p, p)` array, `C`. It outputs a `(p, p)` array, `B`, such that the
% matrices `B * C(i,:,:) * B'` are as diagonal as possible.
% There are several optional parameters which can be provided in the
% varargin variable.
% Optional parameters:
% --------------------
% 'B0' Initial point for the algorithm.
% If absent, a whitener is used.
% 'weights' Weights for each matrix in the loss:
% L = sum(weights * KL(C, C')).
% No weighting (weights = 1) by default.
% 'maxiter' (int) Maximum number of iterations to perform.
% Default : 1000
% 'tol' (float) A positive scalar giving the tolerance at
% which the algorithm is considered to have converged.
% The algorithm stops when |gradient| < tol.
% Default : 1e-6
% lambda_min (float) A positive regularization scalar. Each
% eigenvalue of the Hessian approximation below
% lambda_min is set to lambda_min.
% max_ls_tries (int), Maximum number of line-search tries to
% perform.
% return_B_list (bool) Chooses whether or not to return the list
% of iterates.
% verbose (bool) Prints informations about the state of the
% algorithm if True.
% Returns
% -------
% D : Set of matrices jointly diagonalized
% B : Estimated joint diagonalizer matrix.
% infos : structure containing monitoring informations, containing the times,
% gradient norms and objective values.
% Example:
% --------
% [D, B] = qndiag(C, 'maxiter', 100, 'tol', 1e-5)
% Authors: Pierre Ablin <pierre.ablin@inria.fr>
% Alexandre Gramfort <alexandre.gramfort@inria.fr>
% License: MIT
% First tests
if nargin == 0,
    error('No signal provided');
end
if length(size(C)) ~= 3,
    error('Input C should be 3 dimensional');
end
if ~isa (C, 'double'),
    fprintf ('Converting input data to double...');
    X = double(X);
end
% Default parameters
C_mean = squeeze(mean(C, 1));
[p, d] = eigs(C_mean);
p = fliplr(p);
d = flip(diag(d));
B = p' ./ repmat(sqrt(d), 1, size(p, 1));
max_iter = 1000;
tol = 1e-6;
lambda_min = 1e-4;
max_ls_tries = 10;
return_B_list = false;
verbose = false;
weights = [];
% Read varargin
if mod(length(varargin), 2) == 1,
    error('There should be an even number of optional parameters');
end
for i = 1:2:length(varargin)
    param = lower(varargin{i});
    value = varargin{i + 1};
    switch param
        case 'B0'
            B0 = value;
        case 'max_iter'
            max_iter = value;
        case 'tol'
            tol = value;
        case 'weights'
            weights = value / mean(value(:));
        case 'lambda_min'
            lambda_min = value;
        case 'max_ls_tries'
            max_ls_tries = value;
        case 'return_B_list'
            return_B_list = value;
        case 'verbose'
            verbose = value;
        otherwise
            error(['Parameter ''' param ''' unknown'])
    end
end
[n_samples, n_features, ~] = size(C);
D = transform_set(B, C, false);
current_loss = NaN;
% Monitoring
if return_B_list
    B_list = []
end
t_list = [];
gradient_list = [];
loss_list = [];
if verbose
    print('Running quasi-Newton for joint diagonalization');
    print('iter | obj | gradient');
end
for t=1:max_iter
    if return_B_list
        B_list(k) = B;
    end
    diagonals = zeros(n_samples, n_features);
    for k=1:n_samples
        diagonals(k, :) = diag(squeeze(D(k, :, :)));
    end
    % Gradient
    if isempty(weights)
        G = squeeze(mean(bsxfun(@rdivide, D, ...
            reshape(diagonals, n_samples, n_features, 1)), ...
            1)) - eye(n_features);
    else
        G = squeeze(mean(...
            bsxfun(@times, ...
            reshape(weights, n_samples, 1, 1), ...
            bsxfun(@rdivide, D, ...
            reshape(diagonals, n_samples, n_features, 1))), ...
            1)) - eye(n_features);
    end
    g_norm = norm(G);
    if g_norm < tol
        break
    end
    % Hessian coefficients
    if isempty(weights)
        h = mean(bsxfun(@rdivide, ...
            reshape(diagonals, n_samples, 1, n_features), ...
            reshape(diagonals, n_samples, n_features, 1)), 1);
    else
        h = mean(bsxfun(@times, ...
            reshape(weights, n_samples, 1, 1), ...
            bsxfun(@rdivide, ...
            reshape(diagonals, n_samples, 1, n_features), ...
            reshape(diagonals, n_samples, n_features, 1))), ...
            1);
    end
    h = squeeze(h);
    % Quasi-Newton's direction
    dt = h .* h' - 1.;
    dt(dt < lambda_min) = lambda_min; % Regularize
    direction = -(G .* h' - G') ./ dt;
    % Line search
    [success, new_D, new_B, new_loss, direction] = ...
        linesearch(D, B, direction, current_loss, max_ls_tries, weights);
    D = new_D;
    B = new_B;
    current_loss = new_loss;
    % Monitoring
    gradient_list(t) = g_norm;
    loss_list(t) = current_loss;
    if verbose
        print(sprintf('%d - %.2e - %.2e', t, current_loss, g_norm))
    end
end
infos = struct();
infos.t_list = t_list;
infos.gradient_list = gradient_list;
infos.loss_list = loss_list;
if return_B_list
    infos.B_list = B_list
end
function [op] = transform_set(M, D, diag_only)
[n, p, ~] = size(D);
if ~diag_only
    op = zeros(n, p, p);
    for k=1:length(D)
        op(k, :, :) = M * squeeze(D(k, :, :)) * M';
    end
else
    op = zeros(n, p);
    for k=1:length(D)
        op(k, :) = sum(M .* (squeeze(D(k, :, :)) * M'), 1);
    end
end
function [v] = slogdet(A)
v = log(abs(det(A)));
function [out] = loss(B, D, is_diag, weights)
[n, p, ~] = size(D);
if ~is_diag
    diagonals = zeros(n, p);
    for k=1:n
        diagonals(k, :) = diag(squeeze(D(k, :, :)));
    end
else
    diagonals = D;
end
logdet = -slogdet(B);
if ~isempty(weights)
    diagonals = bsxfun(@times, diagonals, reshape(weights, n, 1));
end
out = logdet + 0.5 * sum(log(diagonals(:))) / n;
function [success, new_D, new_B, new_loss, delta] = linesearch(D, B, direction, current_loss, n_ls_tries, weights)
[n, p, ~] = size(D);
step = 1.;
if current_loss == NaN
    current_loss = loss(B, D, false);
end
success = false;
for n=1:n_ls_tries
    M = eye(p) + step * direction;
    new_D = transform_set(M, D, true);
    new_B = M * B;
    new_loss = loss(new_B, new_D, true, weights);
    if new_loss < current_loss
        success = true;
        break
    end
    step = step / 2;
end
new_D = transform_set(M, D, false);
delta = step * direction;
I use it with the following script, which you can test by downloading the 2 input matrices linked at the bottom of this post:
clc; clear
m=7 % dimension
n=2 % number of matrices
% Load spectro and WL+GCph+XC
FISH_GCsp = load('Fisher_GCsp_flat.txt');
FISH_XC = load('Fisher_XC_GCph_WL_flat.txt');
% Marginalizing over uncommon parameters between the two matrices
COV_GCsp_first = inv(FISH_GCsp);
COV_XC_first = inv(FISH_XC);
COV_GCsp = COV_GCsp_first(1:m,1:m);
COV_XC = COV_XC_first(1:m,1:m);
% Invert to get Fisher matrix
FISH_sp = inv(COV_GCsp);
FISH_xc = inv(COV_XC);
% Drawing a random set of commuting matrices
C(1,:,:) = FISH_sp
C(2,:,:) = FISH_xc
%[D, B] = qndiag(C, 'max_iter', 1e6, 'tol', 1e-6);
[D, B] = qndiag(C);
% Print diagonal matrices
But unfortunately, I get the following error :
Unable to perform assignment because the size of the left side is 1-by-7-by-7 and the size of the
right side is 6-by-6.
Error in qndiag>transform_set (line 224)
op(k, :, :) = M * squeeze(D(k, :, :)) * M';
Error in qndiag (line 128)
D = transform_set(B, C, false);
Error in compute_joint_diagonalization (line 32)
[D, B] = qndiag(C);
I don't understand the purpose of the squeeze function and, most importantly, why eigs returns only 6 values and not 7 as I would expect from my data (the input matrices are 7x7).
What might be causing this array dimension issue, and how can I fix it?
I put the 2 input files available here :
Matrix Fisher_GCsp_flat.txt
Matrix Fisher_XC_GCph_WL_flat.txt
You can test the above code that calls qndiag for these 2 matrices.
Update 1
To allow people interested to test quickly the code, I put a link of the archive:
You just have to untar the archive and run the script compute_joint_diagonalization.m in MATLAB, and you should see the above error (related to the use of the eigs and squeeze functions).
It should help you understand the origin of this issue.
Update 2
If I replace [p, d] = eigs(C_mean) by [p, d] = eigs(C_mean,7) , I get another error :
Index in position 1 exceeds array bounds (must not exceed 2).
Error in qndiag>transform_set (line 224)
op(k, :, :) = M * squeeze(D(k, :, :)) * M';
Error in qndiag (line 128)
D = transform_set(B, C, false);
Error in compute_joint_diagonalization (line 27)
[D, B] = qndiag(C);
However, the dimensions of the 2 matrices used are 7x7 and should be correctly processed with eigs(C_mean,7).
Update 3
The sizes of op, D and M, and the value of length(D), are as follows (followed by the error message):
size(D) =
2 7 7
length(D) =
7
size(M) =
7 7
size(op) =
2 7 7
Index in position 1 exceeds array bounds (must not exceed 2).
Error in qndiag>transform_set (line 231)
op(k, :, :) = M * squeeze(D(k, :, :)) * M';
Error in qndiag (line 128)
D = transform_set(B, C, false);
Error in compute_joint_diagonalization (line 27)
[D, B] = qndiag(C);
Notice that k varies from 1 to length(D)=7.
Is there an issue which could appear with these dimensions ?
From the documentation for eigs:
d = eigs(A) returns a vector of the six largest magnitude eigenvalues of matrix A.
If you want all seven, you need to call d = eigs(A,7) or d = eig(A). For a small matrix (e.g. < 1000 x 1000) it's usually easier to just get all the eigenvalues with eig, rather than get a subset with eigs.
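To illustrate why the full spectrum matters for the whitening step, here is a small NumPy sketch. It mirrors the `B = p' ./ sqrt(d)` initialization from the question on a random symmetric positive definite 7x7 matrix (the matrix is made up for illustration, it is not the asker's data). With all 7 eigenpairs, B is 7x7 and B * C_mean * B' is the identity; with only 6 of them, B would be 6x7 and the later assignment into a 7x7 slot fails.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((7, 7))
C_mean = X @ X.T + 7.0 * np.eye(7)      # symmetric positive definite test matrix

# eigh returns ALL eigenvalues (ascending) and eigenvectors (as columns)
d, p = np.linalg.eigh(C_mean)

# Whitener: scale each eigenvector (row of p.T) by 1/sqrt(eigenvalue)
B = p.T / np.sqrt(d)[:, None]
whitened = B @ C_mean @ B.T             # should be close to the identity

print(d.shape, B.shape)
```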
Edit: Responding to your "Update 3"
for k=1:length(D) should be replaced by for k=1:n. This needs to be changed on two lines. Judging from your error message they are lines 231 and 236.
L = length(X) returns the length of the largest array dimension in X, which in your case is 7, i.e. too high for the first dimension.
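The same pitfall can be reproduced in NumPy, where the analogue of MATLAB's length is the largest entry of the shape. The sketch below assumes a stack of 2 matrices of size 7x7, as in the question; the function name transform_set is reused only for illustration and is not the library's code.

```python
import numpy as np

def transform_set(M, D):
    """Apply M @ D[k] @ M.T to each of the n matrices stacked in D, shape (n, p, p)."""
    n, p, _ = D.shape
    out = np.zeros((n, p, p))
    for k in range(n):              # loop over the FIRST dimension only
        out[k] = M @ D[k] @ M.T
    return out

rng = np.random.default_rng(0)
D = rng.standard_normal((2, 7, 7))
M = np.eye(7)

longest = max(D.shape)      # analogue of MATLAB length(D): 7, not 2
n_matrices = D.shape[0]     # analogue of MATLAB size(D, 1): 2
result = transform_set(M, D)

# Looping up to `longest` instead would index past the first dimension
try:
    D[longest - 1]
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
print(longest, n_matrices, out_of_bounds)
```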

How can I express this large number of computations without for loops?

I work primarily in MATLAB but I think the answer should not be too hard to carry over from one language to another.
I have a multi-dimensional array X with dimensions [n, p, 3].
I would like to calculate the following multi-dimensional array.
T = zeros(p, p, p);
for i = 1:p
    for j = 1:p
        for k = 1:p
            T(i, j, k) = sum(X(:, i, 1) .* X(:, j, 2) .* X(:, k, 3));
        end
    end
end
The sum is of the elements of a length-n vector. Any help is appreciated!
You only need some permuting of dimensions and multiplication with singleton expansion:
T = sum(bsxfun(@times, bsxfun(@times, permute(X(:,:,1), [2 4 5 3 1]), permute(X(:,:,2), [4 2 5 3 1])), permute(X(:,:,3), [4 5 2 3 1])), 5);
From R2016b onwards, this can be written more easily as
T = sum(permute(X(:,:,1), [2 4 5 3 1]) .* permute(X(:,:,2), [4 2 5 3 1]) .* permute(X(:,:,3), [4 5 2 3 1]), 5);
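For comparison, the same "give each factor its own axis, multiply, then sum over n" idea can be sketched in NumPy with broadcasting. This is an illustrative translation under assumed small sizes, not part of the answer above; it is checked against the triple loop from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 6                                  # small illustrative sizes
X = rng.standard_normal((n, p, 3))

# Put i, j, k on separate axes; the summed axis n stays first
X1 = X[:, :, 0][:, :, None, None]             # (n, p, 1, 1)
X2 = X[:, :, 1][:, None, :, None]             # (n, 1, p, 1)
X3 = X[:, :, 2][:, None, None, :]             # (n, 1, 1, p)
T = (X1 * X2 * X3).sum(axis=0)                # (p, p, p)

# Reference: the original triple loop
T_loop = np.zeros((p, p, p))
for i in range(p):
    for j in range(p):
        for k in range(p):
            T_loop[i, j, k] = np.sum(X[:, i, 0] * X[:, j, 1] * X[:, k, 2])
```

The temporary product has n*p^3 elements, which is the memory cost discussed in the timings further down.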
As I mentioned in a comment, vectorization is not always a huge advantage any more. Therefore there are vectorization methods that slow down the code rather than speed it up. You must always time your solutions. Vectorization often involves the creation of large temporary arrays, or the copy of large amounts of data, which are avoided in loop code. It depends on the architecture, the size of the input, and many other factors if such a solution is going to be faster.
Nonetheless, in this case it seems vectorization approaches can yield a large speedup.
The first thing to notice about the original code is that X(:, i, 1) .* X(:, j, 2) gets re-computed in the inner loop, though it is a constant value there. Rewriting the inner loop as this will save time:
Y = X(:, i, 1) .* X(:, j, 2);
for k = 1:p
    T(i, j, k) = sum(Y .* X(:, k, 3));
end
Now we notice that the inner loop is a dot product, and can be written as follows:
Y = X(:, i, 1) .* X(:, j, 2);
T(i, j, :) = Y.' * X(:, :, 3);
The .' transpose on Y does not copy the data, as Y is a vector. Next, we notice that X(:, :, 3) is indexed repeatedly. Let's move this out of the outer loop. Now I'm left with the following code:
T = zeros(p, p, p);
X1 = X(:, :, 1);
X2 = X(:, :, 2);
X3 = X(:, :, 3);
for i = 1:p
    for j = 1:p
        Y = X1(:, i) .* X2(:, j);
        T(i, j, :) = Y.' * X3;
    end
end
It is likely that removing the loop over j is equally easy, which would leave a single loop over i. But this is where I stop.
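One possible way to remove the loop over j as well, sketched here in NumPy rather than MATLAB (this goes beyond what the answer above actually implements, so treat it as an assumption about how that step could look): for a fixed i, multiplying column i of X1 against all columns of X2 gives an n-by-p matrix Y, and Y' * X3 then fills the whole slice T(i, :, :) with a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 6                                   # small illustrative sizes
X = rng.standard_normal((n, p, 3))
X1, X2, X3 = X[:, :, 0], X[:, :, 1], X[:, :, 2]

T = np.zeros((p, p, p))
for i in range(p):
    Y = X1[:, i][:, None] * X2     # (n, p): column i of X1 times every column of X2
    T[i] = Y.T @ X3                # (p, p): sums over n for all (j, k) at once

# Reference result for comparison
T_ref = np.einsum('ia,ib,ic->abc', X1, X2, X3)
```

Whether this is faster than the two-loop version depends on n and p, so it should be timed like the other variants.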
These are the timings I see (R2017a, 3-year old iMac with 4 cores). For n=10, p=20:
original: 0.0206
moving Y out the inner loop: 0.0100
removing inner loop: 0.0016
moving indexing out of loops: 7.6294e-04
Luis' answer: 1.9196e-04
For a larger array with n=50, p=100:
original: 2.9107
moving Y out the inner loop: 1.3488
removing inner loop: 0.0910
moving indexing out of loops: 0.0361
Luis' answer: 0.1417
"Luis' answer" is this one. It is by far fastest for small arrays, but for larger arrays it shows the cost of the permutation. Moving the computation of the first product out of the inner loop saves a bit over half the computation cost. But removing the inner loop reduces the cost quite dramatically (which I hadn't expected, I presume the single matrix product can use parallelism better than the many small element-wise products). We then get a further time reduction by reducing the amount of indexing operations within the loop.
This is the timing code:
function so()
n = 10; p = 20;
%n = 50; p = 100;
X = randn(n,p,3);
T1 = method1(X);
T2 = method2(X);
T3 = method3(X);
T4 = method4(X);
T5 = method5(X);
function T = method1(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
    for j = 1:p
        for k = 1:p
            T(i, j, k) = sum(X(:, i, 1) .* X(:, j, 2) .* X(:, k, 3));
        end
    end
end
function T = method2(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
    for j = 1:p
        Y = X(:, i, 1) .* X(:, j, 2);
        for k = 1:p
            T(i, j, k) = sum(Y .* X(:, k, 3));
        end
    end
end
function T = method3(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
    for j = 1:p
        Y = X(:, i, 1) .* X(:, j, 2);
        T(i, j, :) = Y.' * X(:, :, 3);
    end
end
function T = method4(X)
p = size(X,2);
T = zeros(p, p, p);
X1 = X(:, :, 1);
X2 = X(:, :, 2);
X3 = X(:, :, 3);
for i = 1:p
    for j = 1:p
        Y = X1(:, i) .* X2(:, j);
        T(i, j, :) = Y.' * X3;
    end
end
function T = method5(X)
T = sum(permute(X(:,:,1), [2 4 5 3 1]) .* permute(X(:,:,2), [4 2 5 3 1]) .* permute(X(:,:,3), [4 5 2 3 1]), 5);
You have mentioned you are open to other languages, and NumPy's syntax is very close to MATLAB's, so here is a NumPy-based solution.
These tensor-related sum-reductions, especially the matrix-multiplication ones, are easily expressed in Einstein notation, and NumPy has a function for exactly that: np.einsum. Under the hood, it's implemented in C and is pretty efficient. Recently it has been optimized further to leverage BLAS-based matrix-multiplication implementations.
So, a translation of the stated code to NumPy, keeping in mind that it uses 0-based indexing and that the axes are visualized differently than the dimensions in MATLAB, would be -
import numpy as np
# X is a NumPy array of shape : (n,p,3). So, a random one could be
# generated with : `X = np.random.rand(n,p,3)`.
T = np.zeros((p, p, p))
for i in range(p):
    for j in range(p):
        for k in range(p):
            T[i, j, k] = np.sum(X[:, i, 0] * X[:, j, 1] * X[:, k, 2])
The einsum way to solve it would be -
T = np.einsum('ia,ib,ic->abc', X[...,0], X[...,1], X[...,2])
To leverage matrix-multiplication, use optimize flag -
T = np.einsum('ia,ib,ic->abc', X[...,0], X[...,1], X[...,2], optimize=True)
Timings (with large sizes)
In [27]: n,p = 100,100
    ...: X = np.random.rand(n,p,3)
In [28]: %%timeit
    ...: T = np.zeros((p, p, p))
    ...: for i in range(p):
    ...:     for j in range(p):
    ...:         for k in range(p):
    ...:             T[i, j, k] = np.sum(X[:, i, 0] * X[:, j, 1] * X[:, k, 2])
1 loop, best of 3: 6.23 s per loop
In [29]: %timeit np.einsum('ia,ib,ic->abc',X[...,0],X[...,1],X[...,2])
1 loop, best of 3: 353 ms per loop
In [31]: %timeit np.einsum('ia,ib,ic->abc',X[...,0],X[...,1],X[...,2],optimize=True)
100 loops, best of 3: 10.5 ms per loop
In [32]: 6230.0/10.5
Out[32]: 593.3333333333334
Around 600x speedup there!

Precompute weights for multidimensional linear interpolation

I have a non-uniform rectangular grid along D dimensions, a matrix of logical values V on the grid, and a matrix of query data points X. The number of grid points differs across dimensions.
I run the interpolation multiple times for the same grid G and query X, but for different values V.
The goal is to precompute the indexes and weights for the interpolation and to reuse them, because they are always the same.
Here is an example in 2 dimensions, in which I have to compute indexes and values every time within the loop, but I want to compute them only once before the loop. I keep the data types from my application (mostly single and logical gpuArrays).
% Define grid
G{1} = single([0; 1; 3; 5; 10]);
G{2} = single([15; 17; 18; 20]);
% Steps and edges are redundant but help make interpolation a bit faster
S{1} = G{1}(2:end)-G{1}(1:end-1);
S{2} = G{2}(2:end)-G{2}(1:end-1);
gpuInf = 1e10;
% It's my workaround for a bug in GPU version of discretize in Matlab R2017a.
% It throws an error if edges contain Inf, realmin, or realmax. Seems fixed in R2017b prerelease.
E{1} = [-gpuInf; G{1}(2:end-1); gpuInf];
E{2} = [-gpuInf; G{2}(2:end-1); gpuInf];
% Generate query points
n = 50; X = gpuArray(single([rand(n,1)*14-2, 14+rand(n,1)*7]));
[G1, G2] = ndgrid(G{1},G{2});
for i = 1 : 4
    % Generate values on grid
    foo = @(x1,x2) (sin(x1+rand) + cos(x2*rand))>0;
    V = gpuArray(foo(G1,G2));
    % Interpolate
    V_interp = interpV(X, V, G, E, S);
    % Plot results
    contourf(G1, G2, V); hold on;
    scatter(X(:,1), X(:,2),50,[ones(n,1), 1-V_interp, 1-V_interp],'filled', 'MarkerEdgeColor','black'); hold off;
end
function y = interpV(X, V, G, E, S)
y = min(1, max(0, interpV_helper(X, 1, 1, 0, [], V, G, E, S) ));
end
function y = interpV_helper(X, dim, weight, curr_y, index, V, G, E, S)
if dim == ndims(V)+1
    M = [1,cumprod(size(V),2)];
    idx = 1 + (index-1)*M(1:end-1)';
    y = curr_y + weight .* single(V(idx));
else
    x = X(:,dim); grid = G{dim}; edges = E{dim}; steps = S{dim};
    iL = single(discretize(x, edges));
    weightL = weight .* (grid(iL+1) - x) ./ steps(iL);
    weightH = weight .* (x - grid(iL)) ./ steps(iL);
    y = interpV_helper(X, dim+1, weightL, curr_y, [index, iL ], V, G, E, S) +...
        interpV_helper(X, dim+1, weightH, curr_y, [index, iL+1], V, G, E, S);
end
end
I found a way to do this and am posting it here because (as of now) two more people are interested. It takes only a slight modification to my original code (see below).
% Define grid
G{1} = single([0; 1; 3; 5; 10]);
G{2} = single([15; 17; 18; 20]);
% Steps and edges are redundant but help make interpolation a bit faster
S{1} = G{1}(2:end)-G{1}(1:end-1);
S{2} = G{2}(2:end)-G{2}(1:end-1);
gpuInf = 1e10;
% It's my workaround for a bug in GPU version of discretize in Matlab R2017a.
% It throws an error if edges contain Inf, realmin, or realmax. Seems fixed in R2017b prerelease.
E{1} = [-gpuInf; G{1}(2:end-1); gpuInf];
E{2} = [-gpuInf; G{2}(2:end-1); gpuInf];
% Generate query points
n = 50; X = gpuArray(single([rand(n,1)*14-2, 14+rand(n,1)*7]));
[G1, G2] = ndgrid(G{1},G{2});
[W, I] = interpIW(X, G, E, S); % Precompute weights W and indexes I
for i = 1 : 4
    % Generate values on grid
    foo = @(x1,x2) (sin(x1+rand) + cos(x2*rand))>0;
    V = gpuArray(foo(G1,G2));
    % Interpolate
    V_interp = sum(W .* single(V(I)), 2);
    % Plot results
    contourf(G1, G2, V); hold on;
    scatter(X(:,1), X(:,2), 50,[ones(n,1), 1-V_interp, 1-V_interp],'filled', 'MarkerEdgeColor','black'); hold off;
end
function [W, I] = interpIW(X, G, E, S)
global Weights Indexes
Weights=[]; Indexes=[];
interpIW_helper(X, 1, 1, [], G, E, S, []);
W = Weights; I = Indexes;
end
function [] = interpIW_helper(X, dim, weight, index, G, E, S, sizeV)
global Weights Indexes
if dim == size(X,2)+1
    M = [1,cumprod(sizeV,2)];
    Weights = [Weights, weight];
    Indexes = [Indexes, 1 + (index-1)*M(1:end-1)'];
else
    x = X(:,dim); grid = G{dim}; edges = E{dim}; steps = S{dim};
    iL = single(discretize(x, edges));
    weightL = weight .* (grid(iL+1) - x) ./ steps(iL);
    weightH = weight .* (x - grid(iL)) ./ steps(iL);
    interpIW_helper(X, dim+1, weightL, [index, iL ], G, E, S, [sizeV, size(grid,1)]);
    interpIW_helper(X, dim+1, weightH, [index, iL+1], G, E, S, [sizeV, size(grid,1)]);
end
end
To do this, the whole interpolation process, except for computing the interpolated values, has to be carried out. Here is a solution translated from the Octave C++ source. The input format is the same as the first signature of the interpn function, except that the v array is not needed. Also, the Xs should be vectors and should not be in ndgrid format. Both outputs, W (weights) and I (positions), have size (a, b), where a is the number of grid neighbors of a point and b is the number of query points to be interpolated.
function [W , I] = lininterpnw(varargin)
% [W I] = lininterpnw(X1,X2,...,Xn,Xq1,Xq2,...,Xqn)
n = numel(varargin)/2;
x = varargin(1:n);
y = varargin(n+1:end);
sz = cellfun(@numel,x);
scale = [1 cumprod(sz(1:end-1))];
Ni = numel(y{1});
index = zeros(n,Ni);
x_before = zeros(n,Ni);
x_after = zeros(n,Ni);
for ii = 1:n
    jj = interp1(x{ii},1:sz(ii),y{ii},'previous');
    index(ii,:) = jj-1;
    x_before(ii,:) = x{ii}(jj);
    x_after(ii,:) = x{ii}(jj+1);
end
coef(2:2:2*n,1:Ni) = (vertcat(y{:}) - x_before) ./ (x_after - x_before);
coef(1:2:end,:) = 1 - coef(2:2:2*n,:);
bit = permute(dec2bin(0:2^n-1)=='1', [2,3,1]);
%I = reshape(1+scale*bsxfun(@plus,index,bit), Ni, []).'; %Octave
I = reshape(1+sum(bsxfun(@times,scale(:),bsxfun(@plus,index,bit))), Ni, []).';
W = squeeze(prod(reshape(coef(bsxfun(@plus,(1:2:2*n).',bit),:).',Ni,n,[]),2)).';
x={[1 3 8 9],[2 12 13 17 25]};
v = rand(4,5);
y={[1.5 1.6 1.3 3.5,8.1,8.3],[8.4,13.5,14.4,23,23.9,24.2]};
[W I]=lininterpnw(x{:},y{:});
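The precompute-then-reuse idea from both answers can also be sketched in NumPy. This is an independent illustration, not a translation of either MATLAB function: the helper name precompute_weights, the grid, and the query points are made up. It returns one weight and one flat index per grid-cell corner (2^d corners per query point), so interpolating a new value array V reduces to a gather and a weighted sum.

```python
import numpy as np
from itertools import product

def precompute_weights(grids, queries):
    """Return (W, I), each of shape (n_points, 2**d): multilinear weights and
    flat C-order indices into a value array defined on the grid."""
    d = len(grids)
    sizes = tuple(len(g) for g in grids)
    lows, fracs = [], []
    for g, q in zip(grids, queries):
        g = np.asarray(g, dtype=float)
        q = np.asarray(q, dtype=float)
        # Index of the grid node at or below each query, clipped to a valid cell
        lo = np.clip(np.searchsorted(g, q, side='right') - 1, 0, len(g) - 2)
        lows.append(lo)
        fracs.append((q - g[lo]) / (g[lo + 1] - g[lo]))
    W_cols, I_cols = [], []
    for corner in product((0, 1), repeat=d):       # every corner of the cell
        w = np.ones_like(fracs[0])
        idx = []
        for k, bit in enumerate(corner):
            w = w * (fracs[k] if bit else 1.0 - fracs[k])
            idx.append(lows[k] + bit)
        W_cols.append(w)
        I_cols.append(np.ravel_multi_index(tuple(idx), sizes))
    return np.stack(W_cols, axis=1), np.stack(I_cols, axis=1)

grids = [np.array([0., 1., 3., 5., 10.]), np.array([15., 17., 18., 20.])]
queries = [np.array([0.5, 4.0, 7.5]), np.array([16.0, 17.5, 19.0])]
W, I = precompute_weights(grids, queries)          # computed once

# Reuse W and I for any value array on the same grid
G1, G2 = np.meshgrid(*grids, indexing='ij')
V = 2.0 * G1 + 3.0 * G2                            # linear, so reproduced exactly
V_interp = (W * V.ravel()[I]).sum(axis=1)
```

Query points outside the grid are extrapolated from the nearest cell in this sketch; clamp the result, as the first answer does, if that is not wanted.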
Thanks to @SardarUsama for testing and his useful comments.

Triple weighted sum

I was trying to vectorize a certain weighted sum but couldn't figure out how to do it. I have created a simple minimal working example below. I guess the solution involves either bsxfun or reshape and kronecker products but I still have not managed to get it working.
N = 200;
T1 = 5;
T2 = 7;
T3 = 10;
A = rand(N,T1,T2,T3);
w1 = rand(T1,1);
w2 = rand(T2,1);
w3 = rand(T3,1);
B = zeros(N,1);
for i = 1:N
    for j1=1:T1
        for j2=1:T2
            for j3=1:T3
                B(i) = B(i) + w1(j1) * w2(j2) * w3(j3) * A(i,j1,j2,j3);
            end
        end
    end
end
A = B;
For the two dimensional case there is a smart answer here.
You can use an additional multiplication to modify the w1 * w2' grid from the previous answer to then multiply by w3 as well. You can then use matrix multiplication again to multiply with a "flattened" version of A.
W = reshape(w1 * w2.', [], 1) * w3.';
B = reshape(A, size(A, 1), []) * W(:);
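A rough NumPy counterpart of this flatten-and-multiply idea is sketched below, with made-up sizes. One detail differs from MATLAB and is easy to get wrong: NumPy reshapes in row-major order, so the last axis varies fastest and the Kronecker factors appear in the opposite order (w1 outermost) compared with the column-major MATLAB expression.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T1, T2, T3 = 6, 5, 7, 10                  # small illustrative sizes
A = rng.random((N, T1, T2, T3))
w1, w2, w3 = rng.random(T1), rng.random(T2), rng.random(T3)

# Reference: the triple weighted sum written as an explicit contraction
B_ref = np.einsum('ijkl,j,k,l->i', A, w1, w2, w3)

# Flatten the trailing axes and multiply by one long weight vector.
# Row-major flattening puts j3 fastest, hence kron(w1, kron(w2, w3)).
W = np.kron(w1, np.kron(w2, w3))
B_flat = A.reshape(N, -1) @ W
```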
You could wrap the creation of weights into its own function and make this generalizable to N weights. Since this uses recursion, N is limited to your current recursion limit (500 by default).
function W = createWeights(W, varargin)
    if numel(varargin) > 0
        W = createWeights(W(:) * varargin{1}(:).', varargin{2:end});
    end
end
And use it with:
W = createWeights(w1, w2, w3);
B = reshape(A, size(A, 1), []) * W(:);
Using part of @CKT's very good suggestion to use kron, we could modify createWeights just a little bit.
function W = createWeights(W, varargin)
    if numel(varargin) > 0
        W = createWeights(kron(varargin{1}, W), varargin{2:end});
    end
end
Again, you couldn't generalize this that well for N-D unless you made some function to construct the Kronecker product vector, but how about
A = reshape(A, N, []) * kron(w3, kron(w2, w1));
This is the logic behind it:
ww1 = repmat (permute (w1, [4, 1, 2, 3]), [N, 1, T2, T3]);
ww2 = repmat (permute (w2, [3, 4, 1, 2]), [N, T1, 1, T3]);
ww3 = repmat (permute (w3, [2, 3, 4, 1]), [N, T1, T2, 1 ]);
B = ww1 .* ww2 .* ww3 .* A;
B = sum (B(:,:), 2)
You can avoid permute by creating w1, w2, and w3 in the appropriate dimension in the first place. Also you can use bsxfun instead of repmat as appropriate for extra performance, I'm just showing the logic here and repmat is easier to follow.
EDIT: Generalised version for arbitrary input dimensions:
Dims = {N, T1, T2, T3}; % add T4, T5, T6, etc as appropriate
Params = cell (1, length (Dims));
Params{1} = rand (Dims{:});
for n = 2 : length (Dims)
  DimSubscripts = ones (1, length (Dims)); DimSubscripts(n) = Dims{n};
  RepSubscripts = [Dims{:}]; RepSubscripts(n) = 1;
  Params{n} = repmat (rand (DimSubscripts), RepSubscripts);
end
B = times (Params{:});
B = sum (B(:,:), 2)
If we're going the route of having functions anyway, and are favoring performance over elegance/brevity, then consider this:
function B = weightReduce(A, varargin)
    B = A;
    for i = length(varargin):-1:1
        N = length(varargin{i});
        B = reshape(B, [], N) * varargin{i};
    end
end
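The successive-reduction idea, contracting one weight vector at a time starting from the last dimension, carries over to NumPy quite directly, because the @ operator with a 1-D right operand contracts the last axis of the left operand. The sketch below uses made-up sizes and a hypothetical helper name weight_reduce; it is an illustration rather than a port of the MATLAB function.

```python
import numpy as np

def weight_reduce(A, *weights):
    """Contract the trailing axes of A with the given weight vectors, last axis first."""
    B = A
    for w in reversed(weights):
        B = B @ w          # removes the current last axis of B
    return B

rng = np.random.default_rng(0)
N, T1, T2, T3 = 6, 5, 7, 10                  # small illustrative sizes
A = rng.random((N, T1, T2, T3))
w1, w2, w3 = rng.random(T1), rng.random(T2), rng.random(T3)

B = weight_reduce(A, w1, w2, w3)
B_ref = np.einsum('ijkl,j,k,l->i', A, w1, w2, w3)
```

Each step shrinks the array by one dimension, so no weight array of size T1*T2*T3 is ever built.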
This is the performance comparison I see:
tic
for i = 1:10000
    W = createWeights(w1,w2,w3);
    B = reshape(A, size(A,1), [])*W(:);
end
toc
Elapsed time is 0.920821 seconds.
tic
for i = 1:10000
    B2 = weightReduce(A, w1, w2, w3);
end
toc
Elapsed time is 0.484470 seconds.

Vectorizing a nested for loop which fills a dynamic programming table

I was wondering if there was a way to vectorize the nested for loop in this function which is filling up the entries of the 2D dynamic programming table DP. I believe that at the very least the inner loop could be vectorized as each row only depends on the previous row. I'm not sure how to do it though. Note this function is called on large 2D arrays (images) so the nested for loop really doesn't cut it.
function [cols] = compute_seam(energy)
[r, c, ~] = size(energy);
cols = zeros(r);
DP = padarray(energy, [0, 1], Inf);
BP = zeros(r, c);
for i = 2 : r
    for j = 1 : c
        [x, l] = min([DP(i - 1, j), DP(i - 1, j + 1), DP(i - 1, j + 2)]);
        DP(i, j + 1) = DP(i, j + 1) + x;
        BP(i, j) = j + (l - 2);
    end
end
[~, j] = min(DP(r, :));
j = j - 1;
for i = r : -1 : 1
    cols(i) = j;
    j = BP(i, j);
end
end
Vectorization of the innermost nested loop
You were right in postulating that at least the inner loop is vectorizable. Here's the modified code for the nested loops part -
rows_DP = size(DP,1); %// rows in DP
%// Get first row linear indices for a group of neighboring three columns,
%// which would be incremented as we move between rows with the row iterator
start_ind1 = bsxfun(@plus,[1:rows_DP:2*rows_DP+1]',[0:c-1]*rows_DP); %//'
for i = 2 : r
    ind1 = start_ind1 + i-2; %// setup linear indices for the row of this iteration
    [x,l] = min(DP(ind1),[],1); %// get x and l values in one go
    DP(i,2:c+1) = DP(i,2:c+1) + x; %// set DP values of a row in one go
    BP(i,1:c) = [1:c] + l-2; %// set BP values of a row in one go
end
Benchmarking Code -
N = 3000; %// Datasize
energy = rand(N);
[r, c, ~] = size(energy);
disp('------------------------------------- With Original Code')
DP = padarray(energy, [0, 1], Inf);
BP = zeros(r, c);
tic
for i = 2 : r
    for j = 1 : c
        [x, l] = min([DP(i - 1, j), DP(i - 1, j + 1), DP(i - 1, j + 2)]);
        DP(i, j + 1) = DP(i, j + 1) + x;
        BP(i, j) = j + (l - 2);
    end
end
toc,clear DP BP x l
disp('------------------------------------- With Vectorized Code')
DP = padarray(energy, [0, 1], Inf);
BP = zeros(r, c);
tic
rows_DP = size(DP,1); %// rows in DP
start_ind1 = bsxfun(@plus,[1:rows_DP:2*rows_DP+1]',[0:c-1]*rows_DP); %//'
for i = 2 : r
    ind1 = start_ind1 + i-2; %// setup linear indices for the row of this iteration
    [x,l] = min(DP(ind1),[],1); %// get x and l values in one go
    DP(i,2:c+1) = DP(i,2:c+1) + x; %// set DP values of a row in one go
    BP(i,1:c) = [1:c] + l-2; %// set BP values of a row in one go
end
toc
Results -
------------------------------------- With Original Code
Elapsed time is 44.200746 seconds.
------------------------------------- With Vectorized Code
Elapsed time is 1.694288 seconds.
Thus, you might see roughly a 26x speedup with that little vectorization tweak.
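The row-at-a-time update can also be sketched in NumPy, which may help when porting the idea. This is an illustration with hypothetical function names and 0-based column indices, not a translation of the MATLAB code: for each row, the three shifted views of the previous row are stacked, the minimum and its position are taken along the stack, and the whole row of DP and BP is written at once. A per-entry loop version is included only as a reference for comparison.

```python
import numpy as np

def seam_tables_loops(energy):
    """Reference version: fills one table entry at a time."""
    r, c = energy.shape
    DP = np.pad(energy.astype(float), ((0, 0), (1, 1)), constant_values=np.inf)
    BP = np.zeros((r, c), dtype=int)
    for i in range(1, r):
        for j in range(c):
            window = DP[i - 1, j:j + 3]        # up-left, up, up-right
            l = int(np.argmin(window))
            DP[i, j + 1] += window[l]
            BP[i, j] = j + l - 1               # 0-based column of the predecessor
    return DP[:, 1:-1], BP

def seam_tables_rowwise(energy):
    """Vectorized inner loop: each row depends only on the previous row."""
    r, c = energy.shape
    DP = np.pad(energy.astype(float), ((0, 0), (1, 1)), constant_values=np.inf)
    BP = np.zeros((r, c), dtype=int)
    cols = np.arange(c)
    for i in range(1, r):
        prev = DP[i - 1]
        neighbours = np.stack([prev[:-2], prev[1:-1], prev[2:]])   # (3, c)
        l = np.argmin(neighbours, axis=0)      # which of the three is smallest
        DP[i, 1:-1] += neighbours[l, cols]
        BP[i] = cols + l - 1
    return DP[:, 1:-1], BP

rng = np.random.default_rng(0)
energy = rng.random((8, 6))
DP_a, BP_a = seam_tables_loops(energy)
DP_b, BP_b = seam_tables_rowwise(energy)
```

The loop over rows has to stay, because row i needs the finished values of row i-1.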
More tweaks
A few more optimization tweaks could be tried in your code for performance -
cols = zeros(r) could be replaced with cols(r,r) = 0.
DP = padarray(energy, [0, 1], Inf) could be replaced with
DP(:,2:end-1) = energy;
BP = zeros(r, c) could be replaced with BP(r, c) = 0.
The pre-allocation tweaks used here are inspired by this blog post.