Iterating in a matrix avoiding loop in MATLAB - matlab

I am posing an interesting and useful question that needs to be carried out in MATLAB. It is about efficiency of programming by avoiding using Loops"
Assume a matrix URm whose columns are products and rows are people. The matrix entries are rating of people to these products, and this matrix is sparse as each person normally rates only few products.
URm [n_u, n_i]
Another matrix of interest is F, which contains attribute for each of the products and the attribute is of fixed length:
F [n_f,n_i]
We divide the URm into two sub-matrices randomly: URmTrain and URmTest where the former is used for training the system and the latter for test. These two matrices have similar rows (users) but they could have different number of columns (products).
We can find the similarity between items very fast using pdist() or Matrix transpose:
S = F * F' ;
For each row (user) in URmTest:
URmTestp = zeros(size(URmTest));
u = 1 ; %% Example user 1
for i = 1 : size(URmTest,2)
indTrain = find(URmTrain(u,:)) ; % For each user, search for items in URmTrain that have been rated by the the user (i.e. the have a rating greater than zero)
for j = 1 : length(indTrain)
URmTestp(u,i) = URmTestp(u,i) + S(i,indTrain(j))*URmTrain(u,indTrain(j))
end
end
where URmp is the predicted version of URm and we can compute an error on how good our prediction has been.
Example
Lets's make a simple example. Let's assume the items user 1 has rated items 3 , 5 and 17:
indTrain = [3 5 17]
For each item j in URmTest, I want to predict the rating using the following formula:
URmTestp(u,j) = S(j,3)*URmTrain(u,3) + S(j,5)*URmTrain(u,5) + S(j,17)*URmTrain(u,17)
Once completed this process needs to be repeated for all users.
As URm is typically very big, I prefer options which use least amount of 'loops'. We may be able to take advantage of bsxfun but I am not sure if we can.
Please suggest me ides that can help on accelerating this process as rapid as possible. Thank you

I'm still not sure I completely understand your problem. But it seems to me that if you pre-compute s_ij as
s_ij = F.' * F %'// [ni x ni] matrix
then what you're after is simply
URmTestp(u,indTest) = URmTrain(u,indTrain) * s_ij(indTrain,indTest);
% or
%URmTestp(u,:) = URmTrain(u,indTrain) * s_ij(indTrain,:);
or if you only compute a smaller s_ij block only for the necessary arrays,
s_ij = F(:,indTrain).' * F(:,indTest);
then
URmTestp(u,indTest) = URmTrain(u,indTrain) * s_ij;
Alternatively, you can always compute the necessary subblock of s_ij on the fly:
URmTestp(u,indTest) = URmTrainp(u,indTrain) * F(:,indTrain).'*F(:,indTest);
If I understand correctly that indTest and indTrain are functions of u, such as
URmTestp = zeros(n_u,n_i); %// pre-allocate here!
for u=1:n_u
indTest = testCell{u};
indTrain = trainCell{u};
URmTestp(u,indTest) = URmTrainp(u,indTrain) * F(:,indTrain).'*F(:,indTest); %'
...
end
then probably not much can be vectorized on this loop, unless there's a very tricky indexing scheme that allows you to use linear indices. I'd stick with this setup.

Related

Extract data from multidimentional array into 2 dims based on index

I have a huge (1000000x100x7) matrix and i need to create a (1000000x100x1) matrix based on an index vector (100x1) which holds 1 2 3 4 5 6 or 7 for each location.
I do not want to use loops
The problem (I think)
First, let me try create a minimum working example that I think captures what you want to do. You have a matrix A and an index vector index:
A = rand(1000000, 100, 7);
index = randi(7, [100, 1]);
And you would like to do something like this:
[I,J,K] = size(A);
B = zeros(I,J);
for i=1:I
for j=1:J
B(i,j) = A(i,j,index(j));
end
end
Only you'd like to do so without the loops.
Linear indexing
One way to do this is by using linear indexing. This is kinda a tricky thing that depends on how the matrix is laid out in memory, and I'm gonna do a really terrible job explaining it, but you can also check out the documentation for the sub2ind and ind2sub functions.
Anyways, it means that given your (1,000,000 x 100 x 7) matrix stored in column-major format, you can refer to the same element in many different ways, i.e.:
A(i, j, k)
A(i, j + 100*(k-1))
A(i + 1000000*(j-1 + 100*(k-1)))
all refer to the same element of the matrix. Anyways, the punchline is:
linear_index = (1:J)' + J*(index-1);
B_noloop = A(:, linear_index);
And of course we should verify that this produces the same answer:
>> isequal(B, B_noloop)
ans =
1
Yay!
Performance vs. readability
So testing this on my computer, the nested loops took 5.37 seconds and the no-loop version took 0.29 seconds. However, it's kinda hard to tell what's going on in that code. Perhaps a more reasonable compromise would be:
B_oneloop = zeros(I,J);
for j=1:J
B_oneloop(:,j) = A(:,j,index(j));
end
which vectorizes the longest dimension of the matrix and thus gets most of the way there (0.43 seconds), but maintains the readability of the original code.

MATLAB: Find abbreviated version of matrix that minimises sum of matrix elements

I have a 151-by-151 matrix A. It's a correlation matrix, so there are 1s on the main diagonal and repeated values above and below the main diagonal. Each row/column represents a person.
For a given integer n I will seek to reduce the size of the matrix by kicking people out, such that I am left with a n-by-n correlation matrix that minimises the total sum of the elements. In addition to obtaining the abbreviated matrix, I also need to know the row number of the people who should be booted out of the original matrix (or their column number - they'll be the same number).
As a starting point I take A = tril(A), which will remove redundant off-diagonal elements from the correlation matrix.
So, if n = 4 and we have the hypothetical 5-by-5 matrix above, it's very clear that person 5 should be kicked out of the matrix, since that person is contributing a lot of very high correlations.
It's also clear that person 1 should not be kicked out, since that person contributes a lot of negative correlations, and thus brings down the sum of the matrix elements.
I understand that sum(A(:)) will sum everything in the matrix. However, I'm very unclear about how to search for the minimum possible answer.
I noticed a similar question Finding sub-matrix with minimum elementwise sum, which has a brute force solution as the accepted answer. While that answer works fine there it's impractical for a 151-by-151 matrix.
EDIT: I had thought of iterating, but I don't think that truly minimizes the sum of elements in the reduced matrix. Below I have a 4-by-4 correlation matrix in bold, with sums of rows and columns on the edges. It's apparent that with n = 2 the optimal matrix is the 2-by-2 identity matrix involving Persons 1 and 4, but according to the iterative scheme I would have kicked out Person 1 in the first phase of iteration, and so the algorithm makes a solution that is not optimal. I wrote a program that always generated optimal solutions, and it works well when n or k are small, but when trying to make an optimal 75-by-75 matrix from a 151-by-151 matrix I realised my program would take billions of years to terminate.
I vaguely recalled that sometimes these n choose k problems can be resolved with dynamic programming approaches that avoid recomputing things, but I can't work out how to solve this, and nor did googling enlighten me.
I'm willing to sacrifice precision for speed if there's no other option, or the best program will take more than a week to generate a precise solution. However, I'm happy to let a program run for up to a week if it will generate a precise solution.
If it's not possible for a program to optimise the matrix within an reasonable timeframe, then I would accept an answer that explains why n choose k tasks of this particular sort can't be resolved within reasonable timeframes.
This is an approximate solution using a genetic algorithm.
I started with your test case:
data_points = 10; % How many data points will be generated for each person, in order to create the correlation matrix.
num_people = 25; % Number of people initially.
to_keep = 13; % Number of people to be kept in the correlation matrix.
to_drop = num_people - to_keep; % Number of people to drop from the correlation matrix.
num_comparisons = 100; % Number of times to compare the iterative and optimization techniques.
for j = 1:data_points
rand_dat(j,:) = 1 + 2.*randn(num_people,1); % Generate random data.
end
A = corr(rand_dat);
then I defined the functions you need to evolve the genetic algorithm:
function individuals = user1205901individuals(nvars, FitnessFcn, gaoptions, num_people)
individuals = zeros(num_people,gaoptions.PopulationSize);
for cnt=1:gaoptions.PopulationSize
individuals(:,cnt)=randperm(num_people);
end
individuals = individuals(1:nvars,:)';
is the individual generation function.
function fitness = user1205901fitness(ind, A)
fitness = sum(sum(A(ind,ind)));
is the fitness evaluation function
function offspring = user1205901mutations(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation, num_people)
offspring=zeros(length(parents),nvars);
for cnt=1:length(parents)
original = thisPopulation(parents(cnt),:);
extraneus = setdiff(1:num_people, original);
original(fix(rand()*nvars)+1) = extraneus(fix(rand()*(num_people-nvars))+1);
offspring(cnt,:)=original;
end
is the function to mutate an individual
function children = user1205901crossover(parents, options, nvars, FitnessFcn, unused, thisPopulation)
children=zeros(length(parents)/2,nvars);
cnt = 1;
for cnt1=1:2:length(parents)
cnt2=cnt1+1;
male = thisPopulation(parents(cnt1),:);
female = thisPopulation(parents(cnt2),:);
child = union(male, female);
child = child(randperm(length(child)));
child = child(1:nvars);
children(cnt,:)=child;
cnt = cnt + 1;
end
is the function to generate a new individual coupling two parents.
At this point you can define your problem:
gaproblem2.fitnessfcn=#(idx)user1205901fitness(idx,A)
gaproblem2.nvars = to_keep
gaproblem2.options = gaoptions()
gaproblem2.options.PopulationSize=40
gaproblem2.options.EliteCount=10
gaproblem2.options.CrossoverFraction=0.1
gaproblem2.options.StallGenLimit=inf
gaproblem2.options.CreationFcn= #(nvars,FitnessFcn,gaoptions)user1205901individuals(nvars,FitnessFcn,gaoptions,num_people)
gaproblem2.options.CrossoverFcn= #(parents,options,nvars,FitnessFcn,unused,thisPopulation)user1205901crossover(parents,options,nvars,FitnessFcn,unused,thisPopulation)
gaproblem2.options.MutationFcn=#(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation) user1205901mutations(parents, options, nvars, FitnessFcn, state, thisScore, thisPopulation, num_people)
gaproblem2.options.Vectorized='off'
open the genetic algorithm tool
gatool
from the File menu select Import Problem... and choose gaproblem2 in the window that opens.
Now, run the tool and wait for the iterations to stop.
The gatool enables you to change hundreds of parameters, so you can trade speed for precision in the selected output.
The resulting vector is the list of indices that you have to keep in the original matrix so A(garesults.x,garesults.x) is the matrix with only the desired persons.
If I have understood you problem statement, you have a N x N matrix M (which happens to be a correlation matrix), and you wish to find for integer n where 2 <= n < N, a n x n matrix m which minimises the sum over all elements of m which I denote f(m)?
In Matlab it is fairly easy and fast to obtain a sub-matrix of a matrix (see for example Removing rows and columns from matrix in Matlab), and the function f is relatively inexpensive to evaluate for n = 151. So why can't you implement an algorithm that solves this backwards dynamically in a program as below where I have sketched out the pseudocode:
function reduceM(M, n){
m = M
for (ii = N to n+1) {
for (jj = 1 to ii) {
val(jj) = f(m) where mhas column and row jj removed, f(X) being summation over all elements of X
}
JJ(ii) = jj s.t. val(jj) is smallest
m = m updated by removing column and row JJ(ii)
}
}
In the end you end up with an m of dimension n which is the solution to your problem and a vector JJ which contains the indices removed at each iteration (you should easily be able to convert these back to indices applicable to the full matrix M)
There are several approaches to finding an approximate solution (eg. quadratic programming on relaxed problem or greedy search), but finding the exact solution is an NP-hard problem.
Disclaimer: I'm not an expert on binary quadratic programming, and you may want to consult the academic literature for more sophisticated algorithms.
Mathematically equivalent formulation:
Your problem is equivalent to:
For some symmetric, positive semi-definite matrix S
minimize (over vector x) x'*S*x
subject to 0 <= x(i) <= 1 for all i
sum(x)==n
x(i) is either 1 or 0 for all i
This is a quadratic programming problem where the vector x is restricted to taking only binary values. Quadratic programming where the domain is restricted to a set of discrete values is called mixed integer quadratic programming (MIQP). The binary version is sometimes called Binary Quadratic Programming (BQP). The last restriction, that x is binary, makes the problem substantially more difficult; it destroys the problem's convexity!
Quick and dirty approach to finding an approximate answer:
If you don't need a precise solution, something to play around with might be a relaxed version of the problem: drop the binary constraint. If you drop the constraint that x(i) is either 1 or 0 for all i, then the problem becomes a trivial convex optimization problem and can be solved nearly instantaneously (eg. by Matlab's quadprog). You could try removing entries that, on the relaxed problem, quadprog assigns the lowest values in the x vector, but this does not truly solve the original problem!
Note also that the relaxed problem gives you a lower bound on the optimal value of the original problem. If your discretized version of the solution to the relaxed problem leads to a value for the objective function close to the lower bound, there may be a sense in which this ad-hoc solution can't be that far off from the true solution.
To solve the relaxed problem, you might try something like:
% k is number of observations to drop
n = size(S, 1);
Aeq = ones(1,n)
beq = n-k;
[x_relax, f_relax] = quadprog(S, zeros(n, 1), [], [], Aeq, beq, zeros(n, 1), ones(n, 1));
f_relax = f_relax * 2; % Quadprog solves .5 * x' * S * x... so mult by 2
temp = sort(x_relax);
cutoff = temp(k);
x_approx = ones(n, 1);
x_approx(x_relax <= cutoff) = 0;
f_approx = x_approx' * S * x_approx;
I'm curious how good x_approx is? This doesn't solve your problem, but it might not be horrible! Note that f_relax is a lower bound on the solution to the original problem.
Software to solve your exact problem
You should check out this link and go down to the section on Mixed Integer Quadratic Programming (MIQP). It looks to me that Gurobi can solve problems of your type. Another list of solvers is here.
Working on a suggestion from Matthew Gunn and also some advice at the Gurobi forums, I came up with the following function. It seems to work pretty well.
I will award it the answer, but if someone can come up with code that works better I'll remove the tick from this answer and place it on their answer instead.
function [ values ] = the_optimal_method( CM , num_to_keep)
%the_iterative_method Takes correlation matrix CM and number to keep, returns list of people who should be kicked out
N = size(CM,1);
clear model;
names = strseq('x',[1:N]);
model.varnames = names;
model.Q = sparse(CM); % Gurobi needs a sparse matrix as input
model.A = sparse(ones(1,N));
model.obj = zeros(1,N);
model.rhs = num_to_keep;
model.sense = '=';
model.vtype = 'B';
gurobi_write(model, 'qp.mps');
results = gurobi(model);
values = results.x;
end

Add random number to cell of matrices

I have a cell, A of size 1 by 625. i.e.
A = { A1, A2, ..., A625},
where each of the 625 elements of the cell A are 3D matrices of the same size, 42 by 42 by 3.
Problem 1
Since the entries of my matrices represent concentration of red blood cells which are of very small values and I cannot simply work with randn.
For each of the matrix, I try with this command, e.g.:
A1 = A1 .*(1 + randn(42,42,3)/100)
I try with dividing by 100 to minimize the possibility of very negative number (e.g. 1.234e-6) but I cannot eliminate this possibility.
Also, is there any quick way to add different randn(42,42,3) to different 625 matrices. A + randn(42,42,3) won't work because it is adding the same set of random numbers.
Problem 2
I want to make 30 copies of the cell A by adding random numbers to each of entries of the 625 matrices. That is, I want to obtain a cell, Patients which is a cell of 1 by 30 and each of the cell element is another cell element of 625 matrices.
Patients = A % Initialization. I have 30 patients.
for i = 1 : 30;
Patients = { Patients, Patients + `method from problem 1`};
end
I have tried to make my problems clear. I appreciate so much for your help.
Problem 1:
% optional: initialize the random number generator to the current time. T
rng('shuffle');
A = cell(625);
A = cellfun( #(x) rand(42,42,3), A ,'UniformOutput', false)
note the differences between 'rand', 'randn', 'rng'
from the MATLAB documentation:
rng('shuffle') seeds the random number generator based on the current time. Thus, rand, randi, and randn produce a different sequence of numbers after each time you call rng.
rand = Uniformly distributed random numbers -> no negative values [0, 1]
randn = Normally distributed random numbers
If you want to generate numbers in a special interval (from the MATLAB documentation):
In general, you can generate N random numbers in the interval [a,b] with the formula r = a + (b-a).*rand(N,1).
Problem 2:
I highly recommend you to build a struct with your 30 cells. This will facilitate the indexing/looping later significantly! You can name the struct fields after your patients, so you will have less trouble keeping track of them.
You can also build an cell-array. This is the easiest way for indexing.
In both cases: do preallocation of the memmory! (MATLAB stores variables coherent in your memory. If the variable grows, MATLAB might have to relocate the variable...)
% for preallocation + initialization in one step:
A = cell(625);
Patients.Peter = cellfun( #(x) rand(42,42,3), A ,'UniformOutput', false);
Patients.Tom = cellfun( #(x) rand(42,42,3), A ,'UniformOutput', false);
% loop through the struct like this
FldNms = fieldnames( Patients );
for i = 0:length(Patients)
Patients.(FldNms{i}) = % do whatever you like
end
If you prefer an array:
% preallocation:
arry = cell(30);
for i = 1:length(array)
arry(i) = { cellfun( #(x) rand(42,42,3), A ,'UniformOutput', false) };
end
This is a lot of wild indexing and you will get a lot of trouble by using (),{} and [] for indexing.
Think again if you need a cell array in the first place. An array of 625 matrices might suit you as well. Data structure can significantly affect the performance, the readability and your coding time!
For your first question, what about
cellfun(#(x)(x + randn(42,42,3)/100), A, 'uni', 0)
your second problem should (A) be a second question and (B) be trivial to solve once your first is done. Just drop the Patients + and your code should work correctly.

Average part of a multidimensional array based on another array (Matlab)

B = randn(1,25,10);
Z = [1;1;1;2;2;3;4;4;4;3];
Ok, so, I want to find the locations where Z=1(or any numbers that are equal to each other), then average across each of the 25 points at these specific locations. In the example you would end with a 1*25*4 array.
Is there an easy way to do this?
I'm not the most versed in Matlab.
First things first: break down the problem.
Define the groups (i.e. the set of unique Z values)
Find elements which belong to these groups
Take the average.
Once you have done that, you can begin to see it's a pretty standard for loop and "Select columns which meet criteria".
Something along the lines of:
B = randn(1,25,10);
Z = [1;1;1;2;2;3;4;4;4;3];
groups = unique(Z); %//find the set of groups
C = nan(1,25,length(groups)); %//predefine the output space for efficiency
for gi = 1:length(groups) %//for each group
idx = Z == groups(gi); %//find it's members
C(:,:,gi) = mean(B(:,:,idx), 3); %//select and mean across the third dimension
end
If B = randn(10,25); then it's very easy because Matlab function usually works down the rows.
Using logical indexing:
ind = Z == 1;
mean(B(ind,:));
If you're dealing with multiple dimensions use permute (and reshape if you actually have 3 dimensions or more) to get yourself to a point where you're averaging down the rows as above:
B = randn(1,25,10);
BB = permute(B, [3,2,1])
continue as above

Calculating all two element sums of [a1 a2 a3 ... an] in Matlab

Context: I'm working on Project Euler Problem 23 using Matlab in order to practice my barely existing programming skills.
My Problem:
Now I have a vector with roughly 6500 numbers (ranging from 12 to 28122) as elements and want to calculate all the two element sums. That is I only need one instance of every sum, so having calculated a1 + an it's not necessary to calculate an + a1.
Edit for clarification: This includes the sums a1+a1, a2+a2,..., an+an.
The problem is that this is much too slow.
Problem specific constraints:
It's a given that sums 28123 or over aren't necessary to calculate, since those can't be used to solve the problem further.
My approach:
AbundentNumberSumsRaw=[];
for i=1:3490
AbundentNumberSumsRaw=[AbundentNumberSumRaw AbundentNumbers(i)+AbundentNumbers(i:end);
end
This works terribly :p
My Comments:
I'm pretty sure that incrementally increasing the vector AbundentNumbersRaw is bad coding, since that means memory usage will spike unnecessarily. I haven't done so, since a) I don't know what size vector to pre-allocate and b) I couldn't come up with a way to inject the sums into AbundentNumbersRaw in a orderly manner without using some ugly looking nested loops.
"for i=1:3490" is lower than the numbers of elements simply because I checked and saw that all the resulting sums for numbers whose index are above 3490 would be too large for me to use anyway.
I'm pretty sure my main issue is that the program need to do a lot of incremental increases of the vector AbundentNumbersRaw.
Any and all help and suggestions would be much appreciated :)
Cheers
Rasmus
Suppose
a = 28110*rand(6500,1)+12;
then
sums = [
a(1) + a(1:end)
a(2) + a(2:end)
...
];
is the calculation you're after.
You also state that sums whose value goes over 28123 should be discarded.
This can be generalized like so:
% Compute all 2-element sums without repetitions
C = arrayfun(#(x) a(x)+a(x:end), 1:numel(a), 'uniformoutput', false);
C = cat(1, C{:});
% discard sums exceeding threshold
C(C>28123) = [];
or using a loop
% Compute all 2-element sums without repetitions
E = cell(numel(a),1);
for ii = 1:numel(a)
E{ii} = a(ii)+a(ii:end); end
E = cat(1, E{:});
% discard sums exceeding threshold
E(E>28123) = [];
Simple testing shows that arrayfun is somewhat faster than the loop, so I'd go for the arrayfun option.
As your primary problem is to find out, which integers in a given set can be written as the sum of two integers of a different set, I'd choose a different approach:
AbundantNumbers = 1:6500; % replace with the list you generated somewhere else
maxInteger = 28122;
AbundantNumberSum(1:maxInteger) = true; % logical array
for i = 1:length(AbundantNumbers)
sumIndices = AbundantNumbers(i) + AbundantNumbers;
AbundantNumberSum(sumIndices(sumIndices <= maxInteger)) = false;
end
Unfortunantely, this is not an answer to your question but to your problem ;-) For the MatLab way to solve your original question, see the elegant answer of Rody Oldenhuis.
My approach would be the following:
v = 1:3490; % your vector here
s = length(v);
result = zeros(s); % preallocate memory
for m = 1:s
result(m,m:end) = v(m)+v(m:end);
end
You will get a matrix of 3490 x 3490 elements and more than half of them 0.