Iteration of matrix-vector multiplication which stores specific index-positions - matlab

I need to solve a min distance problem, to see some of the work which has being tried take a look at:
link: click here
I have four elements: two column vectors: alpha of dim (px1) and beta of dim (qx1). In this case p = q = 50 giving two column vectors of dim (50x1) each. They are defined as follows:
alpha = alpha = 0:0.05:2;
beta = beta = 0:0.05:2;
and I have two matrices: L1 and L2.
L1 is composed of three column-vectors of dimension (kx1) each.
L2 is composed of three column-vectors of dimension (mx1) each.
In this case, they have equal size, meaning that k = m = 1000 giving: L1 and L2 of dim (1000x3) each. The values of these matrices are predefined.
They have, nevertheless, the following structure:
L1(kx3) = [t1(kx1) t2(kx1) t3(kx1)];
L2(mx3) = [t1(mx1) t2(mx1) t3(mx1)];
The min. distance problem I need to solve is given (mathematically) as follows:
d = min( (x-(alpha_p*t1_k - beta_q*t1_m)).^2 + (y-(alpha_p*t2_k - beta_q*t2_m)).^2 +
(z-(alpha_p*t3_k - beta_q*t3_m)).^2 )
the values x,y,z are three fixed constants.
My problem
I need to develop an iteration which can give me back the index positions from the combination of: alpha, beta, L1 and L2 which fulfills the min-distance problem from above.
I hope the formulation for the problem is clear, I have been very careful with the index notations. But if it is still not so clear... the step size for:
alpha is p = 1,...50
beta is q = 1,...50
for L1; t1, t2, t3 is k = 1,...,1000
for L2; t1, t2, t3 is m = 1,...,1000
And I need to find the index of p, index of q, index of k and index of m which gives me the min. distance to the point x,y,z.
Thanks in advance for your help!

I don't know your values so i wasn't able to check my code. I am using loops because it is the most obvious solution. Pretty sure that someone from the bsxfun-brigarde ( ;-D ) will find a shorter/more effective solution.
alpha = 0:0.05:2;
beta = 0:0.05:2;
L1(kx3) = [t1(kx1) t2(kx1) t3(kx1)];
L2(mx3) = [t1(mx1) t2(mx1) t3(mx1)];
idx_smallest_d =[1,1,1,1];
smallest_d = min((x-(alpha(1)*t1(1) - beta(1)*t1(1))).^2 + (y-(alpha(1)*t2(1) - beta(1)*t2(1))).^2+...
(z-(alpha(1)*t3(1) - beta(1)*t3(1))).^2);
%The min. distance problem I need to solve is given (mathematically) as follows:
for p=1:1:50
for q=1:1:50
for k=1:1:1000
for m=1:1:1000
d = min((x-(alpha(p)*t1(k) - beta(q)*t1(m))).^2 + (y-(alpha(p)*t2(k) - beta(q)*t2(m))).^2+...
(z-(alpha(p)*t3(k) - beta(q)*t3(m))).^2);
if d < smallest_d
smallest_d=d;
idx_smallest_d= [p,q,k,m];
end
end
end
end
end
What I am doing is predefining the smallest distance as the distance of the first combination and then checking for each combination rather the distance is smaller than the previous shortest distance.

Related

Minimize difference between indicator variables in Matlab

I'm new to Matlab and want to write a program that chooses the value of a parameter (P) to minimize the difference between two vectors, where each vector is a variable in a dataframe. The first vector (call it A) is a predetermined vector of 1s and 0s, and the second vector (call it B) has each of its entries determined as an indicator function that depends on the value of the parameter P and other variables in the dataframe. For instance, let C be a third variable in the dataset, so
A = [1, 0, 0, 1, 0]
B = [x, y, z, u, v]
where x = 1 if (C[1]+10)^0.5 - P > (C[1])^0.5 and otherwise x = 0, and similarly, y = 1 if (C[2]+10)^0.5 - P > (C[2])^0.5 and otherwise y = 0, and so on.
I'm not really sure where to start with the code, except that it might be useful to use the fminsearch command. Any suggestions?
Edit: I changed the above by raising to a power, which is closer to the actual example that I have. I'm also providing a complete example in response to a comment:
Let A be as above, and let C = [10, 1, 100, 1000, 1]. Then my goal with the Matlab code would be to choose a value of P to minimize the differences between the coordinates of the vectors A and B, where B[1] = 1 if (10+10)^0.5 - P > (10)^0.5 and otherwise B[1] = 0, and similarly B[2] = 1 if (1+10)^0.5 - P > (1)^0.5 and otherwise B[2] = 0, etc. So I want to choose P to maximize the likelihood that A[1] = B[1], A[2] = B[2], etc.
I have the following setup in Matlab, where ds is the name of my dataset:
ds.B = zeros(size(ds,1),1); % empty vector to fill
for i = 1:size(ds,1)
if ((ds.C(i) + 10)^(0.5) - P > (ds.C(i))^(0.5))
ds.B(i) = 1;
else
ds.B(i) = 0;
end
end
Now I want to choose the value of P to minimize the difference between A and B. How can I do this?
EDIT: I'm also wondering how to do this when the inequality is something like (C[i]+10)^0.5 - P*D[i] > (C[i])^0.5, where D is another variable in my dataset. Now P is a scalar being multiplied rather than just added. This seems more complicated since I can't solve for P exactly. How can I solve the problem in this case?
EDIT 1: It seems fminbnd() isn't optimal, likely due to the stairstep nature of the indicator function. I've updated to test the midpoints of all the regions between indicator function flips, plus endpoints.
EDIT 2: Updated to include dataset D as a coefficient of P.
If you can package your distance calculation up in a single function based on P, you can then search for its minimum.
arraySize = 1000;
ds.A = double(rand([arraySize,1]) > 0.5);
ds.C = rand(size(ds.A));
ds.D = rand(size(ds.A));
B = #(P)double((ds.C+10).^0.5 - P.*ds.D > ds.C.^0.5);
costFcn = #(P)sqrt(sum((ds.A-B(P)).^2));
% Solving the equation (C+10)^0.5 - P*D = C^0.5 for P, and sorting the results
BCrossingPoints = sort(((ds.C+10).^0.5-ds.C.^0.5)./ds.D);
% Taking the average of each crossing point with its neighbors
BMidpoints = (BCrossingPoints(1:end-1)+BCrossingPoints(2:end))/2;
% Appending endpoints onto the midpoints
PsToTest = [BCrossingPoints(1)-0.1; BMidpoints; BCrossingPoints(end)+0.1];
% Calculate the distance from A to B at each P to test
costResult = arrayfun(costFcn,PsToTest);
% Find the minimum cost
[~,lowestCostIndex] = min(costResult);
% Find the optimum P
optimumP = PsToTest(lowestCostIndex);
ds.B = B(optimumP);
semilogx(PsToTest,costResult)
xlabel('P')
ylabel('Distance from A to B')
1.- x is assumed positive real only, because with x<0 then complex values show up.
Since no comment is made in the question it seems reasonable to assume x real and x>0 only.
As requested, P 'the parameter' a scalar, P only has 2 significant states >0 or <0, let's see how is this:
2.- The following lines generate kind-of random A and C.
Then a sweep of p is carried out and distances d1 and d2 are calculated.
d1 is euclidean distance and d2 is the absolute of the difference between A and and B converting both from binary to decimal:
N=10
% A=[1 0 0 1 0]
A=randi([0 1],1,N);
% C=[10 1 1e2 1e3 1]
C=randi([0 1e3],1,N)
p=[-1e4:1:1e4]; % parameter to optimize
B=zeros(1,numel(A));
d1=zeros(1,numel(p)); % euclidean distance
d2=zeros(1,numel(p)); % difference distance
for k1=1:1:numel(p)
B=(C+10).^.5-p(k1)>C.^.5;
d1(k1)=(sum((B-A).^2))^.5;
d2(k1)=abs(sum(A.*2.^[numel(A)-1:-1:0])-sum(B.*2.^[numel(A)-1:-1:0]));
end
figure;
plot(p,d1)
grid on
xlabel('p');title('d1')
figure
plot(p,d2)
grid on
xlabel('p');title('d2')
The only degree of freedom to optimise seems to be the sign of P regardless of |P| value.
3.- f(p,x) has either no root, or just one root, depending upon p
The threshold funtion is
if f(x)>0 then B(k)==1 else B(k)==0
this is
f(p,x)=(x+10)^.5-p-x^.5
Now
(x+10).^.5-p>x.^.5 is same as (x+10).^.5-x.^.5>p
There's a range of p that keeps f(p,x)=0 without any (real) root.
For the particular case p=0 then (x+10).^.5 and x.^.5 do not intersect (until Inf reached = there's no intersection)
figure;plot(x,(x+10).^.5,x,x.^.5);grid on
[![enter image description here][3]][3]
y2=diff((x+10).^.5-x.^.5)
figure;plot(x(2:end),y2);
grid on;xlabel('x')
title('y2=diff((x+10).^.5-x.^.5)')
[![enter image description here][3]][3]
% 005
This means the condition f(x)>0 is always true holding all bits of B=1. With B=1 then d(A,B) turns into d(A,1), a constant.
However, for a certain value of p then there's one root and f(x)>0 is always false keeping all bits of B=0.
In this case d(A,B) the cost function turns into d(A,0) and this is A itself.
4.- P as a vector
The optimization gains in degrees of freedom if instead of P scalar, P is considered as vector.
For a given x there's a value of p that switches B(k) from 0 to 1.
Any value of p below such threshold keeps B(k)=0.
Equivalently, inverting f(x) :
g(p)=(10-p^2)^2/(4*p^2)>x
Values of x below this threshold bring B closer to A because for each element of B it's flipped to the element value of A.
Therefore, it's convenient to consider P as a vector, not a ascalar, and :
For all, or as many (as possible) elements of C to meet c(k)<(10-p^2)^2/(4*p^2) in order to get C=A or
minimize d(A,C)
5.- roots of f(p,x)
syms t positive
p=[-1000:.1:1000];
zp=NaN*ones(1,numel(p));
sol=zeros(1,numel(p));
for k1=1:1:numel(p)
p(k1)
eq1=(t+10)^.5-p(k1)-t^.5-p(k1)==0;
s1=solve(eq1,t);
if ~isempty(s1)
zp(k1)=s1;
end
end
nzp=~isnan(zp);
zp(nzp)
returns
=
620.0100 151.2900 64.5344 34.2225 20.2500 12.7211
8.2451 5.4056 3.5260 2.2500 1.3753 0.7803
0.3882 0.1488 0.0278

Optimization of matrix on matlab using fmincon

I have a 30x30 matrix as a base matrix (OD_b1), I also have two base vectors (bg and Ag). My aim is to optimize a matrix (X) who's dimensions are 30X30 such that:
1) the squared difference between vector (bg) and vector of sum of all the columns is minimized.
2)the squared difference between vector (Ag) and vector of sum of all rows is minimized.
3)the squared difference between the elements of matrix (X) and matrix (OD_b1) is minimized.
The mathematical form of the equation is as follows:
I have tried this:
fun=#(X)transpose(bg-sum(X,2))*(bg-sum(X,2))+ (Ag-sum(X,1))*transpose(Ag-sum(X,1))+sumsqr(X_b-X);
[val,X]=fmincon(fun,OD_b1,AA,BB,Aeq,beq,LB,UB)
I don't get errors but it seems like it's stuck.
Is it because I have too many variables or is there another reason?
Thanks in advance
This is a simple, unconstrained least squares problem and hence has a simple solution that can be expressed as the solution to a linear system.
I will show you (1) the precise and efficient way to solve this and (2) how to solve with fmincon.
The precise, efficient solution:
Problem setup
Just so we're on the same page, I initialize the variables as follows:
n = 30;
Ag = randn(n, 1); % observe the dimensions
X_b = randn(n, n);
bg = randn(n, 1);
The code:
A1 = kron(ones(1,n), eye(n));
A2 = kron(eye(n), ones(1,n));
A = (A1'*A1 + A2'*A2 + eye(n^2));
b = A1'*bg + A2'*Ag + X_b(:);
x = A \ b; % solves A*x = b
Xstar = reshape(x, n, n);
Why it works:
I first reformulated your problem so the objective is a vector x, not a matrix X. Observe that z = bg - sum(X,2) is equivalent to:
x = X(:) % vectorize X
A1 = kron(ones(1,n), eye(n)); % creates a special matrix that sums up
% stuff appropriately
z = A1*x;
Similarly, A2 is setup so that A2*x is equivalent to Ag'-sum(X,1). Your problem is then equivalent to:
minimize (over x) (bg - A1*x)'*(bg - A1*x) + (Ag - A2*x)'*(Ag - A2*x) + (y - x)'*(y-x) where y = Xb(:). That is, y is a vectorized version of Xb.
This problem is convex and the first order condition is a necessary and sufficient condition for the optimum. Take the derivative with respect to x and that equation will define your solution! Sample example math for almost equivalent (but slightly simpler problem is below):
minimize(over x) (b - A*x)'*(b - A*x) + (y - x)' * (y - x)
rewriting the objective:
b'b- b'Ax - x'A'b + x'A'Ax +y'y - 2y'x+x'x
Is equivalent to:
minimize(over x) (-2 b'A - 2y'*I) x + x' ( A'A + I) * x
the first order condition is:
(A'A+I+(A'A+I)')x -2A'b-2I'y = 0
(A'A+I) x = A'b+I'y
Your problem is essentially the same. It has the first order condition:
(A1'*A1 + A2'*A2 + I)*x = A1'*bg + A2'*Ag + y
How to solve with fmincon
You can do the following:
f = #(X) transpose(bg-sum(X,2))*(bg-sum(X,2)) + (Ag'-sum(X,1))*transpose(Ag'-sum(X,1))+sum(sum((X_b-X).^2));
o = optimoptions('fmincon');%MaxFunEvals',30000);
o.MaxFunEvals = 30000;
Xstar2 = fmincon(f,zeros(n,n),[],[],[],[],[],[],[],o);
You can then check the answers are about the same with:
normdif = norm(Xstar - Xstar2)
And you can see that gap is small, but that the linear algebra based solution is somewhat more precise:
gap = f(Xstar2) - f(Xstar)
If the fmincon approach hangs, try it with a smaller n just to gain confidence that my linear algebra based solution is more precise, way way faster etc... n = 30 is solving a 30^2 = 900 variable optimization problem: not easy. With the linear algebra approach, you can go up to n = 100 (i.e. 10000 variable problem) or even larger.
I would probably solve this as a QP using quadprog using the following reformulation (keeping the objective as simple as possible to make the problem "less nonlinear"):
min sum(i,v(i)^2)+sum(i,w(i)^2)+sum((i,j),z(i,j)^2)
v = bg - sum(c,x)
w = ag - sum(r,x)
Z = xbase-x
The QP solver is more precise (no gradients using finite differences). This approach also allows you to add additional bounds and linear equality and inequality constraints.
The other suggestion to form the first order conditions explicitly is also a good one: it also has no issue with imprecise gradients (the first order conditions are linear). I usually prefer a quadratic model because of its flexibility.

bsxfun implementation in solving a min. optimization task

I really need help with this one.
I have to matrices L1 and L2, both are (500x3) of size.
First of all, I compute the difference of every element of each column of L1 from L2 as follows:
lib1 = bsxfun(#minus, L1(:,1)',L2(:,1));
lib1=lib1(:);
lib2 = bsxfun(#minus, L1(:,2)',L2(:,2));
lib2=lib2(:);
lib3 = bsxfun(#minus, L1(:,3)',L2(:,3));
lib3=lib3(:);
LBR = [lib1 lib2 lib3];
The result is this matrix LBR. Then I have a min-problem to solve:
[d,p] = min((LBR(:,1) - var1).^2 + (LBR(:,2) - var2).^2 + (LBR(:,3) - var3).^2);
Which returns the point p where this min-problem is fulfied. Finally I can go back to my matrices L1 and L2 to find the index-positions of the values which satisfy this min-problem. I done this as follows:
[minindex_alongL2, minindex_alongL1] = ind2sub(size(L1),p);
This is OK. But what I need now is:
I have to multiply , take the tensor-product, also called Kronecker product of a vector called alpha to LBR, alpha is given as follows:
alpha = 0:0.1:2;
And, this Kronecker product I have computed as follows:
val = bsxfun(#times,LBR,permute(alpha,[3 1 2]));
LBR = reshape(permute(val,[1 3 2]),size(val,1)*size(val,3),[]);
what I need now is: I need to solve the same minproblem:
[d,p] = min((LBR(:,1) - var1).^2 + (LBR(:,2) - var2).^2 + (LBR(:,3) - var3).^2);
but, this time, in addition of finding the index-positions and values from L1 and L2 which satisfies this min-problem, I need to find the index position of the single value from the alpha vector which has been multiplied and which fulfills the min-problem. I don't have idea how can I do this so any help will be very appreciated!
Thanks in advance!
Ps: I can post the L1 and L2 matrices if needed.
I believe you need this correction in your code -
[minindex_alongL2, minindex_alongL1] = ind2sub([size(L2,1) size(L1,1)],p)
For the solution, you need to add the size of p into the index finding in the last step as the vector whose min is calculated has the "added influence" of alpha -
[minindex_alongL2, minindex_alongL1,minindex_alongalpha] = ind2sub([size(L2,1) size(L1,1) numel(alpha)],p)
minindex_alongalpha might be of your interest.

Matlab using interp1 to find the index?

I have an array of Fa which contains values I found from a function. Is there a way to use interp1 function in Matlab to find the index at which a specific value occurs? I have found tutorials for interp1 which I can find a specific value in the array using interp1 by knowing the corresponding index value.
Example from http://www.mathworks.com/help/matlab/ref/interp1.html:
Here are two vectors representing the census years from 1900 to 1990 and the corresponding United States population in millions of people.
t = 1900:10:1990;
p = [75.995 91.972 105.711 123.203 131.669...
150.697 179.323 203.212 226.505 249.633];
The expression interp1(t,p,1975) interpolates within the census data to estimate the population in 1975. The result is
ans =
214.8585
- but I want to find the t value for 214.8585.
In some sense, you want to find roots of a function -
f(x)-val
First of all, there might be several answers. Second, since the function is piecewise linear, you can check each segment by solving the relevant linear equation.
For example, suppose that you have this data:
t = 1900:10:1990;
p = [75.995 91.972 105.711 123.203 131.669...
150.697 179.323 70.212 226.505 249.633];
And you want to find the value 140
val = 140;
figure;plot(t,p);hold on;
plot( [min(t),max(t)], [val val],'r');
You should first subtract the value of val from p,
p1 = p - val;
Now you want only the segments in which p1 sign changes, either from + -> -, or vice versa.
segments = abs(diff(sign(p1)==1));
In each of these segments, you can solve the relevant linear equation a*x+b==0, and find the root. That is the index of your value.
for i=1:numel(segments)
x(1) = t(segments(i));
x(2) = t(segments(i)+1);
y(1) = p1(segments(i));
y(2) = p1(segments(i)+1);
m = (y(2)-y(1))/(x(2)-x(1));
n = y(2) - m * x(2);
index = -n/m;
scatter(index, val ,'g');
end
And here is the result:
You can search for the value in Fa directly:
idx = Fa==value_to_find;
To find the index use find function:
find(Fa==value_to_find);
Of course, this works only if the value_to_find is present in Fa. But as I understand it, this is what you want. You do not need interp for that.
If on the other hand the value might not be present in Fa, but Fa is sorted, you can search for values larger than value_to_find and take the first such index:
find(Fa>=value_to_find,1);
If your problem is more complicated than that, look at Andreys answer.
Andrey's solution works in principle, but the code presented here does not. The problem is with the definition of the segments, which yields a vector of 0's and 1's, whereafter the call to "t(segments(i))" results in an error (I tried to copy & paste the code - I hope I did not fail in that simple task).
I made a small change to the definition of the segments. It might be done more elegantly. Here it is:
t = 1900:10:1990;
p = [75.995 91.972 105.711 123.203 131.669...
150.697 179.323 70.212 226.505 249.633];
val = 140;
figure;plot(t,p,'.-');hold on;
plot( [min(t),max(t)], [val val],'r');
p1 = p - val;
tn = 1:length(t);
segments = tn([abs(diff(sign(p1)==1)) 0].*tn>0);
for i=1:numel(segments)
x(1) = t(segments(i));
x(2) = t(segments(i)+1);
y(1) = p1(segments(i));
y(2) = p1(segments(i)+1);
m = (y(2)-y(1))/(x(2)-x(1));
n = y(2) - m * x(2);
index = -n/m;
scatter(index, val ,'g');
end
interpolate the entire function to a higher precision. Then search.
t = 1900:10:1990;
p = [75.995 91.972 105.711 123.203 131.669...
150.697 179.323 203.212 226.505 249.633];
precision = 0.5;
ti = 1900:precision:1990;
pi = interp1(t,p,ti);
now pi holds all pi values for every half a year. Assuming the values always increase you could find the year by max(ti(pi < x)) where x = 214.8585. Here pi < x creates a logical vector used to filter ti to only provide the years when p is less than x. max() is then used to take the most recent year, which will also be closest to x if the assumption that p is always increasing holds.
The answer to the most general case was given above by Andrey, and I agree with it.
For the example that you stated, a simple particular solution would be:
interp1(p,t,214.8585)
In this case you are solving for the year when a given population is known.
This approach will NOT work when there is more than one solution. If you try this with Andrey's values you will only get the first solution to the problem.

How can I speed up this call to quantile in Matlab?

I have a MATLAB routine with one rather obvious bottleneck. I've profiled the function, with the result that 2/3 of the computing time is used in the function levels:
The function levels takes a matrix of floats and splits each column into nLevels buckets, returning a matrix of the same size as the input, with each entry replaced by the number of the bucket it falls into.
To do this I use the quantile function to get the bucket limits, and a loop to assign the entries to buckets. Here's my implementation:
function [Y q] = levels(X,nLevels)
% "Assign each of the elements of X to an integer-valued level"
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
if isvector(q)
q=transpose(q);
end
Y = zeros(size(X));
for i = 1:nLevels
% "The variables g and l indicate the entries that are respectively greater than
% or less than the relevant bucket limits. The line Y(g & l) = i is assigning the
% value i to any element that falls in this bucket."
if i ~= nLevels % "The default; doesnt include upper bound"
g = bsxfun(#ge,X,q(i,:));
l = bsxfun(#lt,X,q(i+1,:));
else % "For the final level we include the upper bound"
g = bsxfun(#ge,X,q(i,:));
l = bsxfun(#le,X,q(i+1,:));
end
Y(g & l) = i;
end
Is there anything I can do to speed this up? Can the code be vectorized?
If I understand correctly, you want to know how many items fell in each bucket.
Use:
n = hist(Y,nbins)
Though I am not sure that it will help in the speedup. It is just cleaner this way.
Edit : Following the comment:
You can use the second output parameter of histc
[n,bin] = histc(...) also returns an index matrix bin. If x is a vector, n(k) = >sum(bin==k). bin is zero for out of range values. If x is an M-by-N matrix, then
How About this
function [Y q] = levels(X,nLevels)
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
Y = zeros(size(X));
for i = 1:numel(q)-1
Y = Y+ X>=q(i);
end
This results in the following:
>>X = [3 1 4 6 7 2];
>>[Y, q] = levels(X,2)
Y =
1 1 2 2 2 1
q =
1 3.5 7
You could also modify the logic line to ensure values are less than the start of the next bin. However, I don't think it is necessary.
I think you shoud use histc
[~,Y] = histc(X,q)
As you can see in matlab's doc:
Description
n = histc(x,edges) counts the number of values in vector x that fall
between the elements in the edges vector (which must contain
monotonically nondecreasing values). n is a length(edges) vector
containing these counts. No elements of x can be complex.
I made a couple of refinements (including one inspired by Aero Engy in another answer) that have resulted in some improvements. To test them out, I created a random matrix of a million rows and 100 columns to run the improved functions on:
>> x = randn(1000000,100);
First, I ran my unmodified code, with the following results:
Note that of the 40 seconds, around 14 of them are spent computing the quantiles - I can't expect to improve this part of the routine (I assume that Mathworks have already optimized it, though I guess that to assume makes an...)
Next, I modified the routine to the following, which should be faster and has the advantage of being fewer lines as well!
function [Y q] = levels(X,nLevels)
p = linspace(0, 1.0, nLevels+1);
q = quantile(X,p);
if isvector(q), q = transpose(q); end
Y = ones(size(X));
for i = 2:nLevels
Y = Y + bsxfun(#ge,X,q(i,:));
end
The profiling results with this code are:
So it is 15 seconds faster, which represents a 150% speedup of the portion of code that is mine, rather than MathWorks.
Finally, following a suggestion of Andrey (again in another answer) I modified the code to use the second output of the histc function, which assigns entries to bins. It doesn't treat the columns independently, so I had to loop over the columns manually, but it seems to be performing really well. Here's the code:
function [Y q] = levels(X,nLevels)
p = linspace(0,1,nLevels+1);
q = quantile(X,p);
if isvector(q), q = transpose(q); end
q(end,:) = 2 * q(end,:);
Y = zeros(size(X));
for k = 1:size(X,2)
[junk Y(:,k)] = histc(X(:,k),q(:,k));
end
And the profiling results:
We now spend only 4.3 seconds in codes outside the quantile function, which is around a 500% speedup over what I wrote originally. I've spent a bit of time writing this answer because I think it's turned into a nice example of how you can use the MATLAB profiler and StackExchange in combination to get much better performance from your code.
I'm happy with this result, although of course I'll continue to be pleased to hear other answers. At this stage the main performance increase will come from increasing the performance of the part of the code that currently calls quantile. I can't see how to do this immediately, but maybe someone else here can. Thanks again!
You can sort the columns and divide+round the inverse indexes:
function Y = levels(X,nLevels)
% "Assign each of the elements of X to an integer-valued level"
[S,IX]=sort(X);
[grid1,grid2]=ndgrid(1:size(IX,1),1:size(IX,2));
invIX=zeros(size(X));
invIX(sub2ind(size(X),IX(:),grid2(:)))=grid1;
Y=ceil(invIX/size(X,1)*nLevels);
Or you can use tiedrank:
function Y = levels(X,nLevels)
% "Assign each of the elements of X to an integer-valued level"
R=tiedrank(X);
Y=ceil(R/size(X,1)*nLevels);
Surprisingly, both these solutions are slightly slower than the quantile+histc solution.