Ridge regression in Matlab produces different results than direct calculation

Ridge regression in Matlab produces different results than direct calculation - matlab

I am trying to run Ridge regression in Matlab (ridge): L is some matrix, x is some random vector and y=Lx+αn is another vector. I expect ridge(y,L,α) to return the same result as:
(LL'+α^2I)^(−1)L'y, but the latter is significantly better. I can't understand the problem as I think this is exactly what ridge() does. I even tried with α=1.
For example:
n = randn(N^2,1);
n2 = randn(N^2,1);
L = (some N^2*N^2 matrix);
y = L*n + n2;
x_ridge = ridge(y,L,1);
x_ls = (L*L' + eye(N^2))^-1*L'*y;
and x_ls and x_ridge are significantly different.
Appreciate any help!

Related

Something's wrong with my Logistic Regression?

I'm trying to verify if my implementation of Logistic Regression in Matlab is good. I'm doing so by comparing the results I get via my implementation with the results given by the built-in function mnrfit.
The dataset D,Y that I have is such that each row of D is an observation in R^2 and the labels in Y are either 0 or 1. Thus, D is a matrix of size (n,2), and Y is a vector of size (n,1)
Here's how I do my implementation:
I first normalize my data and augment it to include the offset :
d = 2; %dimension of data
M = mean(D) ;
centered = D-repmat(M,n,1) ;
devs = sqrt(sum(centered.^2)) ;
normalized = centered./repmat(devs,n,1) ;
X = [normalized,ones(n,1)];
I will be doing my calculations on X.
Second, I define the gradient and hessian of the likelihood of Y|X:
function grad = gradient(w)
grad = zeros(1,d+1) ;
for i=1:n
grad = grad + (Y(i)-sigma(w'*X(i,:)'))*X(i,:) ;
end
end
function hess = hessian(w)
hess = zeros(d+1,d+1) ;
for i=1:n
hess = hess - sigma(w'*X(i,:)')*sigma(-w'*X(i,:)')*X(i,:)'*X(i,:) ;
end
end
with sigma being a Matlab function encoding the sigmoid function z-->1/(1+exp(-z)).
Third, I run the Newton algorithm on gradient to find the roots of the gradient of the likelihood. I implemented it myself. It behaves as expected as the norm of the difference between the iterates goes to 0. I wrote it based on this script.
I verified that the gradient at the wOPT returned by my Newton implementation is null:
gradient(wOP)
ans =
1.0e-15 *
0.0139 -0.0021 0.2290
and that the hessian has strictly negative eigenvalues
eig(hessian(wOPT))
ans =
-7.5459
-0.0027
-0.0194
Here's the wOPT I get with my implementation:
wOPT =
-110.8873
28.9114
1.3706
the offset being the last element. In order to plot the decision line, I should convert the slope wOPT(1:2) using M and devs. So I set :
my_offset = wOPT(end);
my_slope = wOPT(1:d)'.*devs + M ;
and I get:
my_slope =
1.0e+03 *
-7.2109 0.8166
my_offset =
1.3706
Now, when I run B=mnrfit(D,Y+1), I get
B =
-1.3496
1.7052
-1.0238
The offset is stored in B(1).
I get very different values. I would like to know what I am doing wrong. I have some doubt about the normalization and 'un-normalization' process. But I'm not sure, may be I'm doing something else wrong.
Additional Info
When I tape :
B=mnrfit(normalized,Y+1)
I get
-1.3706
110.8873
-28.9114
which is a rearranged version of the opposite of my wOPT. It contains exactly the same elements.
It seems likely that my scaling back of the learnt parameters is wrong. Otherwise, it would have given the same as B=mnrfit(D,Y+1)

Optimization of matrix on matlab using fmincon

I have a 30x30 matrix as a base matrix (OD_b1), I also have two base vectors (bg and Ag). My aim is to optimize a matrix (X) who's dimensions are 30X30 such that:
1) the squared difference between vector (bg) and vector of sum of all the columns is minimized.
2)the squared difference between vector (Ag) and vector of sum of all rows is minimized.
3)the squared difference between the elements of matrix (X) and matrix (OD_b1) is minimized.
The mathematical form of the equation is as follows:
I have tried this:
fun=#(X)transpose(bg-sum(X,2))*(bg-sum(X,2))+ (Ag-sum(X,1))*transpose(Ag-sum(X,1))+sumsqr(X_b-X);
[val,X]=fmincon(fun,OD_b1,AA,BB,Aeq,beq,LB,UB)
I don't get errors but it seems like it's stuck.
Is it because I have too many variables or is there another reason?
Thanks in advance

This is a simple, unconstrained least squares problem and hence has a simple solution that can be expressed as the solution to a linear system.
I will show you (1) the precise and efficient way to solve this and (2) how to solve with fmincon.
The precise, efficient solution:
Problem setup
Just so we're on the same page, I initialize the variables as follows:
n = 30;
Ag = randn(n, 1); % observe the dimensions
X_b = randn(n, n);
bg = randn(n, 1);
The code:
A1 = kron(ones(1,n), eye(n));
A2 = kron(eye(n), ones(1,n));
A = (A1'*A1 + A2'*A2 + eye(n^2));
b = A1'*bg + A2'*Ag + X_b(:);
x = A \ b; % solves A*x = b
Xstar = reshape(x, n, n);
Why it works:
I first reformulated your problem so the objective is a vector x, not a matrix X. Observe that z = bg - sum(X,2) is equivalent to:
x = X(:) % vectorize X
A1 = kron(ones(1,n), eye(n)); % creates a special matrix that sums up
% stuff appropriately
z = A1*x;
Similarly, A2 is setup so that A2*x is equivalent to Ag'-sum(X,1). Your problem is then equivalent to:
minimize (over x) (bg - A1*x)'*(bg - A1*x) + (Ag - A2*x)'*(Ag - A2*x) + (y - x)'*(y-x) where y = Xb(:). That is, y is a vectorized version of Xb.
This problem is convex and the first order condition is a necessary and sufficient condition for the optimum. Take the derivative with respect to x and that equation will define your solution! Sample example math for almost equivalent (but slightly simpler problem is below):
minimize(over x) (b - A*x)'*(b - A*x) + (y - x)' * (y - x)
rewriting the objective:
b'b- b'Ax - x'A'b + x'A'Ax +y'y - 2y'x+x'x
Is equivalent to:
minimize(over x) (-2 b'A - 2y'*I) x + x' ( A'A + I) * x
the first order condition is:
(A'A+I+(A'A+I)')x -2A'b-2I'y = 0
(A'A+I) x = A'b+I'y
Your problem is essentially the same. It has the first order condition:
(A1'*A1 + A2'*A2 + I)*x = A1'*bg + A2'*Ag + y
How to solve with fmincon
You can do the following:
f = #(X) transpose(bg-sum(X,2))*(bg-sum(X,2)) + (Ag'-sum(X,1))*transpose(Ag'-sum(X,1))+sum(sum((X_b-X).^2));
o = optimoptions('fmincon');%MaxFunEvals',30000);
o.MaxFunEvals = 30000;
Xstar2 = fmincon(f,zeros(n,n),[],[],[],[],[],[],[],o);
You can then check the answers are about the same with:
normdif = norm(Xstar - Xstar2)
And you can see that gap is small, but that the linear algebra based solution is somewhat more precise:
gap = f(Xstar2) - f(Xstar)
If the fmincon approach hangs, try it with a smaller n just to gain confidence that my linear algebra based solution is more precise, way way faster etc... n = 30 is solving a 30^2 = 900 variable optimization problem: not easy. With the linear algebra approach, you can go up to n = 100 (i.e. 10000 variable problem) or even larger.

I would probably solve this as a QP using quadprog using the following reformulation (keeping the objective as simple as possible to make the problem "less nonlinear"):
min sum(i,v(i)^2)+sum(i,w(i)^2)+sum((i,j),z(i,j)^2)
v = bg - sum(c,x)
w = ag - sum(r,x)
Z = xbase-x
The QP solver is more precise (no gradients using finite differences). This approach also allows you to add additional bounds and linear equality and inequality constraints.
The other suggestion to form the first order conditions explicitly is also a good one: it also has no issue with imprecise gradients (the first order conditions are linear). I usually prefer a quadratic model because of its flexibility.

My example shows SVD is less numerically stable than QR decomposition

I asked this question in Math Stackexchange, but it seems it didn't get enough attention there so I am asking it here. https://math.stackexchange.com/questions/1729946/why-do-we-say-svd-can-handle-singular-matrx-when-doing-least-square-comparison?noredirect=1#comment3530971_1729946
I learned from some tutorials that SVD should be more stable than QR decomposition when solving Least Square problem, and it is able to handle singular matrix. But the following example I wrote in matlab seems to support the opposite conclusion. I don't have a deep understanding of SVD, so if you could look at my questions in the old post in Math StackExchange and explain it to me, I would appreciate a lot.
I use a matrix that have a large condition number(e+13). The result shows SVD get a much larger error(0.8) than QR(e-27)
% we do a linear regression between Y and X
data= [
47.667483331 -122.1070832;
47.667483331001 -122.1070832
];
X = data(:,1);
Y = data(:,2);
X_1 = [ones(length(X),1),X];
%%
%SVD method
[U,D,V] = svd(X_1,'econ');
beta_svd = V*diag(1./diag(D))*U'*Y;
%% QR method(here one can also use "\" operator, which will get the same result as I tested. I just wrote down backward substitution to educate myself)
[Q,R] = qr(X_1)
%now do backward substitution
[nr nc] = size(R)
beta_qr=[]
Y_1 = Q'*Y
for i = nc:-1:1
s = Y_1(i)
for j = m:-1:i+1
s = s - R(i,j)*beta_qr(j)
end
beta_qr(i) = s/R(i,i)
end
svd_error = 0;
qr_error = 0;
for i=1:length(X)
svd_error = svd_error + (Y(i) - beta_svd(1) - beta_svd(2) * X(i))^2;
qr_error = qr_error + (Y(i) - beta_qr(1) - beta_qr(2) * X(i))^2;
end

You SVD-based approach is basically the same as the pinv function in MATLAB (see Pseudo-inverse and SVD). What you are missing though (for numerical reasons) is using a tolerance value such that any singular values less than this tolerance are treated as zero.
If you refer to edit pinv.m, you can see something like the following (I won't post the exact code here because the file is copyrighted to MathWorks):
[U,S,V] = svd(A,'econ');
s = diag(S);
tol = max(size(A)) * eps(norm(s,inf));
% .. use above tolerance to truncate singular values
invS = diag(1./s);
out = V*invS*U';
In fact pinv has a second syntax where you can explicitly specify the tolerance value pinv(A,tol) if the default one is not suitable...
So when solving a least-squares problem of the form minimize norm(A*x-b), you should understand that the pinv and mldivide solutions have different properties:
x = pinv(A)*b is characterized by the fact that norm(x) is smaller than the norm of any other solution.
x = A\b has the fewest possible nonzero components (i.e sparse).
Using your example (note that rcond(A) is very small near machine epsilon):
data = [
47.667483331 -122.1070832;
47.667483331001 -122.1070832
];
A = [ones(size(data,1),1), data(:,1)];
b = data(:,2);
Let's compare the two solutions:
x1 = A\b;
x2 = pinv(A)*b;
First you can see how mldivide returns a solution x1 with one zero component (this is obviously a valid solution because you can solve both equations by multiplying by zero as in b + a*0 = b):
>> sol = [x1 x2]
sol =
-122.1071 -0.0537
0 -2.5605
Next you see how pinv returns a solution x2 with a smaller norm:
>> nrm = [norm(x1) norm(x2)]
nrm =
122.1071 2.5611
Here is the error of both solutions which is acceptably very small:
>> err = [norm(A*x1-b) norm(A*x2-b)]
err =
1.0e-11 *
0 0.1819
Note that use mldivide, linsolve, or qr will give pretty much same results:
>> x3 = linsolve(A,b)
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND = 2.159326e-16.
x3 =
-122.1071
0
>> [Q,R] = qr(A); x4 = R\(Q'*b)
x4 =
-122.1071
0

SVD can handle rank-deficiency. The diagonal matrix D has a near-zero element in your code and you need use pseudoinverse for SVD, i.e. set the 2nd element of 1./diag(D) to 0 other than the huge value (10^14). You should find SVD and QR have equally good accuracy in your example. For more information, see this document http://www.cs.princeton.edu/courses/archive/fall11/cos323/notes/cos323_f11_lecture09_svd.pdf

Try this SVD version called block SVD - you just set the iterations equal to the accuracy you want - usually 1 is enough. If you want all the factors (this has a default # selected for factor reduction) then edit the line k= to the size(matrix) if I recall my MATLAB correctly
A= randn(100,5000);
A=corr(A);
% A is your correlation matrix
tic
k = 1000; % number of factors to extract
bsize = k +50;
block = randn(size(A,2),bsize);
iter = 2; % could set via tolerance
[block,R] = qr(A*block,0);
for i=1:iter
[block,R] = qr(A*(A'*block),0);
end
M = block'*A;
% Economy size dense SVD.
[U,S] = svd(M,0);
U = block*U(:,1:k);
S = S(1:k,1:k);
% Note SVD of a symmetric matrix is:
% A = U*S*U' since V=U in this case, S=eigenvalues, U=eigenvectors
V=real(U*sqrt(S)); %scaling matrix for simulation
toc
% reduced randomized matrix for simulation
sims = 2000;
randnums = randn(k,sims);
corrrandnums = V*randnums;
est_corr_matrix = corr(corrrandnums');
total_corrmatrix_difference =sum(sum(est_corr_matrix-A))

Vectorizing the solution of a linear equation system in MATLAB

Summary: This question deals with the improvement of an algorithm for the computation of linear regression.
I have a 3D (dlMAT) array representing monochrome photographs of the same scene taken at different exposure times (the vector IT) . Mathematically, every vector along the 3rd dimension of dlMAT represents a separate linear regression problem that needs to be solved. The equation whose coefficients need to be estimated is of the form:
DL = R*IT^P, where DL and IT are obtained experimentally and R and P must be estimated.
The above equation can be transformed into a simple linear model after applying a logarithm:
log(DL) = log(R) + P*log(IT) => y = a + b*x
Presented below is the most "naive" way to solve this system of equations, which essentially involves iterating over all "3rd dimension vectors" and fitting a polynomial of order 1 to (IT,DL(ind1,ind2,:):
%// Define some nominal values:
R = 0.3;
IT = 600:600:3000;
P = 0.97;
%// Impose some believable spatial variations:
pMAT = 0.01*randn(3)+P;
rMAT = 0.1*randn(3)+R;
%// Generate "fake" observation data:
dlMAT = bsxfun(#times,rMAT,bsxfun(#power,permute(IT,[3,1,2]),pMAT));
%// Regression:
sol = cell(size(rMAT)); %// preallocation
for ind1 = 1:size(dlMAT,1)
for ind2 = 1:size(dlMAT,2)
sol{ind1,ind2} = polyfit(log(IT(:)),log(squeeze(dlMAT(ind1,ind2,:))),1);
end
end
fittedP = cellfun(#(x)x(1),sol); %// Estimate of pMAT
fittedR = cellfun(#(x)exp(x(2)),sol); %// Estimate of rMAT
The above approach seems like a good candidate for vectorization, since it does not utilize MATLAB's main strength that is MATrix operations. For this reason, it does not scale very well and takes much longer to execute than I think it should.
There exist alternative ways to perform this computation based on matrix division, as demonstrated here and here, which involve something like this:
sol = [ones(size(x)),log(x)]\log(y);
That is, appending a vector of 1s to the observations, followed by mldivide to solve the equation system.
The main challenge I'm facing is how to adapt my data to the algorithm (or vice versa).
Question #1: How can the matrix-division-based solution be extended to solve the problem presented above (and potentially replace the loops I am using)?
Question #2 (bonus): What is the principle behind this matrix-division-based solution?

The secret ingredient behind the solution that includes matrix division is the Vandermonde matrix. The question discusses a linear problem (linear regression), and those can always be formulated as a matrix problem, which \ (mldivide) can solve in a mean-square error sense‡. Such an algorithm, solving a similar problem, is demonstrated and explained in this answer.
Below is benchmarking code that compares the original solution with two alternatives suggested in chat1, 2 :
function regressionBenchmark(numEl)
clc
if nargin<1, numEl=10; end
%// Define some nominal values:
R = 5;
IT = 600:600:3000;
P = 0.97;
%// Impose some believable spatial variations:
pMAT = 0.01*randn(numEl)+P;
rMAT = 0.1*randn(numEl)+R;
%// Generate "fake" measurement data using the relation "DL = R*IT.^P"
dlMAT = bsxfun(#times,rMAT,bsxfun(#power,permute(IT,[3,1,2]),pMAT));
%% // Method1: loops + polyval
disp('-------------------------------Method 1: loops + polyval')
tic; [fR,fP] = method1(IT,dlMAT); toc;
fprintf(1,'Regression performance:\nR: %d\nP: %d\n',norm(fR-rMAT,1),norm(fP-pMAT,1));
%% // Method2: loops + Vandermonde
disp('-------------------------------Method 2: loops + Vandermonde')
tic; [fR,fP] = method2(IT,dlMAT); toc;
fprintf(1,'Regression performance:\nR: %d\nP: %d\n',norm(fR-rMAT,1),norm(fP-pMAT,1));
%% // Method3: vectorized Vandermonde
disp('-------------------------------Method 3: vectorized Vandermonde')
tic; [fR,fP] = method3(IT,dlMAT); toc;
fprintf(1,'Regression performance:\nR: %d\nP: %d\n',norm(fR-rMAT,1),norm(fP-pMAT,1));
function [fittedR,fittedP] = method1(IT,dlMAT)
sol = cell(size(dlMAT,1),size(dlMAT,2));
for ind1 = 1:size(dlMAT,1)
for ind2 = 1:size(dlMAT,2)
sol{ind1,ind2} = polyfit(log(IT(:)),log(squeeze(dlMAT(ind1,ind2,:))),1);
end
end
fittedR = cellfun(#(x)exp(x(2)),sol);
fittedP = cellfun(#(x)x(1),sol);
function [fittedR,fittedP] = method2(IT,dlMAT)
sol = cell(size(dlMAT,1),size(dlMAT,2));
for ind1 = 1:size(dlMAT,1)
for ind2 = 1:size(dlMAT,2)
sol{ind1,ind2} = flipud([ones(numel(IT),1) log(IT(:))]\log(squeeze(dlMAT(ind1,ind2,:)))).'; %'
end
end
fittedR = cellfun(#(x)exp(x(2)),sol);
fittedP = cellfun(#(x)x(1),sol);
function [fittedR,fittedP] = method3(IT,dlMAT)
N = 1; %// Degree of polynomial
VM = bsxfun(#power, log(IT(:)), 0:N); %// Vandermonde matrix
result = fliplr((VM\log(reshape(dlMAT,[],size(dlMAT,3)).')).');
%// Compressed version:
%// result = fliplr(([ones(numel(IT),1) log(IT(:))]\log(reshape(dlMAT,[],size(dlMAT,3)).')).');
fittedR = exp(real(reshape(result(:,2),size(dlMAT,1),size(dlMAT,2))));
fittedP = real(reshape(result(:,1),size(dlMAT,1),size(dlMAT,2)));
The reason why method 2 can be vectorized into method 3 is essentially that matrix multiplication can be separated by the columns of the second matrix. If A*B produces matrix X, then by definition A*B(:,n) gives X(:,n) for any n. Moving A to the right-hand side with mldivide, this means that the divisions A\X(:,n) can be done in one go for all n with A\X. The same holds for an overdetermined system (linear regression problem), in which there is no exact solution in general, and mldivide finds the matrix that minimizes the mean-square error. In this case too, the operations A\X(:,n) (method 2) can be done in one go for all n with A\X (method 3).
The implications of improving the algorithm when increasing the size of dlMAT can be seen below:
For the case of 500*500 (or 2.5E5) elements, the speedup from Method 1 to Method 3 is about x3500!
It is also interesting to observe the output of profile (here, for the case of 500*500):
Method 1
Method 2
Method 3
From the above it is seen that rearranging the elements via squeeze and flipud takes up about half (!) of the runtime of Method 2. It is also seen that some time is lost on the conversion of the solution from cells to matrices.
Since the 3rd solution avoids all of these pitfalls, as well as the loops altogether (which mostly means re-evaluation of the script on every iteration) - it unsurprisingly results in a considerable speedup.
Notes:
There was very little difference between the "compressed" and the "explicit" versions of Method 3 in favor of the "explicit" version. For this reason it was not included in the comparison.
A solution was attempted where the inputs to Method 3 were gpuArray-ed. This did not provide improved performance (and even somewhat degradaed them), possibly due to wrong implementation, or the overhead associated with copying matrices back and forth between RAM and VRAM.

Histogram matching of 3D datasets using L2 norm minimisation (Matlab)

I need to perform some elementary histogram matching on 2 sets of 3D data. This is part of a larger algorithm.
My goal is to perform this by minimising the following cost function:
|| cumpdf(f(A)) - cumpdf(B) || .^2
where:
cumpdf is the cumulative histogram
f() is linear transformation a*A + b where a/b are affine coefficients to be
determined
A is the image to be transformed and B is the image to be matched
I am using lsqcurvefit however I have run into some trouble and therefore really need some help.
A(maskA==0)=0;
B(maskB==0)=0;
[na,~] = hist(A(maskA~=0),500);
na = na ./ numel(A(maskA~=0));
x_data = cumsum(na);
[nb,~] = hist(B(maskB~=0),500);
nb = nb ./ numel(B(maskB~=0));
y_data = cumsum(nb);
xo = [1.5 -200];
[coeff,~] = lsqcurvefit(#cost,xo,x_data,y_data);
function F = cost(x,xc)
F = x(1).*A + x(2);
[nc,~] = hist(C(maskA~=0),500);
nc = nc / numel(C(maskA~=0));
xc = cumsum(nc);
Amask and Bmask just represent some indexing I need to do.
My question is: I know that the above is wrong. However, I think it represents best what I want to do, regarding the cost function and the goal. Some help would me much appreciated!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Ridge regression in Matlab produces different results than direct calculation - matlab

Related

Something's wrong with my Logistic Regression?

Optimization of matrix on matlab using fmincon

My example shows SVD is less numerically stable than QR decomposition

Vectorizing the solution of a linear equation system in MATLAB

Histogram matching of 3D datasets using L2 norm minimisation (Matlab)

Categories

Resources