Linear combination of curves to match a single curve with integer constraints - MATLAB

I have a set of vectors (curves) which I would like to match to a single curve. The issue isn't only finding a linear combination of the set of curves that most closely matches the single curve (this can be done with least squares, Ax = B). I need to be able to add constraints, for example limiting the number of curves used in the fit to a particular number, or requiring that the curves used lie next to each other. Constraints of this kind are typical of mixed-integer linear programming (MILP).
I have started by using lsqlin, which allows constraints, and have been able to restrict the variables to be nonnegative, but I am at a loss when it comes to adding further constraints. Is there a way to add integer constraints to least squares, or alternatively is there a way to solve this with a MILP?
Any help in the right direction is much appreciated!
Edit: Based on the suggestion by ErwinKalvelagen, I am attempting to use CPLEX and its quadratic solvers, but so far I have not managed to get it working. I have created a minimal non-working example; the data has been uploaded and the code is below. The issue is that MATLAB's least-squares solver lsqlin is able to solve the problem, whereas CPLEX's cplexlsqnonneglin returns "CPLEX Error 5002: %s is not convex" for the same problem.
function [ ] = minWorkingLSexample( )
%MINWORKINGLSEXAMPLE for LS with matlab and CPLEX
%matlab is able to solve the least squares, CPLEX returns error:
% Error using cplexlsqnonneglin
% CPLEX Error 5002: %s is not convex.
%
%
% Error in Backscatter_Transform_excel2_readMut_LINPROG_CPLEX (line 203)
% cplexlsqnonneglin (C,d);
%
load('C_n_d_2.mat')
lb = zeros(size(C,2),1);
options = optimoptions('lsqlin','Algorithm','trust-region-reflective');
[fact2,resnorm,residual,exitflag,output] = ...
lsqlin(C,d,[],[],[],[],lb,[],[],options);
%% CPLEX
ctype = cellstr(repmat('C',1,size(C,2)));
options = cplexoptimset;
options.Display = 'on';
[fact3, resnorm, residual, exitflag, output] = ...
cplexlsqnonneglin (C,d);
end

I could reproduce the Cplex problem. Here is a workaround. Instead of solving the first model, min ||C*x - d||^2 subject to x >= 0, use a model that is less nonlinear: min ||r||^2 subject to r = C*x - d, x >= 0.
The second model solves fine with Cplex. The problem is somewhat of a tolerance/numeric issue. For the second model we have a much more well-behaved Q matrix (a diagonal). Essentially we moved some of the complexity from the objective into linear constraints.
You should now see something like:
Tried aggregator 1 time.
QP Presolve eliminated 1 rows and 1 columns.
Reduced QP has 401 rows, 443 columns, and 17201 nonzeros.
Reduced QP objective Q matrix has 401 nonzeros.
Presolve time = 0.02 sec. (1.21 ticks)
Parallel mode: using up to 8 threads for barrier.
Number of nonzeros in lower triangle of A*A' = 80200
Using Approximate Minimum Degree ordering
Total time for automatic ordering = 0.00 sec. (3.57 ticks)
Summary statistics for Cholesky factor:
Threads = 8
Rows in Factor = 401
Integer space required = 401
Total non-zeros in factor = 80601
Total FP ops to factor = 21574201
Itn Primal Obj Dual Obj Prim Inf Upper Inf Dual Inf
0 3.3391791e-01 -3.3391791e-01 9.70e+03 0.00e+00 4.20e+04
1 9.6533667e+02 -3.0509942e+03 1.21e-12 0.00e+00 1.71e-11
2 6.4361775e+01 -3.6729243e+02 3.08e-13 0.00e+00 1.71e-11
3 2.2399862e+01 -6.8231454e+01 1.14e-13 0.00e+00 3.75e-12
4 6.8012056e+00 -2.0011575e+01 2.45e-13 0.00e+00 1.04e-12
5 3.3548410e+00 -1.9547176e+00 1.18e-13 0.00e+00 3.55e-13
6 1.9866256e+00 6.0981384e-01 5.55e-13 0.00e+00 1.86e-13
7 1.4271894e+00 1.0119284e+00 2.82e-12 0.00e+00 1.15e-13
8 1.1434804e+00 1.1081026e+00 6.93e-12 0.00e+00 1.09e-13
9 1.1163905e+00 1.1149752e+00 5.89e-12 0.00e+00 1.14e-13
10 1.1153877e+00 1.1153509e+00 2.52e-11 0.00e+00 9.71e-14
11 1.1153611e+00 1.1153602e+00 2.10e-11 0.00e+00 8.69e-14
12 1.1153604e+00 1.1153604e+00 1.10e-11 0.00e+00 8.96e-14
Barrier time = 0.17 sec. (38.31 ticks)
Total time on 8 threads = 0.17 sec. (38.31 ticks)
QP status(1): optimal
Cplex Time: 0.17sec (det. 38.31 ticks)
Optimal solution found.
Objective : 1.115360
See here for some details.
Update: In Matlab this becomes:
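A minimal sketch of the reformulation (minimize r'*r subject to r = C*x - d, x >= 0). It assumes synthetic C and d in place of C_n_d_2.mat, and it uses quadprog from the Optimization Toolbox rather than CPLEX so that it runs without a CPLEX installation. The CPLEX connector's cplexqp should accept the same leading arguments in the same order, but treat that as an assumption to check against its documentation.

```matlab
% Synthetic stand-in for C_n_d_2.mat (assumption: any C and d of matching size).
rng(0);
m = 20; n = 5;
C = rand(m, n);
d = rand(m, 1);

% Stack the unknowns as z = [x; r]. The objective r'*r = 0.5*z'*H*z only
% involves r, so H is diagonal: zeros for x, 2 for each residual.
H = blkdiag(zeros(n), 2*eye(m));
f = zeros(n + m, 1);

% Linear equality r = C*x - d, rewritten as C*x - r = d.
Aeq = [C, -eye(m)];
beq = d;

% x >= 0, r free.
lb = [zeros(n, 1); -inf(m, 1)];
ub = [];

opts = optimoptions('quadprog', 'Display', 'off');
z = quadprog(H, f, [], [], Aeq, beq, lb, ub, [], opts);
x = z(1:n);
r = z(n+1:end);

% Reference: direct nonnegative least squares on the same data.
xref = lsqnonneg(C, d);
fprintf('max |x - xref| = %.2e, resnorm = %.6f\n', max(abs(x - xref)), r'*r);
```

The two solutions should agree to solver tolerance; the only change is that the dense C'*C no longer appears in the objective.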


The matlab FMINCON solution terminating at upper bound

I am running an optimisation algorithm using FMINCON SQP to optimise a problem of 8 variables. Following are the parameters:
g = [g1 g2 g3 g4 g5 g6 g7 g8] %my variables
lb = [0 0 0 0 0 0 0 0];
ub = [1 1 1 1 1 1 1 1];
The minimisation of Objective function is defined as:
sse = sum((test results - calculated)).^2;
subject to linear constraints:
A = [1,1,1,1,1,1,1,1];
b = 1;
ceq = [];
The aim of the optimisation process is to fit a curve as shown in the fig.
The blue line represents test results and the magenta is the curve obtained from the optimised variables.
Problem: Although I am able to fit the curve as required, the optimised variables end up with values in the upper bound. I have implemented the following code for fmincon:
options = optimset('Display', 'iter','Algorithm','sqp', 'TolX',1e-10, 'TolFun', 1e-20,'MaxIter', 10000000000,'MaxFunEvals', 10000000000);
[y,fval,exitflag,output] = fmincon('objective', ginit, A,b,[],[],lb,ub,[], options);
Could anybody please advise me on a robust method to keep the optimisation from terminating at the upper bound? Could you please also let me know what might be the reason behind this issue?
Besides the very accurate comment from max, are you positive that you mean this:
sse = sum((test results - calculated).^2);
Instead of this:
sse = sum((test results - calculated)).^2;
?
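To see why the position of the closing parenthesis matters, here is a small sketch with made-up residuals (the vector e is hypothetical and stands for test results minus calculated values):

```matlab
% Hypothetical residuals: two positive and two negative errors.
e = [1 -1 2 -2];

squareOfSum  = sum(e).^2;   % residuals are summed first, so they cancel
sumOfSquares = sum(e.^2);   % each residual is squared before summing

fprintf('sum(e).^2 = %g, sum(e.^2) = %g\n', squareOfSum, sumOfSquares);
```

With the first form, positive and negative errors cancel, so a poor fit can still score zero. Only the second form is a sum of squared errors.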

Truncating Poisson distribution on desired support in Matlab

I want to construct a 3-dimensional Poisson distribution in Matlab with lambda parameters [0.4, 0.2, 0.6] and I want to truncate it to have support in [0;1;2;3;4;5]. The 3 components are independent.
This is what I do
clear
n=3; %number components of the distribution
supp_marginal=0:1:5;
suppsize_marginal=size(supp_marginal,2);
supp_temp=repmat(supp_marginal.',1,n);
supp_temp_cell=num2cell(supp_temp,1);
output_temp_cell=cell(1,n);
[output_temp_cell{:}] = ndgrid(supp_temp_cell{:});
supp=zeros(suppsize_marginal^n,n);
for h=1:n
temp=output_temp_cell{h};
supp(:,h)=temp(:);
end
suppsize=size(supp,1);
lambda_1=0.4;
lambda_2=0.2;
lambda_3=0.6;
pr_mass=zeros(suppsize,1);
for j=1:suppsize
pr_mass(j)=(poisspdf(supp(j,1),lambda_1).*...
poisspdf(supp(j,2),lambda_2).*...
poisspdf(supp(j,3),lambda_3))/...
sum(poisspdf(supp(:,1),lambda_1).*...
poisspdf(supp(:,2),lambda_2).*...
poisspdf(supp(j,3),lambda_3));
end
When I compute the mean of the obtained distribution, I get lambda_1 and lambda_2 but not lambda_3.
lambda_empirical=sum(supp.*repmat(pr_mass,1,3));
Question: why do I not get lambda_3?
tl;dr: Truncation changes the distribution so different means are expected.
This is expected, as truncation itself changes the distribution and therefore shifts the mean. You can see this from the experiment below. Notice that for your chosen parameters, this just starts to become noticeable around lambda = 0.6.
Similar to the wiki page, this illustrates the difference between E[X] (expectation of X without truncation; fancy word for mean) and E[ X | LB ≤ X ≤ UB] (expectation of X given it is on interval [LB,UB]). This conditional expectation implies a different distribution than the unconditional distribution of X (~Poisson(lambda)).
% MATLAB R2018b
% Setup
LB = 0; % lowerbound
UB = 5; % upperbound
% Simple test to compare theoretical means with and without truncation
TestLam = 0.2:0.01:1.5;
Gap = zeros(size(TestLam(:)));
for jj = 1:length(TestLam)
TrueMean = mean(makedist('Poisson','Lambda',TestLam(jj)));
TruncatedMean = mean(truncate(makedist('Poisson','Lambda',TestLam(jj)),LB,UB));
Gap(jj) = TrueMean-TruncatedMean;
end
plot(TestLam,Gap)
Notice the gap with these truncation bounds and a lambda of 0.6 is still small and is negligible as lambda approaches zero.
lam = 0.6; % <---- try different values (must be greater than 0)
pd = makedist('Poisson','Lambda',lam)
pdt = truncate(pd,LB,UB)
mean(pd) % 0.6
mean(pdt) % 0.5998
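The same truncated means can be checked directly from the pmf, without distribution objects. This is a minimal sketch, assuming the support 0..5 and the three lambdas from the question; it uses the closed-form Poisson pmf and implicit expansion (R2016b or later). It may also be worth checking the normalising sum in the question: its last factor indexes supp(j,3) rather than supp(:,3), so the third factor cancels between numerator and denominator. The sketch below renormalises each marginal over the whole support instead.

```matlab
% Truncated Poisson means on the support 0..5, computed from the pmf.
k      = (0:5).';              % truncated support (column)
lambda = [0.4, 0.2, 0.6];      % parameters from the question (row)

% Column j holds the pmf of component j, renormalised on the support.
pmf = exp(-lambda) .* lambda.^k ./ factorial(k);
pmf = pmf ./ sum(pmf, 1);

% The components are independent, so the joint mass is the product of the
% marginals and each marginal mean can be computed separately.
truncMean = sum(k .* pmf, 1);
disp(truncMean)
```

Each entry of truncMean should sit slightly below the corresponding lambda, with the largest gap for lambda = 0.6.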
Other Resources:
1. Wiki for Truncated Distributions
2. What is a Truncated Distribution
3. MATLAB documentation for truncate(), makedist()
4. MATLAB: Working with Probability Distribution (Objects)

Translating chemical equations from article, results differ (Matlab)

I've been trying to translate a set of chemical equations to MATLAB code, to be able to solve for different chemical species. I have the approximate solution (as it's from a graph) but after entering all the data and checking multiple times I still haven't been able to find what is wrong. I'm wondering what is going wrong and if anyone could please help me out. The source for the graph/equation is the article at this link: The chemistry of co-injected BOE. The graph I want to reproduce later on is figure 2 in the paper, see the image below:
Now the results I get for 10cc, 40cc and 90cc are respectively:
HF 43%, H2F2 48%, F- 3%, HF2- 6% in comparison ~28%, 63%, 2%, 7% (10cc).
HF 35%, H2F2 33%, F- 14%, HF2- 18% in comparison ~24%, 44%, 6%, 26% (40cc).
HF 21%, H2F2 12%, F- 37%, HF2- 30% in comparison ~18%, 23%, 20%, 45% (90cc).
The script is the following:
clc;
clear all;
%Units to be used
%Volume is in CC also cm^3, 1 litre is 1000 CC, 1 cc = 1 ml
%density is in g/cm^3
%weight percentages are in fractions of 0 to 1
%Molecular weight is in g/mol
% pts=10; %number of points for linear spacing
%weight percentages of NH4OH and HF
xhf=0.49;
xnh3=0.28;
%H2O
Vh2o=1800;
dh2o=1.00; %0.997 at 25C when rounded 1
mh2o=18.02;
%HF values
Vhf=100;
dhf49=1.15;
dhf=dh2o+(dhf49-dh2o)*xhf/0.49; %# 25C
Mhf=20.01;
nhf=mols(Vhf,dhf,xhf,Mhf);
%NH4OH (NH3) values
% Vnh3=linspace(0.1*Vhf,1.9*Vhf,pts);
Vnh3=10;
dnh3=0.9; %for ~20-31% #~20-25C
Mnh3=17.03; %The wt% of NH4OH actually refers to the wt% of NH3 dissolved in H2O
nnh3=mols(Vnh3,dnh3,xnh3,Mnh3);
if max(nnh3)>=nhf
error(['There are more mols NH4OH,',num2str(max(nnh3)),', than mols HF,',num2str(nhf),'.'])
end
%% Calculations for species
Vt=(Vhf+Vh2o+Vnh3)/1000; %litre
A=nhf/Vt; %mol/l
B=nnh3/Vt; %mol/l
syms HF F H2F2 HF2 NH3 NH4 H OH
eq2= H*F/HF==6.85*10^(-4);
eq3= NH3*H/NH4==6.31*10^(-10);
eq4= H*OH==10^(-14);
eq5= HF2/(HF*F)==3.963;
eq6= H2F2/(HF^2)==2.7;
eq7= H+NH4==OH+F+HF2;
eq8= HF+F+2*H2F2+2*HF2==A;
eq9= NH3+NH4==B;
eqns=[eq2,eq5,eq6,eq8,eq4,eq3,eq9,eq7];
varias=[HF, F, H2F2, HF2, NH3, NH4, H, OH];
assume(HF> 0 & F>= 0 & H2F2>= 0 & HF2>= 0& NH3>= 0 & NH4>= 0 & H>= 0 & OH>= 0)
[HF, F, H2F2, HF2, NH3, NH4, H, OH]=vpasolve(eqns,varias);% [0 max([A,B])])
totalHF=double(HF)+double(F)+double(H2F2)+double(HF2);
HFf=double(HF)/totalHF %fraction of species for HF
H2F2f=double(H2F2)/totalHF %fraction of species for H2F2
Ff=double(F)/totalHF %fraction of species for F-
HF2f=double(HF2)/totalHF %fraction of species for HF2-
An extra function is needed, called mols.m:
%%%% amount of mol, Vol=volume, d=density, pwt=%weight, M=molecularweight
function mol=mols(Vol, d, pwt, M)
mol=(Vol*d*pwt)/M;
end
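As a quick sanity check of the units, the mole amounts and total concentrations that feed the equilibrium solve can be evaluated on their own. This is a minimal sketch using the values from the script for the 10 cc case; mols is redefined inline as an anonymous function only so that the snippet is self-contained.

```matlab
% Inline copy of mols.m so the snippet runs on its own.
mols = @(Vol, d, pwt, M) (Vol*d*pwt)/M;

nhf  = mols(100, 1.15, 0.49, 20.01);   % 100 cc of 49 wt% HF
nnh3 = mols(10,  0.90, 0.28, 17.03);   % 10 cc of 28 wt% NH3 solution

Vt = (100 + 1800 + 10)/1000;           % total volume in litres
A  = nhf/Vt;                           % total fluoride, mol/l
B  = nnh3/Vt;                          % total ammonia, mol/l
fprintf('nhf = %.4f mol, nnh3 = %.4f mol, A = %.4f mol/l, B = %.4f mol/l\n', ...
        nhf, nnh3, A, B);
```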
The equations being used from the article are in the image below:
(HF)2 is H2F2 in my script
So it appears the issue wasn't so much with MATLAB, although I had some help in that area as well.
Final solution and updated Matlab code can be found here:
https://chemistry.stackexchange.com/questions/98306/why-do-my-equilibrium-calculations-on-this-hf-nh4oh-buffer-system-not-match-thos

Getting rank deficient warning when using regress function in MATLAB

I have a dataset comprising 30 independent variables, and I tried performing linear regression in MATLAB R2010b using the regress function.
I get a warning stating that my matrix X is rank deficient to within machine precision.
The coefficients I get after executing this function don't match the experimental ones.
Can anyone please suggest how to perform the regression analysis for this equation, which comprises 30 variables?
Going with our discussion, the reason why you are getting that warning is because you have what is known as an underdetermined system. Basically, you have a set of constraints where you have more variables that you want to solve for than the data that is available. One example of an underdetermined system is something like:
x + y + z = 1
x + y + 2z = 3
There are an infinite number of combinations of (x,y,z) that can solve the above system. For example, (x, y, z) = (1, −2, 2), (2, −3, 2), and (3, −4, 2). What rank deficient means in your case is that there is more than one set of regression coefficients that would satisfy the governing equation that would describe the relationship between your input variables and output observations. This is probably why the output of regress isn't matching up with your ground truth regression coefficients. Though it isn't the same answer, do know that the output is one possible answer. By running through regress with your data, this is what I get if I define your observation matrix to be X and your output vector to be Y:
>> format long g;
>> B = regress(Y, X);
>> B
B =
0
0
28321.7264417536
0
35241.9719076362
899.386999172398
-95491.6154990829
-2879.96318251964
-31375.7038251919
5993.52959752106
0
18312.6649115112
0
0
8031.4391233753
27923.2569044728
7716.51932560781
-13621.1638587172
36721.8387047613
80622.0849069525
-114048.707780113
-70838.6034825939
-22843.7931997405
5345.06937207617
0
106542.307940305
-14178.0346010715
-20506.8096166108
-2498.51437396558
6783.3107243113
You can see that there are seven regression coefficients that are equal to 0, which corresponds to 30 - 23 = 7. We have 30 variables and 23 constraints to work with. Be advised that this is not the only possible solution. regress computes a least-squares solution, that is, a B that minimises the sum of squared residuals of Y - X*B. Another such solution can be written as:
B = X^(*)*Y
X^(*) is what is known as the pseudo-inverse of the matrix. MATLAB has this available, and it is called pinv. Therefore, if we did:
B = pinv(X)*Y
We get:
B =
44741.6923363563
32972.479220139
-31055.2846404536
-22897.9685877566
28888.7558524005
1146.70695371731
-4002.86163441217
9161.6908044046
-22704.9986509788
5526.10730457192
9161.69080479427
2607.08283489226
2591.21062004404
-31631.9969765197
-5357.85253691504
6025.47661106009
5519.89341411127
-7356.00479046122
-15411.5144034056
49827.6984426955
-26352.0537850382
-11144.2988973666
-14835.9087945295
-121.889618144655
-32355.2405829636
53712.1245333841
-1941.40823106236
-10929.3953469692
-3817.40117809984
2732.64066796307
You see that there are no zero coefficients because pinv returns the minimum L2-norm solution, which spreads the weight across all of the coefficients rather than setting some of them to zero. You can verify that these are valid regression coefficients by doing:
>> Y2 = X*B
Y2 =
16.1491563400241
16.1264219600856
16.525331600049
17.3170318001845
16.7481541301999
17.3266932502295
16.5465094100486
16.5184456100487
16.8428701100165
17.0749421099829
16.7393450000517
17.2993993099419
17.3925811702017
17.3347117202356
17.3362798302375
17.3184486799219
17.1169638102517
17.2813552099096
16.8792925100727
17.2557945601102
17.501873690151
17.6490477001922
17.7733493802508
Similarly, if we used the regression coefficients from regress, so B = regress(Y,X); then doing Y2 = X*B, we get:
Y2 =
16.1491563399927
16.1264219599996
16.5253315999987
17.3170317999969
16.7481541299967
17.3266932499992
16.5465094099978
16.5184456099983
16.8428701099975
17.0749421099985
16.7393449999981
17.2993993099983
17.3925811699993
17.3347117199991
17.3362798299967
17.3184486799987
17.1169638100025
17.281355209999
16.8792925099983
17.2557945599979
17.5018736899983
17.6490476999977
17.7733493799981
There are some slight computational differences, which is to be expected. Similarly, we can also find the answer by using mldivide:
B = X \ Y
B =
0
0
28321.726441712
0
35241.9719075889
899.386999170666
-95491.6154989513
-2879.96318251572
-31375.7038251485
5993.52959751295
0
18312.6649114859
0
0
8031.43912336425
27923.2569044349
7716.51932559712
-13621.1638586983
36721.8387047123
80622.0849068411
-114048.707779954
-70838.6034824987
-22843.7931997086
5345.06937206919
0
106542.307940158
-14178.0346010521
-20506.8096165825
-2498.51437396236
6783.31072430201
You can see that this curiously matches up with what regress gives you. That's because \ is a smarter operator. Depending on how your matrix is structured, it finds the solution to the system by a different method. I'd like to refer you to the post by Amro that talks about what algorithms mldivide uses when examining the properties of the input matrix being operated on:
How to implement Matlab's mldivide (a.k.a. the backslash operator "\")
What you should take away from this answer is that you can certainly go ahead and use those regression coefficients and they will more or less give you the expected output for each value of Y with each set of inputs for X. However, be warned that those coefficients are not unique. This is apparent as you said that you have ground truth coefficients that don't match up with the output of regress. It isn't matching up because it generated another answer that satisfies the constraints you have provided.
There is more than one answer that can describe that relationship if you have an underdetermined system, as you have seen by my experiments shown above.
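The non-uniqueness can be reproduced on the small two-equation system used earlier in this answer. This is a minimal sketch; the variable names are illustrative, and the factor 3 applied to the null-space vector is arbitrary.

```matlab
% The 2-equation, 3-unknown system: x + y + z = 1, x + y + 2z = 3.
A = [1 1 1; 1 1 2];
b = [1; 3];

xBasic   = A \ b;          % solution returned by mldivide
xMinNorm = pinv(A) * b;    % minimum 2-norm solution

% Any multiple of a null-space vector can be added without changing A*x.
N      = null(A);
xOther = xMinNorm + 3*N(:, 1);

disp([xBasic, xMinNorm, xOther])
disp([norm(A*xBasic - b), norm(A*xMinNorm - b), norm(A*xOther - b)])
```

All three columns reproduce b to rounding error even though the coefficient vectors differ, which is the same situation as with regress, pinv and \ on the 30-variable data.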

fmincon does not satisfy nonlinear constraints

I am trying to minimize a function handle with respect to a vector of parameters beta0. My function uses the built-in mvncdf function, which requires a positive definite covariance matrix. This matrix is computed from part of the parameter vector. There is also a constraint that the absolute value of some parameters must be less than one.
I pass constraints to fmincon in two ways: upper and lower bounds set to the required values, and the following nonlinear constraint:
function [c,ceq] = pos_def(beta0)
rho_12 = beta0(end-2,1);
rho_13 = beta0(end-1,1);
rho_23 = beta0(end,1);
sigma111=[1 rho_12 rho_13; rho_12 1 rho_23; rho_13 rho_23 1];
sigma110=[1 rho_12 -rho_13; rho_12 1 -rho_23; -rho_13 -rho_23 1];
sigma101=[1 -rho_12 rho_13; -rho_12 1 -rho_23; rho_13 -rho_23 1];
sigma100=[1 -rho_12 -rho_13; -rho_12 1 rho_23; -rho_13 rho_23 1];
eig111 = eig(sigma111);
eig110 = eig(sigma110);
eig101 = eig(sigma101);
eig100 = eig(sigma100);
c = vertcat(-eig111,-eig110,-eig101,-eig100);
ceq = []; % no nonlinear equality constraints
end
As all matrices are square and symmetric by construction, I use the signs of the eigenvalues as a proxy for positive definiteness.
The optimization problem looks like:
opts = optimset ('Display','iter','TolX',1e-15,'TolFun',1e-15,...
'Algorithm','interior-point','MaxIter',100000,'MaxFunEvals',1000000);
xc3_3=fmincon(model, beta,[],[],[],[],lb,ub,@pos_def, opts)
But during estimation fmincon aborts with the error:
Error using mvncdf (line 193) SIGMA must be a square, symmetric, positive definite matrix.
In debugging mode I can see that after two iterations MATLAB tries to evaluate a beta0 which does not satisfy my nonlinear constraints,
beta0 =
-46.9208
33.2916
-2.1797
-46.4251
3.8337
-0.3066
6.1213
-20.9480
-1.7760
-0.1807
1.3950
4.5348
-0.9838
0.2600
-6.9887
-24.6157
-0.0112
-0.9923
-0.9284
0.7664
0.3062
And the constraint c < 0 is not satisfied:
c =
0.3646
-1.2998
-2.0648
0.3646
-1.2998
-2.0648
0.3646
-1.2998
-2.0648
0.3646
-1.2998
-2.0648
I do not understand why this optimization tool is trying to find a solution in the prohibited region, or how to avoid this problem. Alternatively, how can I express the positive-definiteness constraint in a linear way?
The optimizer is just evaluating points to see if they are feasible directions to move in or not. Within your model you should tell it that a particular direction is not a good one. The pseudo-code would look something like
GetEigvalues
if (positive definite) then
Do what you really want to happen
else
Return a large number
end
or alternatively
try
Do what you really want to happen
catch
Return a large number
end
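The first variant of the pseudo-code could be sketched in MATLAB as follows. This is only an illustration under stated assumptions: guardedObjective, the stand-in model and the penalty value 1e10 are all hypothetical, and chol is used for the positive-definiteness test instead of eig.

```matlab
% Stand-in objective (hypothetical); the real one would call mvncdf.
model = @(b) sum(b.^2);

bBad  = [0; -0.9284; 0.7664; 0.3062];   % correlations from the failing iterate
bGood = [0;  0.1;    0.2;    0.3];

fBad  = guardedObjective(bBad,  model);
fGood = guardedObjective(bGood, model);
fprintf('fBad = %g, fGood = %g\n', fBad, fGood);

function f = guardedObjective(beta0, model)
    % Build the correlation matrix from the last three parameters.
    r = beta0(end-2:end);
    sigma = [1 r(1) r(2); r(1) 1 r(3); r(2) r(3) 1];
    % chol returns p == 0 only when sigma is positive definite. The other
    % three matrices in the question equal D*sigma*D with D = diag(+-1),
    % so they share sigma's eigenvalues and one check covers all four.
    [~, p] = chol(sigma);
    if p == 0
        f = model(beta0);   % safe to evaluate the real objective
    else
        f = 1e10;           % large penalty steers the optimizer away
    end
end
```

To use it, pass @(b) guardedObjective(b, model) to fmincon in place of model, so that mvncdf is never reached with an invalid matrix.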