How do I use scipy.optimize to minimize a set of functions? - scipy

[EDIT: The fmin() method is a good choice for my problem. However, the real issue was that one of my axes was a sum of the other axes, and I wasn't recalculating the y axis after applying the multiplier. As a result, my objective function always returned the same value, which gave fmin no direction, so its chosen multipliers stayed very close together. Once the calculations in my objective function were corrected, fmin explored a much larger range.]
I have two datasets to which I want to apply multipliers, to see what values could 'improve' their correlation coefficients.
For example, say data set 1 has a correlation coefficient of -0.6 and data set 2 has 0.5.
I can apply different multipliers to each of these data sets that might improve the coefficients, and I would like to find the set of multipliers for these two data sets that optimizes the correlation coefficient of each set.
I have written an objective function that takes a list of multipliers, applies them to the data sets, calculates the correlation coefficient (scipy.stats.spearmanr()), and sums these coefficients. So I need to use something from scipy.optimize to pass a set of multipliers to this function and find the set that optimizes this sum.
I have tried using optimize.fmin and several others. However, I want the optimization technique to use a much larger range of multipliers. For example, my data sets might have values in the millions, but fmin will only choose multipliers around 1.0, 1.05, etc. This isn't a big enough value to modify these correlation coefficients in any meaningful way.
Here is some sample code of my objective function:
def objective_func(multipliers):
    coeffs = []
    for multiplier in multipliers:
        for data_set in data_sets():
            x_vals = getDataSetXValues()
            y_vals = getDataSetYValues()
            x_vals *= multiplier
            coeffs.append(scipy.stats.spearmanr(x_vals, y_vals))
    return -1 * sum(coeffs)
I'm using -1 because I actually want the biggest value, but fmin is for minimization.
Here is a sample of how I'm trying to use fmin:
print optimize.fmin(objective_func, x0)  # x0 = my starting multipliers (all 1.0)
The multipliers start at 1.0 and only move to values such as 1.05, 1.0625, etc. I can see in the actual fmin code where these values are chosen. I ultimately need another method to call that lets me give the minimization an explicit range of values to check, not values all so closely related.
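For reference, scipy.optimize does include a routine that scans an explicit, user-supplied range rather than stepping near a starting point: optimize.brute. Below is a rough sketch of its mechanics using a made-up toy objective (not the real one above); note, though, that as the answer below points out, rescaling alone will not move a Spearman coefficient.

from scipy import optimize

# Toy stand-in for the real objective, just to show the mechanics of brute():
# the minimum sits at large parameter values that fmin's default steps near 1.0
# would never reach.
def toy_objective(multipliers):
    a, b = multipliers
    return (a - 3e5) ** 2 + (b - 7e5) ** 2

# One slice per parameter: scan 1 .. 1e6 in steps of 1e4, then polish the best
# grid point with fmin (the default 'finish' step).
ranges = (slice(1.0, 1e6, 1e4), slice(1.0, 1e6, 1e4))
best = optimize.brute(toy_objective, ranges, finish=optimize.fmin)
print(best)   # roughly [3e5, 7e5]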

Multiplying the x data by some factor won't really change the Spearman rank-order correlation coefficient, though.
>>> import numpy
>>> import scipy.stats
>>> x = numpy.random.uniform(-10,10,size=(20))
>>> y = numpy.random.uniform(-10,10,size=(20))
>>> scipy.stats.spearmanr(x,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*10,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*1e6,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*1e-16,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*(-2),y)
(0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*(-2e6),y)
(0.24661654135338346, 0.29455199407204263)
(The second term in the tuple is the p value.)
You can change its sign, if you flip the signs of the terms, but the whole point of Spearman correlation is that it tells you the degree to which any monotonic relationship would capture the association. Probably that explains why fmin isn't changing the multiplier much: it's not getting any feedback on direction, because the returned value is constant.
So I don't see how what you're trying to do can work.
I'm also not sure why you've chosen the sum of all the Spearman coefficients and the p values as the quantity you're trying to maximize: the Spearman coefficients can be negative, so you probably want to square them, and you haven't mentioned the p values, so I'm not sure why you're throwing them in.
[It's possible I guess that we're working with different scipy versions and our spearmanr functions return different things. I've got 0.9.0.]

You probably don't want to optimize the sum of the coefficients but rather the sum of their squares. Also, if the multipliers can be chosen independently, why are you trying to optimize them all at the same time? Can you post your current code and some sample data?
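For concreteness, a rough sketch of such an objective in Python. The data_sets structure here, a list of (x_vals, y_vals) NumPy-array pairs with one multiplier per pair, is an assumption standing in for however you actually load the data; and per the discussion above, y_vals has to be recomputed from the scaled data for the multiplier to change anything.

import scipy.stats

def neg_sum_sq_spearman(multipliers, data_sets):
    # Negative sum of squared Spearman coefficients over all data sets, so that
    # a minimizer such as fmin maximizes the squared correlations.
    total = 0.0
    for multiplier, (x_vals, y_vals) in zip(multipliers, data_sets):
        # NOTE: as discussed above, y_vals must actually be recomputed from the
        # scaled x values for the multiplier to have any effect on rho at all.
        rho, p_value = scipy.stats.spearmanr(x_vals * multiplier, y_vals)
        total += rho ** 2
    return -total

# Usage sketch, starting every multiplier at 1.0:
# scipy.optimize.fmin(neg_sum_sq_spearman, [1.0] * len(data_sets), args=(data_sets,))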

Related

MATLAB - negative values go to NaN in a symmetric function

Can someone please explain why the following symmetric function cannot be evaluated past a certain limit of negative values?
D = 0.1; l = 4;
c = @(x,v) (v/D).*exp(-v*x/D)./(1-exp(-v*l/D));
v_vec = -25:0.01:25;
figure(2)
hold on
plot(v_vec,c(l,v_vec),'b')
plot(v_vec,c(0,v_vec),'r')
Notice in the figure where the blue line cuts off; this is where I get inf/NaN values.
It seems that Matlab is trying to compute a result that is too large, outputs +inf, and then operates on that, which yields +/- inf and NaNs.
For instance, at v=-25, part of the function computes exp(-(-25)*4/0.1), which is exp(1000), and that overflows to +inf (larger than the largest representable double-precision float).
You can potentially solve that problem by rewriting your function to avoid operating on such very large (or very small) numbers, say by reorganising the fraction containing the exp() terms.
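For illustration (in Python/NumPy rather than MATLAB, using the constants from the question), one such reorganisation multiplies the fraction through by exp(v*l/D) whenever v is negative, so exp() never sees a large positive argument:

import numpy as np

D, l = 0.1, 4.0   # constants from the question

def c_naive(x, v):
    # Direct translation of the anonymous function from the question;
    # for strongly negative v, exp(-v*l/D) overflows to inf and the result is NaN.
    return (v / D) * np.exp(-v * x / D) / (1 - np.exp(-v * l / D))

def c_stable(x, v):
    # Same fraction, rearranged per the sign of v so that (for 0 <= x <= l)
    # every exp() argument is <= 0 and nothing overflows.
    v = np.asarray(v, dtype=float)
    out = np.empty_like(v)
    pos = v >= 0
    out[pos] = (v[pos] / D) * np.exp(-v[pos] * x / D) / (1 - np.exp(-v[pos] * l / D))
    # For v < 0, multiply numerator and denominator by exp(v*l/D):
    out[~pos] = (v[~pos] / D) * np.exp(v[~pos] * (l - x) / D) / (np.exp(v[~pos] * l / D) - 1)
    return out

v_neg = np.array([-25.0, -20.0])
print(c_naive(l, v_neg))    # [nan nan]: inf/inf after the overflow
print(c_stable(l, v_neg))   # [250. 200.]: the finite values the formula should give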
I encountered the same hurdle with exp() arguments that trigger overflow, and it can be difficult to trace numerical imprecision or convergence errors back to it. In principle the exp()-based definition only causes intermediate issues, since its purpose is to serve as a transition function; the intention, I guess, was to provide a continuous function.
My solution to this kind of problem is to divide the argument range into regions and provide an approximation function in each region: in your case, zero for negative x and proportional to x for positive x, with the original function in between. Care should be taken to match the approximations at the region borders, and to match as many continuous derivatives as possible, which is important for convergence in loops.

Mixture of 1D Gaussians fit to data in Matlab / Python

I have a discrete curve y=f(x). I know the locations and amplitudes of its peaks. I want to approximate the curve by fitting a Gaussian at each peak. How should I go about finding the optimal Gaussian parameters? I would like to know if there is any built-in function which will make my task simpler.
Edit
I have fixed the means of the Gaussians and tried to optimize over sigma using lsqcurvefit() in MATLAB. The MSE is low. However, I have an additional hard constraint: the value of the approximate curve should equal the original function at the peaks. My current model does not satisfy this constraint. I am pasting my current working code here. I would like a solution which obeys the hard constraint at the peaks and approximately fits the curve at other points. The basic idea is that the approximate curve has fewer parameters but still closely resembles the original curve.
fun = @(x,xdata)myFun(x,xdata,pks,locs); % pks, locs are the peak amplitudes and locations, already available
x0=w(1:6)*0.25; % my initial guess based on domain knowledge
[sigma resnorm] = lsqcurvefit(fun,x0,xdata,ydata); %xdata and ydata are the original curve data points
recons = myFun(sigma,xdata,pks,locs);
figure;plot(ydata,'r');hold on;plot(recons);
function f=myFun(sigma,xdata,a,c)
% a is constant , c is mean of individual gaussians
f=zeros(size(xdata));
for i = 1:6 %use 6 gaussians to approximate function
f = f + a(i) * exp(-(xdata-c(i)).^2 ./ (2*sigma(i)^2));
end
end
If you know your peak locations and amplitudes, then all you have left to do is find the width of each Gaussian. You can think of this as an optimization problem.
Say you have x and y, which are samples from the curve you want to approximate.
First, define a function g() that will construct the approximation for given values of the widths. g() takes a parameter vector sigma containing the width of each Gaussian. The locations and amplitudes of the Gaussians will be constrained to the values you already know. g() outputs the value of the sum-of-gaussians approximation at each point in x.
Now, define a loss function L(), which takes sigma as input. L(sigma) returns a scalar that measures the error--how badly the given approximation (using sigma) differs from the curve you're trying to approximate. The squared error is a common loss function for curve fitting:
L(sigma) = sum((y - g(sigma)) .^ 2)
The task now is to search over possible values of sigma, and find the choice that minimizes the error. This can be done using a variety of optimization routines.
If you have the Mathworks optimization toolbox, you can use the function lsqnonlin() (in this case you won't have to define L() yourself). The curve fitting toolbox is probably an alternative. Otherwise, you can use an open source optimization routine (check out cvxopt).
A couple things to note. You need to impose the constraint that all values in sigma are greater than zero. You can tell the optimization algorithm about this constraint. Also, you'll need to specify an initial guess for the parameters (i.e. sigma). In this case, you could probably choose something reasonable by looking at the curve in the vicinity of each peak. It may be the case (when the loss function is nonconvex) that the final solution is different, depending on the initial guess (i.e. you converge to a local minimum). There are many fancy techniques for dealing with this kind of situation, but a simple thing to do is to just try with multiple different initial guesses, and pick the best result.
Edited to add:
In python, you can use optimization routines in the scipy.optimize module, e.g. curve_fit().
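For example, a rough sketch with curve_fit, assuming pks (amplitudes), locs (means), xdata and ydata are already available as in the question's MATLAB code, and using an arbitrary initial width of 0.25:

import numpy as np
from scipy.optimize import curve_fit

def sum_of_gaussians(x, *sigmas):
    # The amplitudes (pks) and means (locs) are held fixed; only the widths vary.
    f = np.zeros_like(x, dtype=float)
    for a, c, s in zip(pks, locs, sigmas):
        f += a * np.exp(-(x - c) ** 2 / (2.0 * s ** 2))
    return f

sigma0 = np.full(len(pks), 0.25)               # rough initial guess for the widths
sigmas, _ = curve_fit(sum_of_gaussians, xdata, ydata, p0=sigma0,
                      bounds=(1e-6, np.inf))   # keep every width strictly positive
recons = sum_of_gaussians(xdata, *sigmas)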
Edit 2 (response to edited question):
If your Gaussians have much overlap with each other, then taking their sum may cause the height of the peaks to differ from your known values. In this case, you could take a weighted sum, and treat the weights as another parameter to optimize.
If you want the peak heights to be exactly equal to some specified values, you can enforce this constraint in the optimization problem. lsqcurvefit() won't be able to do it because it only handles bound constraints on the parameters. Take a look at fmincon().
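In Python, a rough counterpart of that constrained fit is scipy.optimize.minimize with SLSQP standing in for fmincon: treat both the weights and the widths as parameters, and add an equality constraint forcing the reconstruction through the known peaks (pks, locs, xdata, ydata again assumed available from the question):

import numpy as np
from scipy.optimize import minimize

n = len(locs)

def reconstruct(params, x):
    # params = [w_1..w_n, sigma_1..sigma_n]; the means stay fixed at locs.
    w, sig = params[:n], params[n:]
    return sum(wi * np.exp(-(np.asarray(x, float) - ci) ** 2 / (2.0 * si ** 2))
               for wi, ci, si in zip(w, locs, sig))

def sq_error(params):
    return np.sum((np.asarray(ydata, float) - reconstruct(params, xdata)) ** 2)

x0 = np.concatenate([np.asarray(pks, float), np.full(n, 0.25)])    # rough start
constraints = {'type': 'eq',
               'fun': lambda p: reconstruct(p, locs) - np.asarray(pks, float)}
bounds = [(None, None)] * n + [(1e-6, None)] * n                    # keep widths positive
result = minimize(sq_error, x0, method='SLSQP', bounds=bounds, constraints=constraints)
weights, sigmas = result.x[:n], result.x[n:]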
You can use the Expectation–Maximization algorithm to fit a mixture of Gaussians to your data; it doesn't care about the data dimension.
In the MATLAB documentation you can look up gmdistribution.fit or fitgmdist.
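For reference, a Python counterpart is sklearn.mixture.GaussianMixture, which also fits by EM; note that, like gmdistribution.fit, it fits a density to raw samples, which is a different problem from fitting widths to an already-known curve. A toy sketch:

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data: samples drawn from two normal components.
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0.0, 1.0, 500),
                          rng.normal(5.0, 2.0, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2).fit(samples)
print(gmm.means_.ravel())         # estimated component means
print(gmm.covariances_.ravel())   # estimated component variances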

Normalize in Adaboost without numerical error - Matlab

I'm implementing AdaBoost on Matlab. This algorithm requires that in every iteration the weights of each data point in the training set sum up to one.
If I simply use the normalization v = v / sum(v) I get a vector whose 1-norm is 1 up to some numerical error, which later leads to the failure of the algorithm.
Is there a MATLAB function for normalizing a vector so that its 1-norm is EXACTLY 1?
Assuming you want identical values to be normalised with the same factor, this is not possible. Simple counter example:
v=ones(21,1);
v = v / sum(v);
sum(v)-1
One common way to deal with it is to enforce sum(v)>=1 or sum(v)<=1, if your algorithm can cope with a deviation to one side:
if sum(v)>1
v=v-eps(v)
end
Alternatively you can try using vpa, but this will drastically increase your computation time.

Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower quarter of the 'S'-shaped curve that is typical of logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) (or some appropriate input range) and ii = 1:length(loopVal).
The stats parameter has a great correlation coefficient (0.9973), but the p values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB document states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is this only compared to dev values from other models?
It sounds like your data may be linearly separable. In short, that means that since your input data is one-dimensional, there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional this means you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form: y = 1 / (1 + exp(-(b0 + b1*x))).
This will only return values of y = 0 or y = 1 when the expression inside the exponential in the denominator is at negative infinity or infinity.
Now, because your data is linearly separable, and Matlab's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.
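A rough Python/scikit-learn illustration of both points (a weakly regularized logistic regression standing in for glmfit's maximum-likelihood fit):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 100)).reshape(-1, 1)
y = (x.ravel() > 0.5).astype(int)          # perfectly separable in one dimension

model = LogisticRegression(C=1e6, max_iter=10000)   # large C: almost no regularization
print(model.fit(x, y).coef_)                        # slope driven to a very large value

y_flipped = y.copy()
y_flipped[0] = 1                           # mislabel one point, as suggested above
print(model.fit(x, y_flipped).coef_)       # slope drops to a much smaller, finite value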

Bootstrap and asymmetric CI

I'm trying to create a confidence interval for a set of data that is not normally distributed and is heavily skewed to the right. Searching around, I discovered a pretty crude method that consists in using the 97.5th percentile of my data for the upper CL and the 2.5th percentile for the lower CL.
Unfortunately, I need a more sophisticated way!
Then I discovered the bootstrap, precisely the MATLAB bootci function, but I'm having a hard time understanding how to use it properly.
Let's say that M is my matrix containing my data (19x100), and let's say that:
Mean = mean(M,2);
StdDev = sqrt(var(M'))';
How can I compute the asymmetrical CI for every row of the Mean vector using bootci?
Note: earlier, I was computing the CI in this very wrong way: Mean +/- 2 * StdDev, shame on me!
Let's say you have a 100x19 data set. Each column has a different distribution. We'll choose the log normal distribution, so that they skew to the right.
means = repmat(log(1:19), 100, 1);
stdevs = ones(100, 19);
X = lognrnd(means, stdevs);
Notice that each column is from the same distribution, and the rows are separate observations. Most functions in MATLAB operate on the rows by default, so it's always preferable to keep your data this way around.
You can compute bootstrap confidence intervals for the mean using the bootci function.
ci = bootci(1000, @mean, X);
This does 1000 resamplings of your data, calculates the mean for each resampling and then takes the 2.5% and 97.5% quantiles. To show that it's an asymmetric confidence interval about the mean, we can plot the mean and the confidence intervals for each column
plot(mean(X), 'r')
hold on
plot(ci')
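If you end up doing this in Python instead, a plain percentile bootstrap is only a few lines of NumPy (a rough sketch; bootci also offers bias-corrected interval types that this omits):

import numpy as np

# Fake 100x19 data set, skewed to the right, as in the MATLAB example above.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=np.log(np.arange(1, 20)), sigma=1.0, size=(100, 19))

n_boot = 1000
boot_means = np.empty((n_boot, X.shape[1]))
for b in range(n_boot):
    rows = rng.integers(0, X.shape[0], size=X.shape[0])   # resample rows with replacement
    boot_means[b] = X[rows].mean(axis=0)

lower, upper = np.percentile(boot_means, [2.5, 97.5], axis=0)   # asymmetric CI per column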