Normalize in Adaboost without numerical error - Matlab - matlab

I'm implementing AdaBoost on Matlab. This algorithm requires that in every iteration the weights of each data point in the training set sum up to one.
If I simply use the following normalization v = v / sum(v) I get a vector whose 1-norm is 1 except some numerical error which later leads to the failure of the algorithm.
Is there a matlab function for normalizing a vector so that it's 1-norm is EXACTLY 1?

Assuming you want identical values to be normalised with the same factor, this is not possible. Simple counter example:
v=ones(21,1);
v = v / sum(v);
sum(v)-1
One common way to deal with it, is enforce values sum(v)>=1 or sum(v)<=1 if your algorithm can deal with a derivation to one side:
if sum(v)>1
v=v-eps(v)
end
Alternatively you can try using vpa, but this will drastically increase your computation time.

Related

Optimizing huge amounts of calls of fsolve in Matlab

I'm solving a pair of non-linear equations for each voxel in a dataset of a ~billion voxels using fsolve() in MATLAB 2016b.
I have done all the 'easy' optimizations that I'm aware of. Memory localization is OK, I'm using parfor, the equations are in fairly numerically simple form. All discontinuities of the integral are fed to integral(). I'm using the Levenberg-Marquardt algorithm with good starting values and a suitable starting damping constant, it converges on average with 6 iterations.
I'm now at ~6ms per voxel, which is good, but not good enough. I'd need a order of magnitude reduction to make the technique viable. There's only a few things that I can think of improving before starting to hammer away at accuracy:
The splines in the equation are for quick sampling of complex equations. There are two for each equation, one is inside the 'complicated nonlinear equation'. They represent two equations, one which is has a large amount of terms but is smooth and has no discontinuities and one which approximates a histogram drawn from a spectrum. I'm using griddedInterpolant() as the editor suggested.
Is there a faster way to sample points from pre-calculated distributions?
parfor i=1:numel(I1)
sols = fsolve(#(x) equationPair(x, input1, input2, ...
6 static inputs, fsolve options)
output1(i) = sols(1); output2(i) = sols(2)
end
When calling fsolve, I'm using the 'parametrization' suggested by Mathworks to input the variables. I have a nagging feeling that defining a anonymous function for each voxel is taking a large slice of the time at this point. Is this true, is there a relatively large overhead for defining the anonymous function again and again? Do I have any way to vectorize the call to fsolve?
There are two input variables which keep changing, all of the other input variables stay static. I need to solve one equation pair for each input pair so I can't make it a huge system and solve it at once. Do I have any other options than fsolve for solving pairs of nonlinear equations?
If not, some of the static inputs are the fairly large. Is there a way to keep the inputs as persistent variables using MATLAB's persistent, would that improve performance? I only saw examples of how to load persistent variables, how could I make it so that they would be input only once and future function calls would be spared from the assumedly largish overhead of the large inputs?
EDIT:
The original equations in full form look like:
Where:
and:
Everything else is known, solving for x_1 and x_2. f_KN was approximated by a spline. S_low (E) and S_high(E) are splines, the histograms they are from look like:
So, there's a few things I thought of:
Lookup table
Because the integrals in your function do not depend on any of the parameters other than x, you could make a simple 2D-lookup table from them:
% assuming simple (square) range here, adjust as needed
[x1,x2] = meshgrid( linspace(0, xmax, N) );
LUT_high = zeros(size(x1));
LUT_low = zeros(size(x1));
for ii = 1:N
LUT_high(:,ii) = integral(#(E) Fhi(E, x1(1,ii), x2(:,ii)), ...
0, E_high, ...
'ArrayValued', true);
LUT_low(:,ii) = integral(#(E) Flo(E, x1(1,ii), x2(:,ii)), ...
0, E_low, ...
'ArrayValued', true);
end
where Fhi and Flo are helper functions to compute those integrals, vectorized with scalar x1 and vector x2 in this example. Set N as high as memory will allow.
Those lookup tables you then pass as parameters to equationPair() (which allows parfor to distribute the data). Then just use interp2 in equationPair():
F(1) = I_high - interp2(x1,x2,LUT_high, x(1), x(2));
F(2) = I_low - interp2(x1,x2,LUT_low , x(1), x(2));
So, instead of recomputing the whole integral every time, you evaluate it once for the expected range of x, and reuse the outcomes.
You can specify the interpolation method used, which is linear by default. Specify cubic if you're really concerned about accuracy.
Coarse/Fine
Should the lookup table method not be possible for some reason (memory limitations, in case the possible range of x is too big), here's another thing you could do: split up the whole procedure in 2 parts, which I'll call coarse and fine.
The intent of the coarse method is to improve your initial estimates really quickly, but perhaps not so accurately. The quickest way to approximate that integral by far is via the rectangle method:
do not approximate S with a spline, just use the original tabulated data (so S_high/low = [S_high/low#E0, S_high/low#E1, ..., S_high/low#E_high/low]
At the same values for E as used by the S data (E0, E1, ...), evaluate the exponential at x:
Elo = linspace(0, E_low, numel(S_low)).';
integrand_exp_low = exp(x(1)./Elo.^3 + x(2)*fKN(Elo));
Ehi = linspace(0, E_high, numel(S_high)).';
integrand_exp_high = exp(x(1)./Ehi.^3 + x(2)*fKN(Ehi));
then use the rectangle method:
F(1) = I_low - (S_low * Elo) * (Elo(2) - Elo(1));
F(2) = I_high - (S_high * Ehi) * (Ehi(2) - Ehi(1));
Running fsolve like this for all I_low and I_high will then have improved your initial estimates x0 probably to a point close to "actual" convergence.
Alternatively, instead of the rectangle method, you use trapz (trapezoidal method). A tad slower, but possibly a bit more accurate.
Note that if (Elo(2) - Elo(1)) == (Ehi(2) - Ehi(1)) (step sizes are equal), you can further reduce the number of computations. In that case, the first N_low elements of the two integrands are identical, so the values of the exponentials will only differ in the N_low + 1 : N_high elements. So then just compute integrand_exp_high, and set integrand_exp_low equal to the first N_low elements of integrand_exp_high.
The fine method then uses your original implementation (with the actual integral()s), but then starting at the updated initial estimates from the coarse step.
The whole objective here is to try and bring the total number of iterations needed down from about 6 to less than 2. Perhaps you'll even find that the trapz method already provides enough accuracy, rendering the whole fine step unnecessary.
Vectorization
The rectangle method in the coarse step outlined above is easy to vectorize:
% (uses R2016b implicit expansion rules)
Elo = linspace(0, E_low, numel(S_low));
integrand_exp_low = exp(x(:,1)./Elo.^3 + x(:,2).*fKN(Elo));
Ehi = linspace(0, E_high, numel(S_high));
integrand_exp_high = exp(x(:,1)./Ehi.^3 + x(:,2).*fKN(Ehi));
F = [I_high_vector - (S_high * integrand_exp_high) * (Ehi(2) - Ehi(1))
I_low_vector - (S_low * integrand_exp_low ) * (Elo(2) - Elo(1))];
trapz also works on matrices; it will integrate over each column in the matrix.
You'd call equationPair() then using x0 = [x01; x02; ...; x0N], and fsolve will then converge to [x1; x2; ...; xN], where N is the number of voxels, and each x0 is 1×2 ([x(1) x(2)]), so x0 is N×2.
parfor should be able to slice all of this fairly easily over all the workers in your pool.
Similarly, vectorization of the fine method should also be possible; just use the 'ArrayValued' option to integral() as shown above:
F = [I_high_vector - integral(#(E) S_high(E) .* exp(x(:,1)./E.^3 + x(:,2).*fKN(E)),...
0, E_high,...
'ArrayValued', true);
I_low_vector - integral(#(E) S_low(E) .* exp(x(:,1)./E.^3 + x(:,2).*fKN(E)),...
0, E_low,...
'ArrayValued', true);
];
Jacobian
Taking derivatives of your function is quite easy. Here is the derivative w.r.t. x_1, and here w.r.t. x_2. Your Jacobian will then have to be a 2×2 matrix
J = [dF(1)/dx(1) dF(1)/dx(2)
dF(2)/dx(1) dF(2)/dx(2)];
Don't forget the leading minus sign (F = I_hi/lo - g(x) → dF/dx = -dg/dx)
Using one or both of the methods outlined above, you can implement a function to compute the Jacobian matrix and pass this on to fsolve via the 'SpecifyObjectiveGradient' option (via optimoptions). The 'CheckGradients' option will come in handy there.
Because fsolve usually spends the vast majority of its time computing the Jacobian via finite differences, manually computing a value for it manually will normally speed the algorithm up tremendously.
It will be faster, because
fsolve doesn't have to do extra function evaluations to do the finite differences
the convergence rate will increase due to the improved precision of the Jacobian
Especially if you use the rectangle method or trapz like above, you can reuse many of the computations you've already done for the function values themselves, meaning, even more speed-up.
Rody's answer was the correct one. Supplying the Jacobian was the single largest factor. Especially with the vectorized version, there were 3 orders of magnitude of difference in speed with the Jacobian supplied and not.
I had trouble finding information about this subject online so I'll spell it out here for future reference: It is possible to vectorize independant parallel equations with fsolve() with great gains.
I also did some work with inlining fsolve(). After supplying the Jacobian and being smarter about the equations, the serial version of my code was mostly overhead at ~1*10^-3 s per voxel. At that point most of the time inside the function was spent passing around a options -struct and creating error-messages which are never sent + lots of unused stuff assumedly for the other optimization functions inside the optimisation function (levenberg-marquardt for me). I succesfully butchered the function fsolve and some of the functions it calls, dropping the time to ~1*10^-4s per voxel on my machine. So if you are stuck with a serial implementation e.g. because of having to rely on the previous results it's quite possible to inline fsolve() with good results.
The vectorized version provided the best results in my case, with ~5*10^-5 s per voxel.

Using Linear Prediction Over Time Series to Determine Next K Points

I have a time series of N data points of sunspots and would like to predict based on a subset of these points the remaining points in the series and then compare the correctness.
I'm just getting introduced to linear prediction using Matlab and so have decided that I would go the route of using the following code segment within a loop so that every point outside of the training set until the end of the given data has a prediction:
%x is the data, training set is some subset of x starting from beginning
%'unknown' is the number of points to extend the prediction over starting from the
%end of the training set (i.e. difference in length of training set and data vectors)
%x_pred is set to x initially
p = length(training_set);
coeffs = lpc(training_set, p);
for i=1:unknown
nextValue = -coeffs(2:end) * x_pred(end-unknown-1+i:-1:end-unknown-1+i-p+1)';
x_pred(end-unknown+i) = nextValue;
end
error = norm(x - x_pred)
I have three questions regarding this:
1) Does this appropriately do what I have described? I ask because my error seems rather large (>100) when predicting over only the last 20 points of a dataset that has hundreds of points.
2) Am I interpreting the second argument of lpc correctly? Namely, that it means the 'order' or rather number of points that you want to use in predicting the next point?
3) If this is there a more efficient, single line function in Matlab that I can call to replace the looping and just compute all necessary predictions for me given some subset of my overall data as a training set?
I tried looking through the lpc Matlab tutorial but it didn't seem to do the prediction as I have described my needs require. I have also been using How to use aryule() in Matlab to extend a number series? as a reference.
So after much deliberation and experimentation I have found the above approach to be correct and there does not appear to be any single Matlab function to do the above work. The large errors experienced are reasonable since I am using a linear prediction algorithm for a problem (i.e. sunspot prediction) that has inherent nonlinear behavior.
Hope this helps anyone else out there working on something similar.

Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit'-'), the resulting plot of the model essentially looked like the lower 1/4th of the 'S' shaped plot that is typical with logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and `ii = 1:length(loopVal)'.
The stats parameter has a great correlation coefficient (0.9973), but the p values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mrnfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB document states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is this only compared to dev values from other models?
It sounds like your data may be linearly separable. In short, that means since your input data is one dimensional, that there is some value of x such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional this means you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form:
This will only return values of y = 0 or y = 1 when the expression within the exponential in the denominator is at negative infinity or infinity.
Now, because your data is linearly separable, and Matlab's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.

How do I use scipy.optimize to minimize a set of functions?

[EDIT: The fmin() method is a good choice for my problem. However, my problem was that one of the axes was a sum of the other axes. I wasn't recalculating the y axis after applying the multiplier. Thus, the value returned from my optimize function was always returning the same value. This gave fmin no direction so it's chosen multipliers were very close together. Once the calculations in my optimize function were corrected fmin chose a larger range.]
I have two datasets that I want to apply multipliers to to see what values could 'improve' their correlation coefficients.
For example, say data set 1 has a correlation coefficient of -.6 and data set 2 has .5.
I can apply different multipliers to each of these data sets that might improve the coefficient. I would like to find a set of multipliers to choose for these two data sets that optimizing the correlation coefficients of each set.
I have written an objective function that takes a list of multipliers, applies them to the data sets, calculates the correlation coefficient (scipy.stats.spearmanr()), and sums these coefficients. So I need to use something from scipy.optimize to pass a set of multipliers to this function and find the set that optimizes this sum.
I have tried using optimize.fmin and several others. However, I want the optimization technique to use a much larger range of multipliers. For example, my data sets might have values in the millions, but fmin will only choose multipliers around 1.0, 1.05, etc. This isn't a big enough value to modify these correlation coefficients in any meaningful way.
Here is some sample code of my objective function:
def objective_func(multipliers):
for multiplier in multipliers:
for data_set in data_sets():
x_vals = getDataSetXValues()
y_vals = getDataSetYValues()
xvals *= muliplier
coeffs.append(scipy.stats.spearmanr(x_vals, y_vals)
return -1 * sum(coeffs)
I'm using -1 because I actually want the biggest value, but fmin is for minimization.
Here is a sample of how I'm trying to use fmin:
print optimize.fmin(objective_func)
The multipliers start at 1.0 and just range between 1.05, 1.0625, etc. I can see in the actual fmin code where these values are chosen. I ultimately need another method to call to give the minimization a range of values to check for, not all so closely related.
Multiplying the x data by some factor won't really change the Spearman rank-order correlation coefficient, though.
>>> x = numpy.random.uniform(-10,10,size=(20))
>>> y = numpy.random.uniform(-10,10,size=(20))
>>> scipy.stats.spearmanr(x,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*10,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*1e6,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*1e-16,y)
(-0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*(-2),y)
(0.24661654135338346, 0.29455199407204263)
>>> scipy.stats.spearmanr(x*(-2e6),y)
(0.24661654135338346, 0.29455199407204263)
(The second term in the tuple is the p value.)
You can change its sign, if you flip the signs of the terms, but the whole point of Spearman correlation is that it tells you the degree to which any monotonic relationship would capture the association. Probably that explains why fmin isn't changing the multiplier much: it's not getting any feedback on direction, because the returned value is constant.
So I don't see how what you're trying to do can work.
I'm also not sure why you've chosen the sum of all the the Spearman coefficients and the p values as what you're trying to maximize: the Spearman coefficients can be negative, so you probably want to square them, and you haven't mentioned the p values, so I'm not sure why you're throwing them in.
[It's possible I guess that we're working with different scipy versions and our spearmanr functions return different things. I've got 0.9.0.]
You probably don't want to minimize the sum of coefficients but the sum of squares. Also, if the multipliers can be chosen independently, why are you trying to optimize them all at the same time? Can you post your current code and some sample data?

How to find minimum of nonlinear, multivariate function using Newton's method (code not linear algebra)

I'm trying to do some parameter estimation and want to choose parameter estimates that minimize the square error in a predicted equation over about 30 variables. If the equation were linear, I would just compute the 30 partial derivatives, set them all to zero, and use a linear-equation solver. But unfortunately the equation is nonlinear and so are its derivatives.
If the equation were over a single variable, I would just use Newton's method (also known as Newton-Raphson). The Web is rich in examples and code to implement Newton's method for functions of a single variable.
Given that I have about 30 variables, how can I program a numeric solution to this problem using Newton's method? I have the equation in closed form and can compute the first and second derivatives, but I don't know quite how to proceed from there. I have found a large number of treatments on the web, but they quickly get into heavy matrix notation. I've found something moderately helpful on Wikipedia, but I'm having trouble translating it into code.
Where I'm worried about breaking down is in the matrix algebra and matrix inversions. I can invert a matrix with a linear-equation solver but I'm worried about getting the right rows and columns, avoiding transposition errors, and so on.
To be quite concrete:
I want to work with tables mapping variables to their values. I can write a function of such a table that returns the square error given such a table as argument. I can also create functions that return a partial derivative with respect to any given variable.
I have a reasonable starting estimate for the values in the table, so I'm not worried about convergence.
I'm not sure how to write the loop that uses an estimate (table of value for each variable), the function, and a table of partial-derivative functions to produce a new estimate.
That last is what I'd like help with. Any direct help or pointers to good sources will be warmly appreciated.
Edit: Since I have the first and second derivatives in closed form, I would like to take advantage of them and avoid more slowly converging methods like simplex searches.
The Numerical Recipes link was most helpful. I wound up symbolically differentiating my error estimate to produce 30 partial derivatives, then used Newton's method to set them all to zero. Here are the highlights of the code:
__doc.findzero = [[function(functions, partials, point, [epsilon, steps]) returns table, boolean
Where
point is a table mapping variable names to real numbers
(a point in N-dimensional space)
functions is a list of functions, each of which takes a table like
point as an argument
partials is a list of tables; partials[i].x is the partial derivative
of functions[i] with respect to 'x'
epilson is a number that says how close to zero we're trying to get
steps is max number of steps to take (defaults to infinity)
result is a table like 'point', boolean that says 'converged'
]]
-- See Numerical Recipes in C, Section 9.6 [http://www.nrbook.com/a/bookcpdf.php]
function findzero(functions, partials, point, epsilon, steps)
epsilon = epsilon or 1.0e-6
steps = steps or 1/0
assert(#functions > 0)
assert(table.numpairs(partials[1]) == #functions,
'number of functions not equal to number of variables')
local equations = { }
repeat
if Linf(functions, point) <= epsilon then
return point, true
end
for i = 1, #functions do
local F = functions[i](point)
local zero = F
for x, partial in pairs(partials[i]) do
zero = zero + lineq.var(x) * partial(point)
end
equations[i] = lineq.eqn(zero, 0)
end
local delta = table.map(lineq.tonumber, lineq.solve(equations, {}).answers)
point = table.map(function(v, x) return v + delta[x] end, point)
steps = steps - 1
until steps <= 0
return point, false
end
function Linf(functions, point)
-- distance using L-infinity norm
assert(#functions > 0)
local max = 0
for i = 1, #functions do
local z = functions[i](point)
max = math.max(max, math.abs(z))
end
return max
end
You might be able to find what you need at the Numerical Recipes in C web page. There is a free version available online. Here (PDF) is the chapter containing the Newton-Raphson method implemented in C. You may also want to look at what is available at Netlib (LINPack, et. al.).
As an alternative to using Newton's method the Simplex Method of Nelder-Mead is ideally suited to this problem and referenced in Numerical Recpies in C.
Rob
You are asking for a function minimization algorithm. There are two main classes: local and global. Your problem is least squares so both local and global minimization algorithms should converge to the same unique solution. Local minimization is far more efficient than global so select that.
There are many local minimization algorithms but one particularly well suited to least squares problems is Levenberg-Marquardt. If you don't have such a solver to hand (e.g. from MINPACK) then you can probably get away with Newton's method:
x <- x - (hessian x)^-1 * grad x
where you compute the inverse matrix multiplied by a vector using a linear solver.
Since you already have the partial derivatives, how about a general gradient-descent approach?
Maybe you think you have a good-enough solution, but for me, the easiest way to think about this is to understand it in the 1-variable case first, and then extend it to the matrix case.
In the 1-variable case, if you divide the first derivative by the second derivative, you get the (negative) step size to your next trial point, e.g. -V/A.
In the N-variable case, the first derivative is a vector and the second derivative is a matrix (the Hessian). You multiply the derivative vector by the inverse of the second derivative, and the result is the negative step-vector to your next trial point, e.g. -V*(1/A)
I assume you can get the 2nd-derivative Hessian matrix. You will need a routine to invert it. There are plenty of these around in various linear algebra packages, and they are quite fast.
(For readers who are not familiar with this idea, suppose the two variables are x and y, and the surface is v(x,y). Then the first derivative is the vector:
V = [ dv/dx, dv/dy ]
and the second derivative is the matrix:
A = [dV/dx]
[dV/dy]
or:
A = [ d(dv/dx)/dx, d(dv/dy)/dx]
[ d(dv/dx)/dy, d(dv/dy)/dy]
or:
A = [d^2v/dx^2, d^2v/dydx]
[d^2v/dxdy, d^2v/dy^2]
which is symmetric.)
If the surface is parabolic (constant 2nd derivative) it will get to the answer in 1 step. On the other hand, if the 2nd derivative is very not-constant, you could encounter oscillation. Cutting each step in half (or some fraction) should make it stable.
If N == 1, you'll see that it does the same thing as in the 1-variable case.
Good luck.
Added: You wanted code:
double X[N];
// Set X to initial estimate
while(!done){
double V[N]; // 1st derivative "velocity" vector
double A[N*N]; // 2nd derivative "acceleration" matrix
double A1[N*N]; // inverse of A
double S[N]; // step vector
CalculateFirstDerivative(V, X);
CalculateSecondDerivative(A, X);
// A1 = 1/A
GetMatrixInverse(A, A1);
// S = V*(1/A)
VectorTimesMatrix(V, A1, S);
// if S is small enough, stop
// X -= S
VectorMinusVector(X, S, X);
}
My opinion is to use a stochastic optimizer, e.g., a Particle Swarm method.