I'm spinning up on both integer linear programming and Matlab's Problem-Based Workflow. The Mixed Integer Linear Program (MILP) example ends with references to stopping tolerances:
StepTolerance is generally used as a relative bound, meaning iterations end when $|x_i - x_{i+1}| < StepTolerance \cdot (1 + |x_i|)$, or a similar relative measure.
I would have expected a relative tolerance to be used in a manner similar to one of the following:
| x_i - x_{i+1} | < StepTolerance |x_i|
| x_i - x_{i+1} | < StepTolerance | x_i + x_{i+1} |
| x_i - x_{i+1} | < StepTolerance | x_i + x_{i+1} | / 2
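To make the difference concrete, here is a small Python sketch (purely illustrative, not MathWorks code) contrasting the documented test with a "pure" relative test. The `1 + |x_i|` factor makes the bound behave relatively for large iterates, but near zero it degrades gracefully into an absolute bound, where a pure relative test could never trigger:

```python
def doc_criterion(x_prev, x_next, tol):
    # Documented form: |x_i - x_{i+1}| < tol * (1 + |x_i|)
    return abs(x_prev - x_next) < tol * (1 + abs(x_prev))

def pure_relative(x_prev, x_next, tol):
    # "Pure" relative form: |x_i - x_{i+1}| < tol * |x_i|
    return abs(x_prev - x_next) < tol * abs(x_prev)

tol = 1e-6
# Far from zero the extra "+1" is negligible and the two tests agree:
print(doc_criterion(1e6, 1e6 + 0.5, tol), pure_relative(1e6, 1e6 + 0.5, tol))
# Near zero the pure relative test can never fire, while the documented
# form falls back to an absolute test |step| < tol:
print(doc_criterion(0.0, 5e-7, tol), pure_relative(0.0, 5e-7, tol))
```

So the documented criterion is a hybrid: relative away from zero, absolute near it, which is a common way to make a stopping test robust when iterates approach zero.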
Can anyone please explain how the documented behaviour is "relative"?
For me, the same question applies to the (apparently relative) objective FunctionTolerance immediately following the StepTolerance section of the aforementioned stopping tolerances page.
Thanks.
Additional context
The cited MILP example ends in a Matlab message about gap tolerance, which pops up a help box when clicked. In turn, the help box links to the aforementioned stopping tolerances page. It is possible that the stopping tolerances page was too general a reference for the MILP example, and I'll mention this in the original post. However, I'm still curious about how to interpret the relativeness of the tolerance in the more general sense, e.g., say, in a gradient descent context.
Related
I have been working on this all day but I haven't figured it out yet. So I thought I may as well ask on here and see if someone can help.
The problem is as follows:
                ------------
F_input(t) --> |   sample   | --> F_output(t)
                ------------
Given a sample with a known length, density, and spring constant (or Young's modulus), find the 'output' force against time when a known variable force is applied at the 'input'.
My current solution can already discretise the sample into finite elements; however, I am struggling to figure out how the force should transmit, given that the transmission speed in the material itself changes with the force (using the equation c = sqrt(force/(density*area)) for a member under tension).
If someone could point me to a solution or any other helpful resources, it would be highly appreciated.
A method for applying damping to the system would also be helpful but I should be able to figure out that part myself. (losses to the environment via sound or internal heating)
I will remodel the problem in the following way:
                ___                  ___
F_input(t) --> |___|--/\/\/\/\/\/\--|___|
                 m1                   m2
At time t=0 the system is in equilibrium, the distance between the two objects is L, the mass of the left one (object 1) is m1 and the mass of the right one (object 2) is m2.
                      ___                        ___
F_input(t) --> |<-x->|___|--/\/\/\/\/\/\--|<-y->|___|
During the application of the force F_input(t), at time t > 0, denote by x the oriented distance of the position of object 1 from its original position at time t=0. Similarly, at time t > 0, denote by y the oriented distance of the position of object 2 from its original position at time t=0 (see the diagram above). Then the system is subject to the following system of ordinary differential equations:
x'' = -(k/m1) * x + (k/m1) * y + F_input(t)/m1
y'' = (k/m2) * x - (k/m2) * y
When you solve it, you get the change of x and y with time, i.e. you get two functions x = x(t), y = y(t). Then, the output force is
F_output(t) = m2 * y''(t)
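As a minimal sketch (my own illustration, with made-up parameter values), the two-mass model can be integrated numerically with a semi-implicit Euler step. Starting from rest with a constant input force F, the transmitted force oscillates between 0 and 2*F*m2/(m1+m2) about a mean of F*m2/(m1+m2):

```python
def simulate(F_input, m1=1.0, m2=2.0, k=100.0, dt=1e-4, t_end=1.0):
    """Semi-implicit Euler for m1*x'' = k*(y - x) + F_input(t) and
    m2*y'' = k*(x - y).  All parameter values here are illustrative."""
    x = y = vx = vy = 0.0
    t = 0.0
    f_out = []
    for _ in range(int(t_end / dt)):
        ax = (k * (y - x) + F_input(t)) / m1
        ay = k * (x - y) / m2
        vx += ax * dt; vy += ay * dt     # update velocities first...
        x += vx * dt;  y += vy * dt      # ...then positions (symplectic)
        t += dt
        f_out.append(m2 * ay)            # F_output(t) = m2 * y''(t)
    return f_out

# Constant push of 3 N: F_output swings between 0 and
# 2*F*m2/(m1+m2) = 4 N about a mean of F*m2/(m1+m2) = 2 N.
forces = simulate(lambda t: 3.0)
print(max(forces), sum(forces) / len(forces))
```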
The problem isn't well defined as stated. For starters, for F_out to exist there must be some constraint it obeys; otherwise the system has more unknowns than equations.
The discretization will lead you to a system like
M*xpp = -K*x + F
with m=ρ*A*Δx and k=E*A/Δx
But to solve this system with n equations, you either need to know F_in and F_out, or prescribe the motion of one of the nodes, like x_n = 0, which would lead to xpp_n = 0
As for damping, you usually employ proportional damping: a damping matrix proportional to the stiffness matrix, D = α*K, multiplied by the vector of velocities.
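The assembly step described above can be sketched in a few lines of Python (my own illustration with hypothetical material values; masses lumped on the diagonal of M):

```python
def rod_matrices(n, rho, A, E, L, alpha=0.0):
    """Lumped mass vector and tridiagonal stiffness matrix for a rod cut
    into n elements (n+1 nodes): m = rho*A*dx per node, k = E*A/dx per
    element, proportional damping D = alpha*K.  Values are illustrative."""
    dx = L / n
    m, k = rho * A * dx, E * A / dx
    M = [m] * (n + 1)                                  # diagonal of M
    K = [[0.0] * (n + 1) for _ in range(n + 1)]
    for e in range(n):                                 # element assembly
        K[e][e] += k;     K[e + 1][e + 1] += k
        K[e][e + 1] -= k; K[e + 1][e] -= k
    D = [[alpha * kij for kij in row] for row in K]
    return M, K, D

M, K, D = rod_matrices(n=4, rho=7800.0, A=1e-4, E=2.1e11, L=1.0)
# Rows of the free-free K sum to zero: rigid-body motion costs no force,
# which is why you must prescribe a node (e.g. x_n = 0) or know F_out.
print(all(abs(sum(row)) < 1e-6 for row in K))
```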
I am running a linear regression with fixed effect and standard errors clustered by a certain group.
areg ref1 ref1_l1 rf1 ew1 vol_ew1 sk_ew1, a(us_id) vce(cluster us_id)
The one line code is as above and the output is as follows:
Now, the t-stats and the p-values look inconsistent. How can we have a t-stat > 5 and a p-value > 11%? Similarly, the 95% confidence intervals appear to be far wider than coefficient ± 2 standard errors.
What am I missing?
There is nothing inconsistent here. You have a small sample size and a less than parsimonious model and have all but run out of degrees of freedom. Notice how areg won't post an F statistic or a P-value for the model, a strong danger sign. Your t statistics are consistent with checks by hand:
. display 2 * ttail(1, 5.54)
.11368912
. display 2 * ttail(1, 113.1)
.00562868
In short, there is no bug here and no programming issue. It's just a matter of your model over-fitting your data and the side-effects of that.
Similarly, +/- 2 SE for a 95% confidence interval is way off as a rule of thumb here. Again, a hand calculation is instructive:
. display invt(1, 0.975)
12.706205
. display invt(60, 0.975)
2.0002978
. display invt(61, 0.975)
1.9996236
. display invnormal(0.975)
1.959964
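With 1 degree of freedom, Student's t is just the Cauchy distribution, so the hand calculations above have closed forms. A quick Python check (my own sketch, reproducing the Stata numbers):

```python
import math

# t with 1 df is Cauchy, so p-values and critical values are elementary:
def two_sided_p_1df(t):          # Stata: 2 * ttail(1, t)
    return 1 - 2 * math.atan(t) / math.pi

def crit_1df(p):                 # Stata: invt(1, p)
    return math.tan(math.pi * (p - 0.5))

print(two_sided_p_1df(5.54))     # matches Stata's .11368912
print(crit_1df(0.975))           # matches Stata's 12.706205
```

This makes the point vividly: with 1 residual degree of freedom, the 97.5th percentile is about 12.7, not 2, so ± 2 SE is nowhere near a 95% interval.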
I am using the fmincon function in Matlab. I have been trying to figure out what 'constrviolation' means when you run the function and inspect output. When you get an infeasible solution or the solver ends prematurely, you get a non-zero (and non-integer) constrviolation.
I put in a screen shot for reference.
I have searched the documentation and it says it means "Maximum of constraint functions" and I have no idea what that means. It's not an integer number so my first guess was that it is the percentage of constraints violated (or satisfied).
Any help would be appreciated.
Just interpreting the docs given some optimization-background:
constrviolation
Maximum of constraint functions
This is just the maximum of all absolute constraint-function errors
Example:
x0 + x1 = 1
x0 + x1 + x2 = 2
Somehow the solution is:
x = [0.6, 0.5, 0.9]
constrviolation is:
max( abs( 0.6 + 0.5 - 1 ), abs( 0.6 + 0.5 + 0.9 - 2 ) ) = max( 0.1, 0 ) = 0.1
This is bad and technically means: your solution is infeasible! (should converge to zero; e.g. 1e-8)
As the solver did not end very gracefully, it can't give you a real status about the problem (feasible vs. infeasible).
It might be valuable to add: Interior-point algorithms (like used here) might (some do, some don't) iterate through infeasible solutions and only finally converge to a feasible one (if existing)!
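That interpretation can be sketched in a few lines of Python (my own illustrative helper, not fmincon's internal code), reproducing the 0.1 from the worked example; for inequalities g(x) <= 0 only the positive part counts as a violation:

```python
def constrviolation(eq_funs, ineq_funs, x):
    """Maximum constraint violation, in the sense sketched above:
    |h(x)| for equalities h(x) = 0, max(0, g(x)) for inequalities g(x) <= 0."""
    viol = [abs(h(x)) for h in eq_funs] + [max(0.0, g(x)) for g in ineq_funs]
    return max(viol) if viol else 0.0

# The worked example above: x0 + x1 = 1 and x0 + x1 + x2 = 2
eqs = [lambda x: x[0] + x[1] - 1, lambda x: x[0] + x[1] + x[2] - 2]
print(constrviolation(eqs, [], [0.6, 0.5, 0.9]))   # ~0.1, i.e. infeasible
```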
Also bad:
firstorderopt
Measure of first-order optimality
Should converge to zero too (e.g. 1e-8)! Not achieved in your example!
Now there are many possible reasons why this is happening. As you did not provide any code, we can only guess (and won't be happy about it).
You probably hit some iteration limit like MaxFunctionEvaluations or MaxIterations. The ratio of funcCount to iterations looks like numerical differentiation, which can push the number of function calls up a lot!
I am trying to use vowpal wabbit for logistic regression. I am not sure if this is the right syntax to do it
For training, I do
./vw -d ~/Desktop/new_data.txt --passes 20 --binary --cache_file cache.txt -f lr.vw --loss_function logistic --l1 0.05
For testing I do
./vw -d ~/libsvm-3.18_test/matlab/new_data_test.txt --binary -t -i lr.vw -p predictions.txt -r raw_score.txt
Here is a snippet from my train data
-1:1.00038 | 110:0.30103 262:0.90309 689:1.20412 1103:0.477121 1286:1.5563 2663:0.30103 2667:0.30103 2715:4.63112 3012:0.30103 3113:8.38411 3119:4.62325 3382:1.07918 3666:1.20412 3728:5.14959 4029:0.30103 4596:0.30103
1:2601.25 | 32:2.03342 135:3.77379 146:3.19535 284:2.5563 408:0.30103 542:3.80618 669:1.07918 689:2.25527 880:0.30103 915:1.98227 1169:5.35371 1270:0.90309 1425:0.30103 1621:0.30103 1682:0.30103 1736:3.98227 1770:0.60206 1861:4.34341 1900:3.43136 1905:7.54141 1991:5.33791 2437:0.954243 2532:2.68664 3370:2.90309 3497:0.30103 3546:0.30103 3733:0.30103 3963:0.90309 4152:3.23754 4205:1.68124 4228:0.90309 4257:1.07918 4456:0.954243 4483:0.30103 4766:0.30103
Here is a snippet from my test data
-1 | 110:0.90309 146:1.64345 543:0.30103 689:0.30103 1103:0.477121 1203:0.30103 1286:2.82737 1892:0.30103 2271:0.30103 2715:4.30449 3012:0.30103 3113:7.99039 3119:4.08814 3382:1.68124 3666:0.60206 3728:5.154 3960:0.778151 4309:0.30103 4596:0.30103 4648:0.477121
However, if I look at the results, the predictions are all -1 and the raw scores are all 0s. I have around 200,000 examples, of which 100 are +1 and the rest are -1. To handle this imbalance, I gave the positive examples a weight of 200,000/100 and the negative examples a weight of 200,000/(200,000-100). Is this happening because my data is so highly unbalanced, even though I adjusted the weights?
I was expecting P(y|x) in the raw score file, but I get all zeros. I just need the probability outputs. Any suggestions about what's going on?
A similar question was posted on the vw mailing list. I'll try to summarize the main points in all responses here for the benefit of future users.
Unbalanced training sets best practices:
Your training set is highly unbalanced (200,000 to 100). This means that only 0.0005 (0.05%) of examples have a label of 1. By always predicting -1, the classifier achieves a remarkable accuracy of 99.95%. In other words, if the cost of a false-positive is equal to the cost of a false-negative, this is actually an excellent classifier. If you are looking for an equal-weighted result, you need to do two things:
Reweight your examples so the smaller group has total weight equal to the larger one
Reorder/shuffle the examples so positives and negatives are intermixed.
The 2nd point is especially important in online-learning where the learning rate decays with time. It follows that the ideal order, assuming you are allowed to freely reorder (e.g. no time-dependence between examples), for online-learning is a completely uniform shuffle (1, -1, 1, -1, ...)
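As a sketch (my own illustration, not vw code), here is one way to spread a rare class uniformly through the common one after shuffling each class:

```python
import random

def interleave(pos, neg, seed=0):
    """Shuffle each class, then spread the rarer class as evenly as
    possible so an online learner never sees a long run of one label."""
    rng = random.Random(seed)
    pos, neg = pos[:], neg[:]
    rng.shuffle(pos); rng.shuffle(neg)
    big, small = (pos, neg) if len(pos) >= len(neg) else (neg, pos)
    step = len(big) / (len(small) + 1)
    out = big[:]
    for i, ex in enumerate(small):       # insert rare examples at even spacing
        out.insert(int((i + 1) * step) + i, ex)
    return out

mixed = interleave(["+1"] * 3, ["-1"] * 9)
print(mixed)   # the three "+1" examples land at evenly spaced positions
```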
Also note that the syntax for the example-weights (assuming a 2000:1 prevalence ratio) needs to be something like the following:
1 2000 optional-tag| features ...
-1 1 optional-tag| features ...
And, as mentioned above, breaking the single 2000-weighted example down into 2000 repeated copies with a weight of 1 each, interleaved with the 2000 common examples (those with the -1 label), like this:
1 | ...
-1 | ...
1 | ... # repeated, very rare, example
-1 | ...
1 | ... # repeated, very rare, example
Should lead to even better results in terms of smoother convergence and lower training loss. Caveat: as a general rule, repeating any example too much, as with a 1:2000 ratio, is very likely to lead to over-fitting the repeated class. You may want to counter that with a slower learning rate (using --learning_rate ...) and/or randomized resampling (using --bootstrap ...).
Consider downsampling the prevalent class
To avoid over-fitting: rather than overweighting the rare class by 2000x, consider going the opposite way and "underweighting" the more common class by throwing away most of its examples. While this may sound surprising (how can throwing away perfectly good data be beneficial?), it avoids the over-fitting of the repeated class described above and may actually lead to better generalization. Depending on the case and the costs of misclassification, the optimal down-sampling factor may vary (it is not necessarily 1/2000 in this case, but may be anywhere between 1 and 1/2000). Another approach, requiring some programming, is active learning: train on a very small part of the data, then continue predicting without learning (-t or zero weight); if the class is the prevalent class and the online classifier is very certain of the result (the predicted value is extreme, or very close to -1 when using --link glf1), throw the redundant example away. In other words: focus your training on the boundary cases only.
Use of --binary (depends on your need)
--binary outputs the sign of the prediction (and calculates progressive loss accordingly). If you want probabilities, do not use --binary and pipe vw prediction output into utl/logistic (in the source tree). utl/logistic will map the raw prediction into signed probabilities in the range [-1, +1].
One effect of --binary is misleading (optimistic) loss. Clamping predictions to {-1, +1} can dramatically increase the apparent accuracy, as every correct prediction has a loss of 0.0. This might be misleading, as just adding --binary often makes it look as if the model is much more accurate (sometimes perfectly accurate) than without --binary.
Update (Sep 2014): a new option was recently added to vw: --link logistic which implements [0,1] mapping, while predicting, inside vw. Similarly, --link glf1 implements the more commonly needed [-1, 1] mapping. mnemonic: glf1 stands for "generalized logistic function with a [-1, 1] range"
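Both mappings are plain logistic transforms of the raw score; a Python sketch of what the two --link options compute:

```python
import math

def logistic(raw):   # --link logistic: raw score -> probability in [0, 1]
    return 1.0 / (1.0 + math.exp(-raw))

def glf1(raw):       # --link glf1: the same score mapped into [-1, +1]
    return 2.0 / (1.0 + math.exp(-raw)) - 1.0

# At the decision boundary (raw score 0) these give 0.5 and 0.0:
print(logistic(0.0), glf1(0.0))
print(logistic(2.5), glf1(2.5))
```

Note glf1 is just 2*logistic - 1, i.e. a rescaled version of the same sigmoid.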
Go easy on --l1 and --l2
It is a common mistake to use high --l1 and/or --l2 values. The values are used directly per example, rather than, say, relative to 1.0. More precisely: in vw: l1 and l2 apply directly to the sum of gradients (or the "norm") in each example. Try to use much lower values, like --l1 1e-8. utl/vw-hypersearch can help you with finding optimal values of various hyper-parameters.
Be careful with multiple passes
It is a common mistake to use --passes 20 in order to minimize training error. Remember that the goal is to minimize generalization error rather than training error. Even with the cool addition of holdout (thanks to Zhen Qin) where vw automatically early-terminates when error stops going down on automatically held-out data (by default every 10th example is being held-out), multiple passes will eventually start to over-fit the held-out data (the "no free lunch" principle).
Summarizing the detailed answer by arielf.
It is important to know what is the intended final cost (loss) function:
Logistic loss, 0/1 loss (i.e. accuracy), F1 score, Area Under ROC Curve, something else?
Here is a Bash code for part of arielf's answer.
Note that we should first delete the strange attempts at importance weighting from train.txt (I mean the ":1.00038" and ":2601.25" in the question).
A. Prepare the training data
grep '^-1' train.txt | shuf > neg.txt
grep '^1' train.txt | shuf > p.txt
for i in `seq 2000`; do cat p.txt; done > pos.txt
paste -d '\n' neg.txt pos.txt > newtrain.txt
B. Train model.vw
# Note that passes=1 is the default.
# With one pass, holdout_off is the default.
vw -d newtrain.txt --loss_function=logistic -f model.vw
#average loss = 0.0953586
C. Compute test loss using vw
vw -d test.txt -t -i model.vw --loss_function=logistic -r raw_predictions.txt
#average loss = 0.0649306
D. Compute AUROC using http://osmot.cs.cornell.edu/kddcup/software.html
cut -d ' ' -f 1 test.txt | sed -e 's/^-1/0/' > gold.txt
$VW_HOME/utl/logistic -0 raw_predictions.txt > probabilities.txt
perf -ROC -files gold.txt probabilities.txt
#ROC 0.83484
perf -ROC -plot roc -files gold.txt probabilities.txt | head -n -2 > graph
echo 'plot "graph"' | gnuplot -persist
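The number perf's -ROC option reports can be sanity-checked against the rank-based definition of AUROC; a small Python sketch (toy labels and scores, not the data above):

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random
    negative (ties count 1/2) -- the rank-based definition of AUROC."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one positive is ranked below one negative -> 5/6
print(auroc([1, 0, 1, 0, 0], [0.9, 0.2, 0.3, 0.4, 0.1]))
```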
Closed. This question is off-topic. It is not currently accepting answers.
Closed 13 years ago.
I have six parametric equations using 18 (not actually 26) different variables, 6 of which are unknown.
I could sit down with a couple of pads of paper and work out what the equations for each of the unknowns are, but is there a simple programmatic solution (I'm thinking in Matlab) that will spit out the six equations I'm looking for?
EDIT:
Shame this has been closed, but I guess I can see why. In case anyone is still interested, the equations are (I believe) non-linear:
r11^2 = (l_x1*s_x + m_x)^2 + (l_y1*s_y + m_y)^2
r12^2 = (l_x2*s_x + m_x)^2 + (l_y2*s_y + m_y)^2
r13^2 = (l_x3*s_x + m_x)^2 + (l_y3*s_y + m_y)^2
r21^2 = (l_x1*s_x + m_x - t_x)^2 + (l_y1*s_y + m_y - t_y)^2
r22^2 = (l_x2*s_x + m_x - t_x)^2 + (l_y2*s_y + m_y - t_y)^2
r23^2 = (l_x3*s_x + m_x - t_x)^2 + (l_y3*s_y + m_y - t_y)^2
(Squared the r's, good spot @gnovice!)
Where I need to find t_x, t_y, m_x, m_y, s_x and s_y.
Why am I calculating these? There are two points, p1 (at 0,0) and p2 (at t_x,t_y). For each of three coordinates (l_x,l_y){1,2,3} I know the distances (r1 and r2) to that point from p1 and p2, but in a different coordinate system. The variables s_x and s_y define how much I'd need to scale one set of coordinates to get to the other, and m_x, m_y how much I'd need to translate it (with t_x and t_y being a way to account for rotation differences between the two systems).
Oh! And I forgot to mention: I also know that the point (l_x,l_y) is below the higher of p1 and p2, i.e. l_y < max(0,t_y), as well as l_y > 0 and l_y < t_y.
It does seem specific enough that I might have to just get my pad out and work it through mathematically!
If you have the Symbolic Toolbox, you can use the SOLVE function. For example:
>> solve('x^2 + y^2 = z^2','z') %# Solve for the symbolic variable z
ans =
(x^2 + y^2)^(1/2)
-(x^2 + y^2)^(1/2)
You can also solve a system of N equations for N variables. Here's an example with 2 equations, 2 unknowns to solve for (x and y), and 6 parameters (a through f):
>> S = solve('a*x + b*y = c','d*x - e*y = f','x','y')
>> S.x
ans =
(b*f + c*e)/(a*e + b*d)
>> S.y
ans =
-(a*f - c*d)/(a*e + b*d)
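Those closed forms can be spot-checked numerically; a quick Python sketch, plugging in arbitrary parameter values (not from the question) and verifying both original equations are satisfied:

```python
import random

# a..f are arbitrary positive test values, so the denominator a*e + b*d != 0
rng = random.Random(42)
a, b, c, d, e, f = (rng.uniform(1, 5) for _ in range(6))

# The closed-form solution returned by solve():
x = (b * f + c * e) / (a * e + b * d)
y = -(a * f - c * d) / (a * e + b * d)

# Residuals of a*x + b*y = c and d*x - e*y = f should be ~0:
print(abs(a * x + b * y - c), abs(d * x - e * y - f))
```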
Are they linear? If so, you can use principles of linear algebra to set up a 6x6 matrix that represents the system of equations, and solve it using any standard matrix inversion routine...
If they are not linear, then you need to use numerical analysis methods.
As I recall from many years ago, you then create a system of linear approximations to the non-linear equations and solve that linear system iteratively, feeding the answers back into the inputs each time, until some error metric gets sufficiently small, indicating you have reached the solution. It's obviously done with a computer, and I'm sure there are numerical analysis software packages that will do this for you. However, since an arbitrary system of non-linear equations can have an almost unlimited variety of types and levels of complexity, these packages generally can't create the linear approximations for you (except maybe in the most straightforward standard cases), and you will have to do that part manually.
Yes there is (assuming these are linear equations): you create a matrix equation equivalent to your 6 linear equations. For example, if you had the two equations:
6x + 12y = 9
7x - 8y = 14
This could be equivalently represented as:
| 6  12 | | x |   |  9 |
| 7  -8 | | y | = | 14 |
(where the two matrices are multiplied together). Matlab can then solve this for the solution vector (x, y).
I don't have matlab installed, so I'm afraid I'm going to have to leave the details up to you :-)
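To fill in those details a little: for a 2x2 system like the example, the solution can be sketched with Cramer's rule (my own Python illustration; in MATLAB you would just use the backslash operator, A\b):

```python
def solve2x2(a11, a12, a21, a22, b1, b2):
    """Cramer's rule for a 2x2 system; fine for illustration, but for a
    6x6 system use an LU-based solver (e.g. MATLAB's backslash) instead."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("singular system")
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

# The example system: 6x + 12y = 9 and 7x - 8y = 14
x, y = solve2x2(6, 12, 7, -8, 9, 14)
print(x, y)
print(6 * x + 12 * y, 7 * x - 8 * y)   # recovers the right-hand sides
```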
As mentioned above, the answer will depend on whether your equations are linear or nonlinear. For linear systems, you can set up a simple matrix system, but don't use matrix inversion; use LU decomposition instead (if your system is well-conditioned).
For non-linear systems, you'll need to use a more advanced solver, most likely some variation on Newton's method. Essentially you'll give Matlab your six equations, and ask it to simultaneously solve for the root (zero) of all of the equations. There are several caveats and complications that come into play when dealing with non-linear systems, one of which is the need for an initial guess that assigns each of your six unknown variables a value close to the true solution. Without a good initial guess, the solver may take a long time finding a solution, or may not converge to a solution at all, even if one exists.
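A bare-bones sketch of Newton's method for two equations in two unknowns (pure Python with a finite-difference Jacobian; fsolve in MATLAB's Optimization Toolbox does this far more robustly). Note how the initial guess (1, 1) is close to the true root, as the caveats above require:

```python
def newton2(f1, f2, x0, y0, tol=1e-10, max_iter=50, h=1e-7):
    """Newton iteration for f1(x, y) = 0, f2(x, y) = 0 with a
    finite-difference Jacobian.  Illustrative sketch only."""
    x, y = x0, y0
    for _ in range(max_iter):
        r1, r2 = f1(x, y), f2(x, y)
        if abs(r1) < tol and abs(r2) < tol:
            return x, y
        # Finite-difference Jacobian entries
        j11 = (f1(x + h, y) - r1) / h; j12 = (f1(x, y + h) - r1) / h
        j21 = (f2(x + h, y) - r2) / h; j22 = (f2(x, y + h) - r2) / h
        det = j11 * j22 - j12 * j21
        x -= (r1 * j22 - r2 * j12) / det   # Newton step: solve J*d = r
        y -= (j11 * r2 - j21 * r1) / det
    return x, y

# Circle x^2 + y^2 = 4 intersected with the line y = x: root at (sqrt(2), sqrt(2))
x, y = newton2(lambda x, y: x * x + y * y - 4, lambda x, y: y - x, 1.0, 1.0)
print(x, y)
```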
Decades ago, MIT developed MACSYMA, a symbolic algebra system for just this kind of thing. MIT sold MACSYMA to Symbolics, which has pretty well folded, dried up, and blown away. However, because of the miracle of military funding, an early version of MACSYMA was required to be released to the government. THAT version was subsequently released under the GPL, and is continuing to be maintained, under the name MAXIMA.
See http://maxima.sourceforge.net/ for more information.