I am building receiver operating characteristic (ROC) curves to evaluate classifiers using the area under the curve (AUC) (more details on that at end of post). Unfortunately, points on the curve often go below the diagonal. For example, I end up with graphs that look like the one here (ROC curve in blue, identity line in grey) :
The the third point (0.3, 0.2) goes below the diagonal. To calculate AUC I want to fix such recalcitrant points.
The standard way to do this, for point (fp, tp) on the curve, is to replace it with a point (1-fp, 1-tp), which is equivalent to swapping the predictions of the classifier. For instance, in our example, our troublesome point A (0.3, 0.2) becomes point B (0.7, 0.8), which I have indicated in red in the image linked to above.
This is about as far as my references go in treating this issue. The problem is that if you add the new point into a new ROC (and remove the bad point), you end up with a nonmonotonic ROC curve as shown (red is the new ROC curve, and dotted blue line is the old one):
And here I am stuck. How can I fix this ROC curve?
Do I need to re-run my classifier with the data or classes somehow transformed to take into account this weird behavior? I have looked over a relevant paper, but if I am not mistaken, it seems to be addressing a slightly different problem than this.
In terms of some details: I still have all the original threshold values, fp values, and tp values (and the output of the original classifier for each data point, an output which is just a scalar from 0 to 1 that is a probability estimate of class membership). I am doing this in Matlab starting with the perfcurve function.
Note based on some very helpful emails about this from the people that wrote the articles cited above, and the discussion above, the right answer seems to be: do not try to "fix" individual points in an ROC curve unless you build an entirely new classifier, and then be sure to leave out some test data to see if that was a reasonable thing to do.
Getting points below the identity line is something that simply happens. It's like getting an individual classifier that scores 45% correct even though the optimal theoretical minimum is 50%. That's just part of the variability with real data sets, and unless it is significantly less than expected based on chance, it isn't something you should worry too much about. E.g., if your classifier gets 20% correct, then clearly something is amiss and you might look into the specific reasons and fix your classifier.
Yes, swapping a point for (1-fp, 1-tp) is theoretically effective, but increasing sample size is a safe bet too.
It does seem that your system has a non-monotonic response characteristic so be careful not to bend the rules of the ROC too much or you will impact the robustness of the AUC.
That said, you could try to use a Pareto Frontier Curve (Pareto Front). If that fits the requirements of "Repairing Concavities" then you'll basically sort the points so that the ROC curve becomes monotonic.
Related
I have a question about curve fitting, I have many curves like the one in the picture.
X axis : time
Y axis : temperature
Each sample comes out every 30s.
GOAL : predict the value at the end of the transient
What would you do in this situation?
What I am doing is this :
for every new sample I start a new fitting (and so each fitting is independent from the previous one) and check the value of the fitted curve 2 hours (all curves I have set before 2h) after the start of the measurement. If for a number (let's say 5) of subsequent fitting the value in the future stays more or less the same(+-0.2°C) I so assume that the estimation is the right one.
This approach seems to me far too simple and I think I am not exploiting all information. For example the info of the error I am making punctually (e.g. at minute 4:00 I predict and at 4:30 I see that I am doing an error).
In the picture the red part of the curve is excluded (but the real data in the future passes through it). the estimation is the blue one. You see in this case I don't have a good prediction... In general I have also more flat curves.
Based on the comments above, I tried to formulate an answer as no one else is giving some input.
I think your are using a good basic procedure. Better results may be obtained by using a more appropriate fitting curve, which includes all the dominant dynamics, but avoids overfitting of the data. Based on your figure, the simplest form I could think of is:
s + a(1-e^(-t/tau))
with parameters s (the initial temperature), a (amplitude = steady state value) and tau (dominant time constant). As you mentioned yourself, limiting the allowed range for the parameters may avoid overfitting and increase the quality of your estimation.
Using a random high order function, like you are using now, may give good interpolation results, but are dangerous to use for extrapolation, because strange effects may occur outside the fitting region.
Alternatives
Using the error (eg. correcting for the extrapolated error) may be possible, but is tricky and may not always give good results.
Training a neural network to perform the estimation is probably overkill, but may give better results if applied correctly. Note that you need a lot of training data which should be representative for the data for which you will use the neural network later on.
I wanted to extrapolate some of the data I had, as shown in the plot below. The blue line is the original data and the red line is the extrapolation that I wanted.
To use regression analysis, I used the function polyfit:
sizespecial = size(i_C);
endgoal = sizespecial(2);
plothelp = 1:endgoal;
reg1 = polyfit(plothelp,i_C,2);
reg2 = polyfit(plothelp,i_D,2);
Where i_C and i_D are the vectors that represent the original data. I extended the data by using this code:
plothelp=1:endgoal+11;
for in = endgoal+1:endgoal+11
i_C(in) = (reg1(1)*(in^2))+(reg1(2)*in)+reg1(3);
i_D(in) = (reg2(1)*(in^2))+(reg2(2)*in)+reg2(3);
end
However, the graph I output now is:
I do not understand why the extra notch is introduced (circled in red). Do not hesitate to ask me to clarify any of the details on this questions and thank you for all your answers.
What I imagine is happening is that you are trying fit a second order polynomial over all your data. My guess is that this polynomial will look a lot like the curve I have drawn in in orange. If you follow Matt's advise from his comment and plot your regressed polynomial over the your original data as well (not just the extrapolated part) you should confirm this.
You might get better results by fitting a higher order polynomial. Your data have two points of inflection so a 3rd order polynomial will probably work quite well. One danger of extrapolating on higher order polynomial however is that they could have fairly dramatic inflections outside of the domain of your data and produce unexpected and wild results.
One way to mitigate against this is by rather performing a linear regression over the final x data points of your series. These are the points highlighted in yellow in the figure. You can tune x as a parameter such that it covers as much of the approximately linear final portion of your curve as makes sense. The red line I have drawn in will be the result of a linear regression performed on only those data (as opposed to the entire data set)
Another option might be to rather fit a spline curve and extrapolate on that. You can use the interp1 function specifying 'spline' or 'pchip' for that.
However which is the best choice will depend largely on the nature of the problem you are trying to solve.
I often find myself fitting a scatter plot, and knowing that the 'true fit' should have only one inflection point. Any ideas for forcing a fit that will obey this?
I am using Matlab and Microsoft Excel
Many thanks
Option 1:
I like to use spline smoothing with Akaike information criteria, and while it is a hyper-parametric fit and has a large number of analytic candidate inflection points, the smoothed data at the sample points tends to reveal only what is within the data.
If your data doesn't actually have an inflection point, this is indicated. If it does, it is also usually captured. Statistical jargon for an important cousin to this is called a "non-informative prior".
Try slides 30-31 here: link.
Option 2:
If you have an older version of MatLab then you can specify the exact model easily in the "cftool" (not the same as sftool) then get m-file that gives how you put it into your own script. Pick a model appropriate to your data.
I have about 100 data points which mostly satisfying a certain function (but some points are off). I would like to plot all those points in a smooth curve but the problem is the points are not uniformly distributed. So is that anyway to get the smooth curve? I am thinking to interpolate some points in between, but the only way that comes up to my mind is to linearly insert some artificial points between two data points. But that will show a pretty weird shape (like some sharp corner). So any better idea? Thanks.
If you know more or less what the actual curve should be, you can try to fit that curve to your points (e.g. using polyfit). Depending on how many points are off and how far, you can get by with least squares regression (which is fairly easy to get working). If you have too many outliers (or they are much too large/small), you can also try robust regression (e.g. least absolute deviation fitting) using the robustfit function.
If you can manually determine the outliers, you can also fit a curve through the other points to get better results or even use interpolation methods (e.g. interp1 in MATLAB) on those points to get a smoother curve.
If you know which function describes your data, robust fitting (using, e.g. ROBUSTFIT, or the new convenient functions LINEARMODEL and NONLINEARMODEL with the robust option) is a good way to go if there are outliers in your data.
If you don't know the function that describes your data, but want a smooth trendline that is little affected by outliers, SMOOTHN from the File Exchange does an excellent job in my experience.
Have you looked at the use of smoothing splines? Like interpolating splines, but with the knot points and coefficients chosen to minimise a least-squares error function. There is an excellent implementation available from Matlab central which I have used successfully.
I have a set of points, (x, y), where each y has an error range y.low to y.high. Assume a linear regression is appropriate (in some cases the data may originally have followed a power law, but has been transformed [log, log] to be linear).
Calculating a best fit line is easy, but I need to make sure the line stays within the error range for every point. If the regressed line goes outside the ranges, and I simply push it up or down to stay between, is this the best fit available, or might the slope need changed as well?
I realize that in some cases, a lower bound of 1 point and an upper bound of another point may require a different slope, in which case presumably just touching those 2 bounds is the best fit.
The constrained problem as stated can have both a different intercept and a different slope compared to the unconstrained problem.
Consider the following example (the solid line shows the OLS fit):
Now if you imagine very tight [y.low; y.high] bounds around the first two points and extremely loose bounds over the last one. The constrained fit would be close to the dotted line. Clearly, the two fits have different slopes and different intercepts.
Your problem is essentially the least squares with linear inequality constraints. The relevant algorithms are treated, for example, in "Solving least squares problems" by Charles L. Lawson and Richard J. Hanson.
Here is a direct link to the relevant chapter (I hope the link works). Your problem can be trivially transformed to Problem LSI (by multiplying your y.high constraints by -1).
As far as coding this up, I'd suggest taking a look at LAPACK: there may already be a function there that solves this problem (I haven't checked).
I know MATLAB has an optimization library that can do constrained SQP (sequential quadratic programming) and also lots of other methods for solving quadratic minimization problems with inequality constraints. The cost function you want to minimize will be the sum of the squared errors between your fit and the data. The constraints are those you mentioned. I'm sure there are free libraries that do the same thing too.