I'm writing a test tomorrow and I'm contemplating doing everything on Matlab, to save time.
Some questions require numerical integration of datapoints (points, not necessarily functions).
E.g.
C=[0 1 5 8 10 8 6 4 3 2.2 1.5 0.6 0];
I've used trapz(C) to determine the integral of the data (area under the curve) and compared that to what my textbook gets.
Often, there is too large a difference between the two.
Is there another easy and fast way that the above data can be integrated numerically using Matlab, e.g. by using Simpson's rule, Gauleg or spline?
I've taken a look at integrate(), but that seems to work only on functions?
Are your data points spaced by dx = 1? If dx is 0.5, for example, this would change the result by a factor of two.
Otherwise, I'd point out: the data points by themselves, assuming zero width, will produce an area of 0; the point being that your textbook must be assuming some kind of interpolation between them to get a meaningful integral. If they are straight line segments, trapz(C) should give you the correct result; if your textbook is doing something else (points taken from a smooth function, for example), it is not surprising the results would be different.
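As a sketch (the spacing dx = 0.5 below is only an example; use whatever spacing your data actually has), trapz can take the sample locations explicitly, and a cubic spline through the points gives an easy alternative estimate to compare against:
C = [0 1 5 8 10 8 6 4 3 2.2 1.5 0.6 0];
dx = 0.5;                          % example spacing; trapz(C) alone assumes dx = 1
x  = (0:numel(C)-1) * dx;
A_trapz  = trapz(x, C);            % trapezoidal rule on the actual grid
pp       = spline(x, C);           % cubic spline through the data points
A_spline = integral(@(t) ppval(pp, t), x(1), x(end));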
I need to compute a weighted moving average without loops and without storing all the information. The weight could be linear, so that an old sample is weighted less than a new one.
For example, using a 20 samples window, my weights vector would be:
[1 2 3 4 5 ... 20]
I'm using the following formula to compute the moving mean:
newMean = currMean + (newSample - currMean)/WindowSize
Now I need to "inject" the weights.
What I can know:
1. which sample I'm considering (the 14th, the 26th, ...); I can count.
2. of course, I know currMean.
What I could do but don't want to:
1. store all the samples (in my case each is a 1200 x 1980 x 3 matrix; I simply can't store them all).
I'm currently using Matlab, but I really do not need the code, just the concept, if it exists.
Thank you.
Look into techniques from digital signal processing. You are describing an FIR filter, which can be implemented as a convolution, or as a memory-efficient circuit. Basically, you can rewrite it as a recursive equation that keeps only a filter-length's worth of past intermediate state variables. MATLAB does this in the filter function (you can chain the internal state to continue filtering). See the documentation of filter; I also recommend reading a DSP textbook.
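A rough sketch of the idea, assuming a 1-D signal processed in chunks (chunk1 and chunk2 are placeholder segments of your data stream; for image-sized samples you would filter along the time dimension instead):
w  = (1:20) / sum(1:20);          % linear weights, newest sample weighted most
b  = fliplr(w);                   % FIR taps: b(1) multiplies the newest sample
zi = zeros(numel(b)-1, 1);        % internal state carried between chunks
% Only zi (19 values) is kept between calls, not the samples themselves.
[y1, zi] = filter(b, 1, chunk1, zi);
[y2, zi] = filter(b, 1, chunk2, zi);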
I'm trying to model basic gravitation for an object of negligible mass around a massive body. I've followed the examples provided in the ODE suite documentation, but the results I'm getting are plainly ridiculous.
Here's the function I'm calling with ode45:
function dy = rigid(t,y)
dy = zeros(4,1);
%Position
xx=y(1);
yy=y(2);
%Radius
r=(xx.^2+yy.^2).^0.5;
%Constants
M=10^30;
G=6.67*10^-11;
%dX/dt
dy(1)=y(3); %vx
dy(3)=-M.*G.*xx.*r.^-3; %ax
%dY/dt
dy(2)=y(4); %vy
dy(4)=-M.*G.*yy.*r.^-3; %ay
And here are the solver lines:
%Init
x_0=1;
y_0=1;
vx_0=0;
vy_0=5;
%ODEs
[T,Y] = ode45(@rigid,[0 1000],[x_0 y_0 vx_0 vy_0]);
%Vectors
x=Y(:,1);
y=Y(:,2);
%Plot
plot(x,y)
xlabel('x');
ylabel('y');
title('y=f(x)');
I get a linear plot at the end. Even with initial speed, the position doesn't change over a period of several steps. The only thing I can think of is that I've misunderstood the way to set up my system of ODEs.
I've been pondering this for a while now, and I'm really short on ideas, having done a few searches on the web.
Summary: There are problems in integrating Hamiltonian systems with normal numerical integrators, and your special initial conditions aggravate this to the point where the numerical solution has no resemblance with the correct one.
There's nothing wrong with your implementation per se. However, the initial conditions you use are not the best. The constants G and M you use are in SI units, which means the coordinates are in m, the speeds are in m/s, and time is in s. These lines
x_0=1;
y_0=1;
vx_0=0;
vy_0=5;
[T,Y] = ode45(@rigid,[0 1000],[x_0 y_0 vx_0 vy_0]);
therefore mean that you are asking for an orbit with a radius of about 1.4 meters and an orbital speed of 5 m/s, and you want this orbit over a period of 17 minutes. Imagine there actually was an object just meters away from a mass of 10^30 kilograms!
So let's try to ask for something more realistic, similar to Earth's orbit, and look at it over 1 year:
x_0=149.513e9;
y_0=0;
vx_0=0;
vy_0=29.78e3;
[T,Y] = ode45(#rigid,[0 31.536e6],[x_0 y_0 vx_0 vy_0]);
The resulting plot shows a closed orbit, which looks as expected.
But there is a second problem here. Let's look at this orbit over a period of 10 years ([0 315.36e6]):
Now we no longer get a closed orbit, but a spiral! That's because the numerical integration proceeds with limited precision, and for this set of equations this leads (physically speaking) to a loss of energy. The precision can be increased using parameters to ode45, but ultimately the problem will persist.
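For illustration (the tolerance values here are arbitrary examples, not a recommendation), the options would be passed like this:
opts   = odeset('RelTol', 1e-10, 'AbsTol', 1e-10);   % defaults are 1e-3 and 1e-6
[T, Y] = ode45(@rigid, [0 315.36e6], [x_0 y_0 vx_0 vy_0], opts);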
Now let's go back to your original parameters and have a look at the result:
This "orbit" is a straight line towards the origin (the sun). Which could be ok, since a straight oscillation is a possible special case of an elliptic orbit. But looking at the coordinates over time:
We see that there is no oscillation. The "planet" falls into the sun and stays there. What's happening here is the same effect as with the larger orbit: Imprecise integration leads to a loss of energy. Moreover, the numerical integration gets stuck: We asked for a period of 1000 s, but the integration does not proceed beyond 1.6e-10 seconds.
As far as I can tell, neither ode45 nor any of the other standard Matlab integrators are adequate to deal with this problem. There are special numerical integration algorithms designed to do so, specifically for Hamiltonian systems, called symplectic integrators. There is a file exchange entry that provides different implementations. Also see this question and answers for more pointers.
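For reference, here is a minimal leapfrog (velocity Verlet) sketch, one of the simplest symplectic schemes; the fixed step size h is chosen only for illustration, and this is not the File Exchange code mentioned above:
G = 6.67e-11;  M = 1e30;
accel = @(p) -G*M * p / norm(p)^3;        % acceleration at position p = [x; y]
h = 3600;                                 % fixed time step in seconds (illustrative)
N = round(31.536e6 / h);                  % roughly one year
p = [149.513e9; 0];                       % initial position
v = [0; 29.78e3];                         % initial velocity
P = zeros(2, N);
for k = 1:N
    v = v + 0.5*h*accel(p);               % half kick
    p = p + h*v;                          % drift
    v = v + 0.5*h*accel(p);               % half kick
    P(:,k) = p;
end
plot(P(1,:), P(2,:)); axis equal          % the orbit stays closed over long times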
I am trying to build a receiver operating characteristic (ROC) curves to evaluate the discriminating ability of my classifier to correctly classify diseased and non-diseased subjects.
I understand that the closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test. My experiments gave me a quite desirable value for the area under the curve (AUC), i.e. 0.86458. However, the ROC curve (in which I included the cut-off points for tracing purposes) seems quite strange, as it gave me straight lines as below:
... and not the curve I expected and normally see in references like this:
Does it have something to do with the number of observations used? (In this case I only have 50 samples.) Or is this just fine as long as the AUC value is high and the 'curve' comes above the 45-degree diagonal of the ROC space? I would be glad if someone could share their thoughts about it. Thank you!
By the way, I used the perfcurve() function in matlab:
% ROC comparison between the proposed approach and the baseline
[X1,Y1,T1,auc1,OPTROCPT1,SUBY5,SUBYNAMES1] = perfcurve(testLabel,predlabel_prop,1);
[X2,Y2,T2,auc2,OPTROCPT2,SUBY2,SUBYNAMES2] = perfcurve(testLabel,predLabel_base,1);
figure;
plot(X1,Y1,'-r*',X2,Y2,'--ko');
legend('proposed approach','baseline','Location','east');
xlabel('False positive rate'); ylabel('True positive rate')
title('ROC comparison of the proposed approach and the baseline')
text(0.6,0.3,{'* - proposed method',strcat('Area Under Curve = ',...
num2str(auc1))},'EdgeColor','r');
text(0.6,0.15,{'o - baseline',strcat('Area Under Curve = ',num2str(auc2))},'EdgeColor','k');
You probably have too little data.
Your curve indicates your data set has 13 negative and 5 positive examples (in your test set?).
Furthermore, all but 4 have exactly the same score (maybe 0)? Or is that your cutoff?
Given this small sample size, I would not accept the hypothesis that your proposed method is better than the baseline, but rather the alternative that the two methods perform equally well: the difference of 0.04 is much too small for this tiny sample size, and the results are virtually identical. Any variation within the cut-off area (the diagonal part) can be much larger than this 0.04... On a different run, with a different test set, the results may be the other way around.
The shape of your curve is just a result of the high explanatory power of your model and the limited number of observations (e.g. take a look at the example here: http://nl.mathworks.com/help/stats/perfcurve.html).
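To see how the staircase shape arises with so few observations, here is a minimal sketch with made-up labels and scores (not your data), roughly matching 5 positives and 13 negatives:
labels = [ones(1,5), zeros(1,13)];                 % 5 positives, 13 negatives
scores = [0.95 0.9 0.8 0.6 0.3, ...                % made-up classifier scores
          0.4 0.35 0.3 0.3 0.3 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.05];
[X, Y, ~, auc] = perfcurve(labels, scores, 1);
plot(X, Y, '-o');
xlabel('False positive rate'); ylabel('True positive rate');
title(sprintf('ROC from %d samples, AUC = %.2f', numel(labels), auc));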
In my project I have huge surfaces of 20,000 points computed by an algorithm. Sometimes this algorithm computes one or more points in a small area incorrectly.
This error cannot be fixed in the algorithm itself; it needs to be detected afterwards.
The error can be seen in the next figure:
As you can see, there is a wrongly computed point that not only breaks the otherwise homogeneous surface, but also ruins the aesthetics of the plot (which is also important in the project).
Sometimes there can be more than one such point, but in general no more than 5 or 6. The error is always in the Z axis, so there is no need to check X and Y.
I have been racking my brain to find a reasonably generic algorithm to detect these points.
I thought that maybe taking patches of the surface, averaging the Z values, and then flagging points that fall outside the variance might work... but I don't think that will work in every case.
Any ideas?
NOTE: I don't want someone to write code for me, just an idea.
PS: relevant code for the above image:
[x,y] = meshgrid([-2:.07:2]);
Z = x.*exp(-x.^2-y.^2);
subplot(1,2,1)
surf(x,y,Z,gradient(Z))
subplot(1,2,2)
Z(35,35)=Z(35,35)+0.3;
surf(x,y,Z,gradient(Z))
The standard trick is to use a Laplacian, looking for the largest outliers. (This is not unlike what Mohsen posted as an answer, but is actually a bit easier.) You could probably even do it with conv2, so it would be pretty efficient.
I could offer a few ways to implement the idea. A simple one is to use my gridfit tool, found on the File Exchange. (Gridfit essentially uses a Laplacian for its smoothing operation.) Fit the surface with all points included, then look for the single point that was perturbed the most by the fit. Exclude it, then rerun the fit, again looking for the largest outlier. (With gridfit, you can use weights to give points a zero weight, a simple way to exclude a point or list of points.) When the largest perturbation that was needed is small enough, you can decide to stop the process. A nice thing is gridfit will also impute new values for the outliers, filling in all of the holes.
A second approach is to use the Laplacian directly, in more of a filtering approach. Here, you simply compute, at each point, the average of its neighbours to the left, right, above, and below. The single value that disagrees most with its computed average is replaced with a new value. Or you can use a weighted average of the new value with the old one. Again, iterate until the process no longer produces anything larger than some tolerance. (This is the basis of an old outlier detection and correction scheme that I recall from the Fortran IMSL libraries, dating back roughly 30 years.)
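A minimal sketch of the filtering variant (the threshold of 6 standard deviations is a placeholder to tune, and conv2 zero-pads, so the borders may need separate handling):
K = [0 -1 0; -1 4 -1; 0 -1 0] / 4;        % point minus the average of its 4 neighbours
L = conv2(Z, K, 'same');                  % discrete Laplacian of the surface
s = std(L(:));
bad = abs(L) > 6*s;                       % flag extreme outliers (threshold is a placeholder)
Zavg = conv2(Z, [0 1 0; 1 0 1; 0 1 0]/4, 'same');  % 4-neighbour average
Z(bad) = Zavg(bad);                       % replace the flagged points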
Since your function seems to vary smoothly, these abrupt changes can be detected by looking at the derivatives. You can:
Take the derivative in one direction
Calculate the mean and standard deviation of the derivative
Flag the points that are further from the mean than some multiple of the standard deviation.
Here is the code
U = diff(Z);                          % difference between consecutive rows (one direction)
V = (U - mean(U(:))) / std(U(:));     % standardize: distance from the mean in std units
surf(x(2:end,:), y(2:end,:), V)       % visualize the standardized derivative
V = [zeros(1,size(V,2)); V];          % pad so V has the same size as Z again
V(abs(V) < 10) = 0;                   % keep only values beyond the threshold
V = sign(V);                          % +1/-1 at the jumps, 0 elsewhere
W = cumsum(V);                        % the jump up and back down cancel out, so
[I,J] = find(W);                      % nonzeros remain only at the bad points
outliers = [I, J];
For your example you get this plot for V, with a peak at around 21.7 while the second-largest peak is at around 1.9528, so a threshold of 10 seems reasonable.
and running the code returns
outliers =
35 35
The cumsum is needed for cases where you have a patch of adjacent points that are incorrect.
I would like to measure the goodness-of-fit to an exponential decay curve. I am using the lsqcurvefit MATLAB function. I have been suggested by someone to do a chi-square test.
I would like to use the MATLAB function chi2gof, but I am not sure how I would tell it that the data is being fitted to an exponential curve.
The chi2gof function tests the null hypothesis that a set of data, say X, is a random sample drawn from some specified distribution (such as the exponential distribution).
From your description in the question, it sounds like you want to see how well your data X fits an exponential decay function. I really must emphasize that this is completely different from testing whether X is a random sample drawn from the exponential distribution. If you use chi2gof for your stated purpose, you'll get meaningless results.
The usual approach for testing the goodness of fit for some data X to some function f is least squares, or some variant on least squares. Further, a least squares approach can be used to generate test statistics that test goodness-of-fit, many of which are distributed according to the chi-square distribution. I believe this is probably what your friend was referring to.
EDIT: I have a few spare minutes so here's something to get you started. DISCLAIMER: I've never worked specifically on this problem, so what follows may not be correct. I'm going to assume you have a set of data x_n, n = 1, ..., N, and the corresponding timestamps for the data, t_n, n = 1, ..., N. Now, the exponential decay function is y_n = y_0 * e^{-b * t_n}. Note that by taking the natural logarithm of both sides we get ln(y_n) = ln(y_0) - b * t_n. Okay, so this suggests using OLS to estimate the linear model ln(x_n) = ln(x_0) - b * t_n + e_n. Nice! Because now we can test goodness-of-fit using the standard R^2 measure, which MATLAB will return in the stats output if you use the regress function to perform the OLS. Hope this helps. Again I emphasize, I came up with this off the top of my head in a couple of minutes, so there may be good reasons why what I've suggested is a bad idea. Also, if you know the initial value of the process (i.e. x_0), then you may want to look into constrained least squares, where you fix the parameter ln(x_0) to its known value.
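A minimal sketch of that log-linear fit (t and x are placeholders for your timestamps and data, both column vectors, and all x are assumed positive so the logarithm is defined):
yLog = log(x);                              % linearize the decay model
Xd   = [ones(numel(t),1), t(:)];            % design matrix: intercept + slope
[beta, ~, ~, ~, stats] = regress(yLog, Xd); % OLS on ln(x_n) = ln(x_0) - b*t_n + e_n
x0_hat = exp(beta(1));                      % estimated initial value x_0
b_hat  = -beta(2);                          % estimated decay rate b
R2     = stats(1);                          % R^2 as the goodness-of-fit measure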