How do I find the slope (rate) in MATLAB?

For example, say I have a scatter plot:
Year = [2001 2002 2003 2004 2005];
Distance = [1.5 1.8 1.9 2.2 2.5];
scatter(Year, Distance)
hold on
pf = polyfit(Year,Distance,1);
f = polyval(pf,Year);
plot(Year,f)
And I can find R by:
[r,p] = corrcoef(Year,Distance)
I want to find the rate at which the distance increases per year, which I think is equivalent to the slope?

You are correct in your interpretation of the slope in this case. If you use polyfit in that fashion, you are finding the slope and intercept of the regression line that best fits your data. In this case, the slope is the rate at which distance increases per year. Without going into much detail, polyfit determines the line of best fit by minimizing the sum of squared errors between the line and your data points. Therefore, this slope is the least-squares estimate of the rate at which distance is increasing per year, given your data.
You can follow Chris A's approach in that you can find point-wise pairs of neighbouring points and compute a slope for each, then do an average, but doing polyfit will find the least squares regression line and in my opinion that's the way to go.
You can obtain the least squares, or best fit slope by extracting the first value of pf as you have already observed. The second value will contain the intercept term of the regression line.
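To make the two approaches concrete, here is a minimal sketch using the data from the question. Centring Year before the fit is my own addition: it leaves the slope unchanged and avoids the conditioning warning that polyfit can give for large x values such as calendar years. With centring, the intercept refers to the mean year rather than year 0.

```matlab
Year = [2001 2002 2003 2004 2005];
Distance = [1.5 1.8 1.9 2.2 2.5];

% Least-squares line; centring x does not change the slope
pf = polyfit(Year - mean(Year), Distance, 1);
slope = pf(1);        % rate of increase per year, about 0.24
intercept = pf(2);    % fitted distance at the mean year (2003)

% Alternative: average of the slopes between neighbouring points
pairwiseSlopes = diff(Distance) ./ diff(Year);
meanPairwise = mean(pairwiseSlopes);   % about 0.25
```

The two numbers differ slightly because the pairwise average only depends on the first and last points when the years are equally spaced, whereas the regression uses all of them.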
Good choice on using corrcoef to determine how good the fit is. However, be careful and take the correlation coefficient with a grain of salt. Some distributions may report a good correlation coefficient, but the actual best fit line will not look very good. A classic example would be the Anscombe quartet. In this example, all distributions reported a correlation coefficient of 0.816, yet the variability in the data was quite different. As a means of self-containment, this is what the data look like as well as the best fit line through each set of points. You can see that the regression line is actually the same for all data sets, yet the point distribution is completely different:
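The shared statistics can be checked numerically. A minimal sketch, assuming the commonly published values for the first two Anscombe data sets (typed in by hand here, so worth double-checking against a reference copy):

```matlab
% Anscombe data sets I and II share the same x values
x  = [10 8 13 9 11 14 6 4 12 7 5];
y1 = [8.04 6.95 7.58 8.81 8.33 9.96 7.24 4.26 10.84 4.82 5.68];
y2 = [9.14 8.14 8.74 8.77 9.26 8.10 6.13 3.10 9.13 7.26 4.74];

% Fit a line to each set and compute the correlation coefficient
p1 = polyfit(x, y1, 1);
p2 = polyfit(x, y2, 1);
R1 = corrcoef(x, y1);  r1 = R1(1,2);
R2 = corrcoef(x, y2);  r2 = R2(1,2);

% Both fits come out close to slope 0.5 and intercept 3,
% and both correlation coefficients are close to 0.816
fprintf('set I : slope %.3f, intercept %.3f, r %.3f\n', p1(1), p1(2), r1);
fprintf('set II: slope %.3f, intercept %.3f, r %.3f\n', p2(1), p2(2), r2);
```

Set I is roughly linear with noise while set II follows a curve, which is why plotting the data is still worthwhile even when r looks fine.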


Finding defined peaks with Clusters in MATLAB

this is my problem:
I have the next data "A", which looks like:
As you can see, I have marked the apparent peaks with red circles. The most clearly defined are 2 and 7; I say they are well defined because their standard deviation is low in comparison with the other peaks (especially the second one).
What I need is a way (any way) to get the values and the standard deviation of n peaks in a numeric array.
I have tried with "clusters", but I got no good results:
First of all, I used the MATLAB "kmeans" function, and I realized that this algorithm doesn't group the peaks as I need. As you can see in the picture above, the cluster in the red circle contains at least 3 or 4 peaks. Also, kmeans needs the number of clusters to be set in advance, and I need to identify it automatically.
I hope that anyone can give me some ideas, or a way to get better results, thanks.
P.S. The data "A" is available at the following link.
https://drive.google.com/file/d/0B4WGV21GqSL5a2EyQ2l0SHZURzA/edit?usp=sharing
The problem is that your axes have very different meaning.
K-means optimizes variance. But variance in X is something entirely different than variance in Y, isn't it? Furthermore, each of these methods will split your data in both X and Y, whereas I assume you want the data to be partitioned on the X axis only.
I suggest the following: consider the Y axis to be a weight, and X axis to be a position.
Then perform weighted density estimation, and look for low density to separate your clusters.
I can't help you with MATLAB. I don't use it.
Mathematically, what you want to do is place a Gaussian at each point, with area Y and center X. Then find minima and maxima on the sum of these Gaussians. See Wikipedia, Kernel Density Estimation for details; except that you want to use the Y axis as weights. You could maybe also use 1/Y as standard deviation, if you don't want to use weights.
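Since the answer above gives no MATLAB code, here is a rough sketch of the weighted density idea in base MATLAB. The data X and Y, the bandwidth h and all variable names are made up for illustration (they are not the data "A" from the question), and the bandwidth in particular would need tuning on real data.

```matlab
% Hypothetical data: positions X with weights Y
X = [1.0 1.2 1.4 5.0 5.1 5.3 9.0 9.4];
Y = [2 5 3 1 6 2 4 4];
h = 0.5;                                   % assumed Gaussian bandwidth

% Sum of Gaussians, each centred at X(k) with area Y(k)
xs = linspace(min(X) - 3*h, max(X) + 3*h, 500);
density = zeros(size(xs));
for k = 1:numel(X)
    density = density + Y(k) * exp(-(xs - X(k)).^2 / (2*h^2)) / (h*sqrt(2*pi));
end

% Interior local minima of the density separate the clusters
d = diff(density);
isMin = d(1:end-1) < 0 & d(2:end) >= 0;
cuts = xs(find(isMin) + 1);

% Label each point by how many cut positions lie to its left
labels = 1 + sum(bsxfun(@gt, X(:), cuts(:).'), 2).';

% Weighted mean and standard deviation of each cluster
nClusters = max(labels);
centres = zeros(1, nClusters);
spreads = zeros(1, nClusters);
for c = 1:nClusters
    w = Y(labels == c);
    xc = X(labels == c);
    centres(c) = sum(w .* xc) / sum(w);
    spreads(c) = sqrt(sum(w .* (xc - centres(c)).^2) / sum(w));
end
```

The number of clusters falls out of the number of minima, so it does not have to be chosen in advance as with kmeans; the bandwidth takes over that role.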

How to measure power spectral density in matlab?

I am trying to measure the PSD of a stochastic process in matlab, but I am not sure how to do it. I have posted the exact same question here, but I thought I might have more luck here.
The stochastic process describes wind speed, and is represented by a vector of real numbers. Each entry corresponds to the wind speed in a point in space, measured in m/s. The points are 0.0005 m apart. How do I measure and plot the PSD? Let's call the vector V. My first idea was to use
[p, w] = pwelch(V);
loglog(w,p);
But is this correct? The thing is, that I'm given an analytical expression, which the PSD should (in theory) match. By plotting it with these two lines of code, it looks all wrong. Specifically it looks as though it could need a translation and a scaling. Other than that, the shapes of the two are similar.
UPDATE:
The image above actually doesn't depict the PSD obtained by using pwelch on a single vector, but rather the mean of the PSD of 200 vectors, since these vectors stem from numerical simulations. As suggested, I have tried scaling by 2*pi/0.0005. I saw that you can actually give this information to pwelch. So I tried using the code
[p, w] = pwelch(V,[],[],[],2*pi/0.0005);
loglog(w,p);
instead. As seen below, it looks much nicer. It is, however, still not perfect. Is that just something I should expect? Taking the square root is not the answer, by the way. But thanks for the suggestion. For one thing, it should follow Kolmogorov's -5/3 law, which it does now (it follows the green line, which has slope -5/3). The function I'm trying to match it with is the Shkarofsky spectral density function, which is the one-dimensional Fourier transform of the Shkarofsky correlation function. Is it not possible to mark up math here on the site?
UPDATE 2:
I have tried using [p, w] = pwelch(V,[],[],[],1/0.0005); as was suggested. But as you can see, it still doesn't quite match up. It's hard for me to explain exactly what I'm looking for. But what I would like (or, what I expected) is that the dip of the computed and the analytical PSD happens at the same place, and falls off at the same rate. The data comes from simulations of turbulence. The analytical expression has been fitted to actual measurements of turbulence, wherein this dip is present as well. I'm no expert at all, but as far as I know the dip happens at the small length scales, since the energy is dissipated due to the viscosity of the air.
What about using the standard equation for obtaining a PSD? I would do it this way:
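Sxx = (fft(x).*conj(fft(x)))*dt/numel(x);   % two-sided PSD of the sampled signal x with spacing dt
or
Sxx = fftshift(abs(fft(x)).^2)*dt/numel(x); % same values, with zero frequency moved to the centre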
Then, if you really need to, you can apply a window function, such as
Hanning
Hamming
Welch
which will only smooth your PSD estimate somewhat.
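As a sanity check on the scaling of an FFT-based estimate, here is a small sketch. It assumes the normalisation dt/N for a two-sided PSD, which is one common convention; other texts scale differently. The signal is placeholder noise rather than real wind data, and N is a made-up record length. With this normalisation the spectrum integrates to the mean square of the signal.

```matlab
dt = 0.0005;               % sample spacing from the question (m)
N  = 4096;                 % hypothetical record length
x  = randn(1, N);          % placeholder signal, stands in for V

% Two-sided periodogram, assuming the dt/N normalisation
Sxx = abs(fft(x)).^2 * dt / N;
df  = 1 / (N*dt);          % spacing of the frequency axis (1/m)
f   = (0:N-1) * df;        % frequency axis before any fftshift

% Fold into a one-sided spectrum: double everything except
% the zero-frequency and Nyquist bins
half = 1:(N/2 + 1);
Sone = Sxx(half);
Sone(2:end-1) = 2 * Sone(2:end-1);

% Parseval check: both spectra integrate to the mean square of x
powerTwoSided = sum(Sxx)  * df;
powerOneSided = sum(Sone) * df;
powerSignal   = mean(x.^2);
```

If an estimate is off by a constant factor from an analytical curve, comparing these integrals is a quick way to find out whether the mismatch comes from the normalisation.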
Presumably you need to rescale the frequency (wavenumber) to units of 1/m.
The frequency units from pwelch should be rescaled, since as the documentation explains:
W is the vector of normalized frequencies at which the PSD is
estimated. W has units of rad/sample.
Off the cuff my guess is that the scaling factor is
scale = 1/0.0005/(2*pi);
or 318.3 (m^-1).
As for the intensity, it looks like taking a square root might help. Perhaps your equation reports an intensity, not PSD?
Edit
As you point out, since the analytical and computed PSD have nearly identical slopes they appear to obey similar power laws up to 800 m^-1. I am not sure to what degree you require exponents or offsets to match to be satisfied with a specific model, and I am not familiar with this particular theory.
As for the apparent inconsistency at high wavenumbers, I would point out that you are entering the domain of very small numbers and therefore (1) floating point issues and (2) noise are probably lurking. The very nice looking dip in the computed PSD on the other hand appears very real but I have no explanation for it (maybe your noise is not white?).
You may want to look at this submission at matlab central as it may be useful.
Edit #2
After inspecting documentation of pwelch, it appears that you should pass 1/0.0005 (the sampling rate) and not 2*pi/0.0005. This should not affect the slope but will affect the intercept.
The dip in PSD in your simulation results looks similar to aliasing artifacts
that I have seen in my data when the original data were interpolated with a
low-order method. To make this clearer - say my original data was spaced at
0.002m, but in the course of cleaning up missing data, trying to save space, whatever,
I linearly interpolated those data onto a 0.005m spacing. The frequency response
of linear interpolation is not well-behaved, and will introduce peaks and valleys
at the high wavenumber end of your spectrum.
There are different conventions for spectral estimates - whether the wavenumber
units are 1/m, or radians/m. Single-sided spectra or double-sided spectra.
help pwelch
shows that the default settings return a one-sided spectrum, i.e. the bin for some
frequency ω will include the power density for both +ω and -ω.
You should double check that the idealized spectrum to which you are comparing
is also a one-sided spectrum. Otherwise, you'll need to halve the values of your
one-sided spectrum to get values representative of the +ω side of a
two-sided spectrum.
I agree with Try Hard that it is the cyclic frequency (generally Hz, or in this case 1/m)
which should be specified to pwelch. That said, the returned frequency vector
from pwelch is also in those units. Analytical
spectral formulae are usually written in terms of angular frequency, so you'll
want to be sure that you evaluate it in terms of radians/m, but scale back to 1/m
for plotting.
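Putting the unit conventions together, a sketch of how this might look. It needs the Signal Processing Toolbox for pwelch, V here is placeholder noise standing in for the simulated wind speeds, and the names kAng, pAng and pTwoSided are my own.

```matlab
dx = 0.0005;                 % spacing between points (m)
fs = 1/dx;                   % "sampling rate" in samples per metre
V  = randn(1, 4096);         % placeholder for the wind speed vector

% One-sided PSD; k is in cycles/m, p is density per (cycle/m)
[p, k] = pwelch(V, [], [], [], fs);

% Angular wavenumber convention: rad/m, density per (rad/m)
kAng = 2*pi*k;
pAng = p/(2*pi);

% Values representative of the positive side of a two-sided spectrum
pTwoSided = p/2;

% Skip the zero-wavenumber bin on a log axis
loglog(k(2:end), p(2:end))
```

Rescaling the axis and the density together keeps the area under the curve the same, so whichever convention the analytical expression uses, both the x values and the y values have to be converted.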

Detect incorrect points in a homogeneous surface

In my project I have huge surfaces of 20,000 points computed by an algorithm. This algorithm sometimes has an error and computes 1 or more points in a small area incorrectly.
This error can not be solved in the algorithm, but needs to be detected afterwards.
The error can be seen in the next figure:
As you can see, there is a wrongly computed point that not only breaks the otherwise homogeneous surface, but also destroys the aesthetics of the plot (which is also important in the project).
Sometimes it can be more than one point, but in general no more than 5 or 6. The error is always in the Z axis, so there is no need to check X and Y.
I have been racking my brain to find a reasonably generic algorithm to detect these points.
I thought that maybe I could take patches of the surface, average the Z values, and then detect the points that fall outside the variance... but I don't think it will always work.
Any ideas?
NOTE: I don't want someone to write code for me, just an idea.
P.S. Relevant code for the above image:
[x,y] = meshgrid([-2:.07:2]);
Z = x.*exp(-x.^2-y.^2);
subplot(1,2,1)
surf(x,y,Z,gradient(Z))
subplot(1,2,2)
Z(35,35)=Z(35,35)+0.3;
surf(x,y,Z,gradient(Z))
The standard trick is to use a Laplacian, looking for the largest outliers. (This is not unlike what Mohsen posed for an answer, but is actually a bit easier.) You could even probably do it with conv2, so it would be pretty efficient.
I could offer a few ways to implement the idea. A simple one is to use my gridfit tool, found on the File Exchange. (Gridfit essentially uses a Laplacian for its smoothing operation.) Fit the surface with all points included, then look for the single point that was perturbed the most by the fit. Exclude it, then rerun the fit, again looking for the largest outlier. (With gridfit, you can use weights to give points a zero weight, a simple way to exclude a point or list of points.) When the largest perturbation that was needed is small enough, you can decide to stop the process. A nice thing is gridfit will also impute new values for the outliers, filling in all of the holes.
A second approach is to use the Laplacian directly, in more of a filtering approach. Here, you simply compute a value at each point that is the average of its neighbors to the left, right, above, and below. The single value that disagrees most with its computed average is replaced with a new value. Or, you can use a weighted average of the new value with the old one there. Again, iterate until the process does not generate anything larger than some tolerance. (This is the basis of an old outlier detection and correction scheme that I recall from the Fortran IMSL libraries, but probably dates back to roughly 30 years ago.)
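A single pass of that neighbour-average idea can be sketched with conv2 on the example surface from the question. This only looks at interior points and only finds the single worst one; the iteration and the stopping tolerance described above are left out.

```matlab
% Example surface from the question, with one corrupted point
[x, y] = meshgrid(-2:.07:2);
Z = x .* exp(-x.^2 - y.^2);
Z(35,35) = Z(35,35) + 0.3;

% Average of the four neighbours (left, right, above, below)
kernel = [0 1 0; 1 0 1; 0 1 0] / 4;
neighbourMean = conv2(Z, kernel, 'valid');     % interior points only

% Disagreement between each interior point and its neighbour average
resid = abs(Z(2:end-1, 2:end-1) - neighbourMean);

% Location of the largest disagreement, shifted back to indices of Z
[~, idx] = max(resid(:));
[i, j] = ind2sub(size(resid), idx);
worst = [i j] + 1;                             % gives 35 35 here
```

The corrupted point stands out because its own residual is the full perturbation, while each of its neighbours only sees a quarter of it.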
Since your function seems to vary smoothly, these abrupt changes can be detected by looking at the derivatives. You can:
Take the derivative in one direction
Calculate mean and standard deviation of derivative
Find the points by looking for points that are further from mean by certain multiple of standard deviation.
Here is the code
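U=diff(Z);                      % first difference along the first dimension
V=(U-mean(U(:)))/std(U(:));     % deviation from the mean in units of standard deviation
surf(x(2:end,:),y(2:end,:),V)   % plot the standardised differences
V=[zeros(1,size(V,2)); V];      % pad a row of zeros so V has the same size as Z
V(abs(V)<10)=0;                 % keep only differences beyond 10 standard deviations
V=sign(V);                      % mark the remaining jumps with +1 or -1
W=cumsum(V);                    % nonzero from where a jump starts until it is undone
[I,J]=find(W);                  % row and column indices of the flagged points
outliers = [I, J];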
For your example you get this plot for V with a peak at around 21.7 while second peak is at around 1.9528, so maybe a threshold of 10 is ok.
and running the code returns
outliers =
35 35
The need for cumsum is for the cases that you have a patch of points next to each other that are incorrect.

Polynomial data fitting in a special way in MATLAB

I have some data let's say the following vector:
[1.2 2.13 3.45 4.59 4.79]
And I want to get a polynomial function, say f to fit this data. Thus, I want to go with something like polyfit. However, what polyfit does is minimizing the sum of least square errors. But, what I want is to have
f(1)=1.2
f(2)=2.13
f(3)=3.45
f(4)=4.59
f(5)=4.79
That is to say, I want to manipulate the fitting algorithm so that it will give me the exact points that I already gave as well as some fitted values where exact values are not given.
How can I do that?
I think everyone is missing the point. You said that "That is to say, I want to manipulate the fitting algorithm so that I will give me the exact points as well as some fitted values where exact fits are not present. How can I do that?"
To me, this means you wish an exact (interpolatory) fit for a listed set, and for some other points, you want to do a least squares fit.
You COULD do that using LSQLIN, by setting a set of equality constraints on the points to be fit exactly, and then allowing the rest of the points to be fit in a least squares sense.
The problem is, this will require a high order polynomial. To be able to fit 5 points exactly, plus some others, the order of the polynomial will be quite a bit higher. And high order polynomials, especially those with constrained points, will do nasty things. But feel free to do what you will, just as long as you also expect a poor result.
Edit: I should add that a better choice is to use a least squares spline, which is something you CAN constrain to pass through a given set of points, while fitting other points in a least squares sense, and still not do something wild and crazy as a result.
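For a small problem, the equality-constrained fit can also be written out directly in base MATLAB instead of calling LSQLIN, by solving the KKT system of the constrained least-squares problem. This is only a sketch: which points are held exact (the first and last) and the degree (2) are arbitrary choices of mine for illustration.

```matlab
x = 1:5;
y = [1.2 2.13 3.45 4.59 4.79];
exact = [1 5];                 % indices the polynomial must pass through
rest  = [2 3 4];               % indices fitted in the least-squares sense
deg   = 2;                     % assumed polynomial degree

% Vandermonde-style matrix with powers in polyval order (highest first)
vander = @(t) bsxfun(@power, t(:), deg:-1:0);
A = vander(x(rest));   b = y(rest).';
C = vander(x(exact));  d = y(exact).';

% KKT system for: minimise ||A*p - b||^2 subject to C*p = d
K = [2*(A.'*A), C.'; C, zeros(numel(exact))];
sol = K \ [2*(A.'*b); d];
p = sol(1:deg+1).';            % polynomial coefficients for polyval

% The constrained points are reproduced, the others only approximately
fitAtExact = polyval(p, x(exact));
fitAtRest  = polyval(p, x(rest));
```

As noted above, pushing the degree up to accommodate more exact points tends to make the polynomial behave badly between them, so this is more of a demonstration than a recommendation.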
Polyfit does what you want. An N-1 degree polynomial can fit N points exactly, thus, when it minimizes the sum of squared error, it gets 0 (which is what you want).
y=[1.2 2.13 3.45 4.59 4.79];
x=[1:5];
coeffs = polyfit(x,y,4);
Will get you a polynomial that goes through all of your points.
What you ask is known as Lagrange interpolation. There is an implementation available on the MATLAB File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/899-lagrange-polynomial-interpolation
However, you should note that least squares polynomial fitting is generally preferred to Lagrange interpolation, since the data you have will in principle contain noise, and Lagrange interpolation will fit the noise as well as the data. So if you know that your data actually represent a polynomial of degree M and you have N data points, where N>>M, then Lagrange interpolation will give you a polynomial of degree N-1.
You have options.
Use polyfit, just give it enough leeway to perform an exact fit. That is:
values = [1.2 2.13 3.45 4.59 4.79];
p = polyfit(1:length(values), values, length(values)-1);
Now
polyval(p,2) %returns 2.13
Use interpolation / extrapolation
values = [1.2 2.13 3.45 4.59 4.79];
xInterp = 0:0.1:6;
valueInterp = interp1(1:length(values), values, xInterp ,'linear','extrap');
Interpolation provides a lot of options for smoothing, extrapolation etc. For example, try:
valueInterp = interp1(1:length(values), values, xInterp ,'spline','extrap');

How to calculate residuals for two curves (matrices) of different size?

I've got a theoretical curve which was calculated numerically and an experimental curve (or rather, an array of experimental points). I need to calculate the residuals between these two curves to check the accuracy of the modeling with the least squares method. These matrices (curves) are of different size. Is there any function in MATLAB that calculates the residuals for two matrices of different size?
I thought I'll just elaborate a bit on what Aabaz said in case there are others who might find this useful (Although Aabaz's explanation is probably clear enough for people who have an understanding of the necessary math etc.).
First, I'm assuming you have a 2D plot but it shouldn't be difficult to generalize to ND case.
Basically, for each point in your experimental data (xi, yi), use your "theoretical curve" to estimate yi' for the value xi. This is probably what Aabaz is referring to by making grid step size the same so that you interpolate the points exactly at the x coordinate values xi of your experimental data using the formula for your curve(s).
Next, to measure whether the fit is good, you could for example measure the sum of squared differences using:
error = sum over i of (yi' - yi)^2, where i ranges over all points in your experimental data
Of course, error metrics other than least squares could be used to estimate how well the data fit your model (i.e. your curve), but for most applications least squares is by far the most common.
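In MATLAB, the interpolation step can be done with interp1. A minimal sketch with made-up numbers: the "theoretical" curve here is just a straight line on a fine grid and the experimental points are invented, so only the pattern matters.

```matlab
% Hypothetical theoretical curve on a fine grid
xTheory = linspace(0, 10, 201);
yTheory = 2*xTheory + 1;

% Hypothetical experimental points at different x positions
xExp = [0.5 2.2 4.7 6.1 9.3];
yExp = [2.1 5.3 10.5 13.1 19.8];

% Evaluate the theoretical curve at the experimental x values
yModel = interp1(xTheory, yTheory, xExp, 'linear');

% Residuals and sum of squared errors
residuals = yExp - yModel;
sse = sum(residuals.^2);       % 0.08 for these numbers
```

The experimental x values should lie inside the range of the theoretical grid; otherwise interp1 returns NaN for them unless extrapolation is requested.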
Hope this helps.