Fit a line pattern on curve with unknown number of points - linear-regression

I've got a sampled curve which theoretically ends with a decreasing exponential. The end of the curve falls into noise. The sample points are given in log scale.
What I want to do is find and fit the linear part of the log curve to retrieve the exponential factor. The trick is that I know neither the starting point nor the ending point of the linear part of the log curve.
The strategy I'm using is to fit a line starting at each point, using at least (say) 20 points and going up to the end of the curve. From all these regressions, I keep the one with the best coefficient of determination.
I ran several tests and found that the RMS error systematically increases with the number of points as the curve becomes more and more noisy, so the extracted factor is always computed on the minimum number of samples (20 in this example).
My question is: is there a more efficient method to compute this factor? Does increasing the minimum number of samples improve the accuracy of the fit?
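For reference, here is a minimal MATLAB sketch of this sliding-window strategy (assuming the log-scaled samples are stored in vectors x and y; the variable names and the 20-point minimum are purely illustrative):
% Fit a line from every starting index to the end of the curve and keep the
% fit with the highest coefficient of determination (R^2).
minPts = 20;                         % minimum number of samples per fit
n = numel(y);
bestR2 = -Inf;
for k = 1:(n - minPts + 1)
    xs = x(k:end);  ys = y(k:end);
    p  = polyfit(xs, ys, 1);         % linear fit on the log-scaled data
    yf = polyval(p, xs);
    r2 = 1 - sum((ys - yf).^2) / sum((ys - mean(ys)).^2);
    if r2 > bestR2
        bestR2   = r2;
        slope    = p(1);             % exponential factor = slope of the log curve
        startIdx = k;                % starting sample of the retained fit
    end
end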
Here is a sample curve I want to fit:
The computed slope as a function of the position in the curve:
And the corresponding RMS error:


How to find relationship between two distribution curves

I have some floating-point data (represented by the blue curve); when I apply some lossy compression, the yellow curve is obtained (mean, standard deviation).
My aim is to minimize these losses after the compression process. Hence, I would like to find an equation/curve/filter such that:
the yellow curve times the "function" is nearly equal to the blue Gaussian curve,
or
blue curve = Function(green curve)
Thanks for your help!
The best way is to do a Kolmogorov–Smirnov test. It measures the maximum difference between the empirical cumulative distributions of the two input vectors.
You can start to play with this test using the MATLAB implementation, [h p k]=kstest2(dist1, dist2). You should be looking at the k value, which is the test statistic: it denotes the maximum difference between the two empirical cumulative distributions. If you want to visualise how this difference is calculated, run
cdfplot(dist1)
hold on
cdfplot(dist2)
hold off
you will see the two cumulative distributions in the same plot. The maximum difference between them is k. The more closely related the two distributions are, the smaller the gap and the closer k is to 0; for highly different distributions, k moves towards 1.
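Here is a self-contained illustration (the two vectors below are just synthetic stand-ins for your original and compressed data):
dist1 = randn(1, 1000);               % stand-in for the original data
dist2 = 1.2*randn(1, 1000) + 0.1;     % stand-in for the compressed (distorted) data
[h, p, k] = kstest2(dist1, dist2);    % k = maximum gap between the empirical CDFs
cdfplot(dist1)
hold on
cdfplot(dist2)
hold off
legend('dist1', 'dist2')
fprintf('KS statistic k = %.3f, p-value = %.3g\n', k, p)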
Hope it helps.
If you have found any more interesting methods, kindly let me know.

What is a spectrogram and how do I set its parameters?

I am trying to plot the spectrogram of my time-domain signal, given:
N = 5000;
phi = (rand(1,N)-0.5)*pi;        % N uniform random angles in (-pi/2, pi/2)
a = tan(0.5.*phi);
i = 2.*a./(1-a.^2);              % tangent half-angle identity, i.e. i = tan(phi)
plot(i);
spectrogram(i,100,1,100,1e3);
The problem is that I don't understand the parameters and what values should be given. The values I'm using are taken from MATLAB's online documentation of spectrogram. I am new to MATLAB and I'm just not getting the idea. Any help will be greatly appreciated!
Before we actually go into what that MATLAB command does, you probably want to know what a spectrogram is. That way you'll get a better sense of what each parameter does.
A spectrogram is a visual representation of the Short-Time Fourier Transform. Think of it as taking chunks of an input signal, each of a specified width, and applying a local Fourier Transform to each chunk. Each chunk therefore has an associated frequency distribution: for each chunk centred at a specific time point in your signal, you get a collection of frequency components. The collection of all of these per-chunk frequency decompositions, plotted together, is essentially what a spectrogram is.
The spectrogram is a 2D heat map where the horizontal axis represents the time of the signal and the vertical axis represents frequency. For a particular time point and a particular frequency, the lower the magnitude of the frequency component, the darker the colour; the higher the magnitude, the lighter the colour.
Here's one perfect example of a spectrogram:
Source: Wikipedia
Therefore, for each time point, we see a distribution of frequency components. Think of each column as the frequency decomposition of a chunk centred at this time point. For each column, we see a varying spectrum of colours. The darker the colour is, the lower the magnitude component at that frequency is and vice-versa.
So, now that you're armed with that, let's go into how MATLAB works in terms of the function and its parameters. The way you are calling spectrogram conforms to this version of the function:
spectrogram(x,window,noverlap,nfft,fs)
Let's go through each parameter one by one so you can get a greater understanding of what each does:
x - This is the input time-domain signal you wish to find the spectrogram of. It can't get much simpler than that. In your case, the signal you want to find the spectrogram of is defined in the following code:
N=5000;
phi = (rand(1,N)-0.5)*pi;
a = tan((0.5.*phi));
i = 2.*a./(1-a.^2);
Here, i is the signal you want to find the spectrogram of.
window - If you recall, we decompose the signal into chunks, and each chunk has a specified width. window defines the width of each chunk in terms of samples. As this is a discrete-time signal, you know that this signal was sampled with a particular sampling frequency and sampling period. You can determine how large the window is in terms of samples by:
window_samples = window_time/Ts
Ts is the sampling period of your signal. Setting the window size is actually very empirical and requires a lot of experimentation. Basically, the larger the window size, the better the frequency resolution, as you're capturing more of the frequencies, but the time localization is poor. Similarly, the smaller the window size, the better the localization in time, but the frequency decomposition is not as good. I don't have any suggestions on what the optimal size is... which is why wavelets are often preferred for time-frequency decomposition: each chunk gets decomposed into smaller chunks of dynamic width, giving a mixture of good time and good frequency localization.
noverlap - Another way to ensure good frequency localization is to have the chunks overlap. A proper spectrogram ensures that consecutive chunks share a certain number of samples, and noverlap defines how many samples are shared between adjacent windows. The default is 50% of the width of each chunk.
nfft - You are essentially taking the FFT of each chunk. nfft tells you how many FFT points are computed per chunk. The default is the larger of 256 and the next power of two greater than the length of each chunk. nfft also determines how fine-grained the frequency resolution will be: a higher number of FFT points gives a higher frequency resolution, showing finer detail along the frequency axis of the spectrogram.
fs - The sampling frequency of your signal. The default is 1 Hz, but you can override this with whatever sampling frequency your signal was actually sampled at.
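To make this concrete, here is one hedged example of how these parameters might be filled in for the signal above, assuming it was sampled at 1 kHz; the window length, overlap and FFT size are illustrative starting points rather than recommended values:
fs = 1e3;          % sampling frequency of the signal (Hz)
window = 128;      % 128-sample chunks (0.128 s per chunk at 1 kHz)
noverlap = 64;     % 50% overlap between consecutive chunks
nfft = 256;        % number of FFT points computed per chunk
spectrogram(i, window, noverlap, nfft, fs, 'yaxis');  % frequency on the vertical axis
Doubling window improves frequency resolution at the cost of time localization, as described above.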
Therefore, what you should probably take out of this is that I can't really tell you how to set the parameters. It all depends on what signal you have, but hopefully the above explanation will give you a better idea of how to set the parameters.
Good luck!

Why is the perfcurve() MATLAB function giving me straight lines and not a normal curve as expected?

I am trying to build receiver operating characteristic (ROC) curves to evaluate the discriminating ability of my classifier to correctly classify diseased and non-diseased subjects.
I understand that the closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test. My experiments gave me quite a desirable value of the area under the curve (AUC), i.e. 0.86458. However, the ROC curve (in which I included the cut-off points for tracing purposes) seems quite strange, as it gave me straight lines, as below:
... and not the curve I expected, like the ones I normally see in references such as this:
Does it have something to do with the number of observations used (in this case I only have 50 samples)? Or is this just fine as long as the AUC value is high and the 'curve' stays above the 45-degree diagonal of the ROC space? I would be glad if someone could share their thoughts about it. Thank you!
By the way, I used the perfcurve() function in matlab:
% ROC comparison between the proposed approach and the baseline
[X1,Y1,T1,auc1,OPTROCPT1,SUBY5,SUBYNAMES1] = perfcurve(testLabel,predlabel_prop,1);
[X2,Y2,T2,auc2,OPTROCPT2,SUBY2,SUBYNAMES2] = perfcurve(testLabel,predLabel_base,1);
figure;
plot(X1,Y1,'-r*',X2,Y2,'--ko');
legend('proposed approach','baseline','Location','east');
xlabel('False positive rate'); ylabel('True positive rate')
title('ROC comparison of the proposed approach and the baseline')
text(0.6,0.3,{'* - proposed method',strcat('Area Under Curve = ',...
num2str(auc1))},'EdgeColor','r');
text(0.6,0.15,{'o - baseline',strcat('Area Under Curve = ',num2str(auc2))},'EdgeColor','k');
You probably have too little data.
Your curve indicates your data set has 13 negative and 5 positive examples (in your test set?).
Furthermore, all but 4 have exactly the same score (maybe 0)? Or is that your cutoff?
Given this small sample size, I would not accept the hypothesis that your proposed method is better than the baseline; I would instead accept the alternative that the two methods perform equally well. The difference of 0.04 is much too small for this tiny sample size; the results are virtually identical. Any variation within the cut-off area (the diagonal part) can be much larger than this 0.04. On a different run, with a different test set, the results may well be the other way around.
The shape of your curve is just a result of the high explanatory power of your model and the limited number of observations (e.g. take a look at the example at http://nl.mathworks.com/help/stats/perfcurve.html).
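To see that the step-like shape is simply a small-sample effect, here is a hedged sketch (entirely synthetic labels and scores, not your data) that produces a similarly angular ROC curve with perfcurve:
% Few observations with many tied scores produce a step-like ROC curve
labels = [ones(1,5), zeros(1,13)];                              % 5 positive, 13 negative examples
scores = [0.9 0.8 0.7 0.5 0.5, 0.5*ones(1,10), 0.3 0.2 0.1];    % mostly tied scores
[X, Y, ~, auc] = perfcurve(labels, scores, 1);
plot(X, Y, '-o'); xlabel('False positive rate'); ylabel('True positive rate');
title(sprintf('ROC from %d samples, AUC = %.3f', numel(labels), auc));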

How do I find the slope (rate) in MATLAB?

For example, say I have a scatter plot:
Year = [2001 2002 2003 2004 2005];
Distance = [1.5 1.8 1.9 2.2 2.5];
scatter(Year, Distance)
hold on
pf = polyfit(Year,Distance,1);
f = polyval(pf,Year);
plot(Year,f)
And I can find R by:
[r,p] = corrcoef(Year,Distance)
I want to find the rate at which the distance increases per year, which I think is equivalent to the slope?
You are correct in your interpretation of the slope in this case. If you use polyfit in that fashion, you are finding the slope and intercept of the regression line that best fits that distribution. In this case, the slope would be the rate at which distance increases per year. Without going into much detail, polyfit will determine the line of best fit that will minimize the sum of squared errors between the best fit line and your data points. Therefore, this slope will give you the best rate at which distance is increasing per year, given your point distribution.
You could follow Chris A's approach, computing a slope for each pair of neighbouring points and then averaging them, but polyfit finds the least-squares regression line and in my opinion that's the way to go.
You can obtain the least-squares (best-fit) slope by extracting the first value of pf, as you have already observed. The second value contains the intercept term of the regression line.
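In code, that is simply:
slope = pf(1);        % rate at which Distance increases per Year
intercept = pf(2);    % intercept of the regression line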
Good choice on using corrcoef to determine how good the fit is. However, be careful and take the correlation coefficient with a grain of salt. Some distributions may report a good correlation coefficient, yet the actual best-fit line will not look very good. A classic example is Anscombe's quartet. In this example, all four data sets have a correlation coefficient of 0.816, yet the variability in the data is quite different. As a means of self-containment, this is what the data look like, along with the best-fit line through each set of points. You can see that the regression line is actually the same for all data sets, yet the point distributions are completely different:

How to calculate residuals for two curves (matrices) of different size?

I've got a theoretical curve which was calculated numerically and an experimental curve (or rather, an array of experimental points). I need to calculate the residuals between these two curves to check the accuracy of the modelling using the least-squares sum method. These matrices (curves) are of different sizes. Is there any function in MATLAB that calculates residuals for two matrices of different sizes?
I thought I'd just elaborate a bit on what Aabaz said in case others might find this useful (although Aabaz's explanation is probably clear enough for people who have an understanding of the necessary math).
First, I'm assuming you have a 2D plot, but it shouldn't be difficult to generalize to the N-D case.
Basically, for each point (xi, yi) in your experimental data, use your "theoretical curve" to estimate yi' at the value xi. This is probably what Aabaz is referring to by making the grid step size the same, so that you evaluate (or interpolate) the theoretical curve exactly at the x-coordinate values xi of your experimental data.
Next, to measure whether the fit is good, you could, for example, compute the sum of squared differences:
error = sum( (yi' - yi)^2 )   (where i ranges over all points in your experimental data)
Of course, error metrics other than least squares could be used to estimate how well the data fit your model (i.e. your curve), but for most applications least squares is by far the most common.
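A minimal MATLAB sketch of this idea (assuming the theoretical curve is tabulated as vectors xTheo, yTheo and the experimental points as xExp, yExp; the names are placeholders):
% Evaluate the theoretical curve at the experimental x-values, then compute
% the least-squares error between the two curves.
yTheoAtExp = interp1(xTheo, yTheo, xExp, 'linear');   % or 'spline', etc.
residuals  = yExp - yTheoAtExp;
sse        = sum(residuals.^2);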
Hope this helps.