I have a list of data that I am trying to fit to a polynomial and I am trying to plot the 95% confidence bands for the parameters as well (in Matlab).
If my data are x and y
f=fit(x,y,'poly2')
plot(f,x,y)
ci=confint(f,0.95);
a_ci=ci(1,:);
b_ci=ci(2,:);
I do not know how to proceed after that to get the minimum and maximum band around my data. Does anyone know how to do that?
I can see that you have the Curve Fitting Toolbox installed, which is good, because you need it for the following code to work.
Basic fit of example data
Let's define some example data and a possible fit function. (I could also have used poly2 here, but I wanted to keep it a bit more general.)
xdata = (0:0.1:1)'; % column vector!
noise = 0.1*randn(size(xdata));
ydata = xdata.^2 + noise;
f = fittype('a*x.^2 + b');
fit1 = fit(xdata, ydata, f, 'StartPoint', [1,1])
plot(fit1, xdata, ydata)
Side note: plot() is not our usual plot function, but a method of the cfit-object fit1.
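A second side note: a cfit object can also be evaluated directly at arbitrary x-values, which is often handy:
y_at_half = fit1(0.5) % evaluate the fitted model at x = 0.5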
Confidence intervals of the fitted parameters
Our fit uses the data to determine the coefficients a, b of the underlying model f(x) = a*x^2 + b. You already did this, but for completeness here is how you can read out the uncertainty of the coefficients for any confidence interval. The coefficients are ordered alphabetically, which is why I can use ci(1,:) for a, and so on.
names = coeffnames(fit1) % check the coefficient order!
ci = confint(fit1, 0.95); % 2 sigma interval
a_ci = ci(1,:)
b_ci = ci(2,:)
By default, Matlab uses 2σ (0.95) confidence intervals. Some people (physicists) prefer to quote the 1σ (0.68) intervals.
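If you prefer the 1σ numbers, just pass a different confidence level:
ci68 = confint(fit1, 0.68); % 1 sigma interval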
Confidence and Prediction Bands
It's a good habit to plot confidence bands or prediction bands around the data – especially when the coefficients are correlated! But you should take a moment to think about which one of the two you want to plot:
Prediction band: If I take a new measurement value, where would I expect it to lie? In Matlab terms, this is called the “observation band”.
Confidence band: Where do I expect the true value to lie? In Matlab terms, this is called the “functional band”.
As with the coefficients' confidence intervals, Matlab uses 2σ bands by default, and the physicists among us switch this to 1σ intervals. By its nature, the prediction band is wider, because it combines the error of the model (the confidence band!) with the error of the measurement.
There is another distinction to make, one that I don't fully understand. Both Matlab and Wikipedia make that distinction:
Pointwise: How big is the prediction/confidence band for a single measurement/true value? In virtually all cases I can think of, this is what you would want to ask as a physicist.
Simultaneous: How big do you have to make the prediction/confidence band if you want a set of all new measurements/all prediction points to lie within the band with a given confidence?
In my personal opinion, the “simultaneous band” is not a band! For a measurement with n points, it should be n individual error bars!
The prediction/confidence distinction and the pointwise/simultaneous distinction give you a total of four options for "the" band around the plot. Matlab makes the 2σ pointwise prediction band easily accessible, but what you seem to be interested in is the 2σ pointwise confidence band. It is a bit more cumbersome to plot, because you have to specify dummy data over which to evaluate the band:
x_dummy = linspace(min(xdata), max(xdata), 100);
figure(1); clf(1);
hold all
plot(xdata,ydata,'.')
plot(fit1) % by default, evaluates the fit over the current XLim
% use "functional" (confidence!) band; use "simultaneous"=off
conf1 = predint(fit1,x_dummy,0.95,'functional','off');
plot(x_dummy, conf1, 'r--')
hold off
Note that the confidence band at x=0 equals the confidence interval of the fit-coefficient b!
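For comparison, here is a quick sketch of the pointwise prediction ("observation") band on the same axes, plus a check of the note above; it reuses x_dummy and fit1 from the code above:
hold on
pred1 = predint(fit1, x_dummy, 0.95, 'observation', 'off');
plot(x_dummy, pred1, 'b:') % wider than the confidence band, as expected
hold off
% the functional band evaluated at x = 0 reproduces b's interval:
predint(fit1, 0, 0.95, 'functional', 'off')
confint(fit1, 0.95) % compare with the second row (coefficient b)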
Extrapolation
If you want to extrapolate to x-values that are not covered by the range of your data, you can evaluate the fit and the prediction/confidence band for a bigger range:
x_range = [0, 2];
x_dummy = linspace(x_range(1), x_range(2), 100);
figure(1); clf(1);
hold all
plot(xdata,ydata,'.')
xlim(x_range)
plot(fit1)
conf1 = predint(fit1,x_dummy,0.68,'functional','off');
plot(x_dummy, conf1, 'r--')
hold off
I am trying to use best-practice techniques from computational fluid dynamics (CFD) to analyze and visualize a velocity field.
Given 6 arrays of moving particles' positions and velocities: x,y,z and vx,vy,vz respectively.
I want to visualize and calculate the induced velocity field and its properties such as: curl, divergence, isosurfaces etc.
Here is a modest script of the volume visualization functions I was able to use without calling meshgrid (to avoid interpolation and more noise).
Ultimately, one of the things that I am not sure about is how to wisely create a mesh grid from my 50 points in space; the second is how to use CFD approaches to visualize the velocity field regardless of the small number of data points.
close all
rng default
t=0.1:0.1:10;
x = sin(t)';
y = cos(t)';
z = (t.^0.2)'; % parentheses so the transpose applies to the whole vector
vx=y;vy=x;vz=z;
figure
subplot(2,3,1);
quiver3(x,y,z,vx,vy,vz);
hold on
streamribbon({ [x y z] }, {vx},{vy},{vz});
subplot(2,3,2);
[curl_val, cav] = curl([x,y,z],[vx,vy,vz]);
surfc([x,y,z],cav);
subplot(2,3,3);
surfc([x,y,z],curl_val);
w = sqrt( vx.^2 + vy.^2 + vz.^2 );
subplot(2,3,4);
quiver3(x,y,z,vx,vy,vz);
streamtube({ [x y z] }, {w});
subplot(2,3,5);
quiver3(x,y,z,vx,vy,vz);
subplot(2,3,6);
surfc([x,y,z],[vx,vy,vz]);
When I run the above script (excluding the data generation) on real data, I get plots which aren't very informative.
I strongly suspect that the problem here is with the data, not the visualization technique. But in general, the problem is one or more of the following:
1) You do not have enough data to capture the underlying dynamics (the dynamics in space operate at a higher spatial frequency than you sampled)
2) The data is too noisy for the number of data points you collected.
3) The flow is fundamentally turbulent, so hoping for a nice laminar-like plot is not realistic.
When you have problems visualizing data, the first rule of thumb is always to throw away any visualization that attempts to approximate a derivative (or gradient) in any way. The reason is that when you try to approximate a derivative with real data, noise almost always makes that estimate nonsense. For example, let's suppose we have a cosine that gets corrupted by some noise, and we try to numerically estimate the derivative from the data:
figure
% Create a signal
dt = .1;
t = 0:.1:10;
x = cos(t);
% Add some noise
y = x + .5 * randn(size(x));
% Compute the first order approximation of the derivatives of the signals
dx = diff(x)/dt;
dy = diff(y)/dt;
% Plot everything
subplot(2,1,1)
plot(t,x,t,y)
axis tight
subplot(2,1,2)
plot(t(2:end),dx,t(2:end),dy)
axis tight
In the first plot, which shows the raw data, the noise doesn't look too bad, but look at the derivative estimate! Ouch... the noise is really amplified. So forget about higher-order properties of a flow, such as the curl and vorticity, which require gradients of the data.
So what can we do in cases like these? Well essentially, just look at the raw data. If there is a pattern, it will reveal itself. For instance, let's look at your raw velocity vectors from 3 different perspectives:
data = dlmread('data.csv'); % dlmread detects the delimiter automatically
x = data(:,1);
y = data(:,2);
z = data(:,3);
vx = data(:,4);
vy = data(:,5);
vz = data(:,6);
close all
figure
subplot(1,3,1);
quiver3(x,y,z,vx,vy,vz);
view([1,0,0])
subplot(1,3,2);
quiver3(x,y,z,vx,vy,vz);
view([0,1,0])
subplot(1,3,3);
quiver3(x,y,z,vx,vy,vz);
view([0,0,1])
The only thing that looks even slightly structured is that last plot. However, that plot tells us that we probably also have turbulence (in addition to noise) to contend with.
Specifically, from view 3, it definitely seems like you are taking velocity measurements in a flow that is tightly hugging an object. In this case, your measurements are probably too close to the object though... and probably in the boundary layer. If that is the case (that the measurements are in the boundary layer), then you can get time-varying effects in the flow, meaning that it doesn't make sense to look at anything without also having a time component. The "nice" plots that you have in your answer are only really helpful when the flow is laminar, where we get to see these nice, consistent streamlines. If it is turbulent, then there is no discernible pattern in the flow, no matter how hard you look.
So in conclusion, I don't think you will be able to find a nice visualization for your data because either the sensors you used were too noisy, or the flow was too turbulent.
As an aside... consider what happens when we look at the raw velocity vectors from your "nice" dataset:
That, my friend, is a well-trained house pet. You have a wild mountain lion on your hands.
Let's say we have the following data:
A1= [41.3251
18.2350
9.9891
36.1722
50.8702
32.1519
44.6284
60.0892
58.1297
34.7482
34.6447
6.7361
1.2960
1.9778
2.0422];
A2=[86.3924
86.4882
86.1717
85.8506
85.8634
86.1267
86.4304
86.6406
86.5022
86.1384
86.5500
86.2765
86.7044
86.8075
86.9007];
When I plot the above data using plot(A1,A2);, I get this graph:
Is there any way to make the graph look smooth like a cubic plot?
Yes you can. You can interpolate in between the keypoints. This will require a bit of trickery though. Blindly using interpolation with any of MATLAB's commands won't work, because they require the independent axis (the x-axis in your case) to be monotonically increasing, and your data doesn't satisfy that... at least not out of the box. Therefore you'll have to create a dummy list of values that spans from 1 up to as many elements as there are in A1 (or A2, as they're both equal in size) to act as an independent axis, and then interpolate both arrays independently over a finer version of that dummy list. The spacing of the finer list is controlled by the total number of new points you want to introduce in the plot: the more points you add, the smaller the spacing, and the smoother the plot should look. Once you do that, plot the interpolated values against each other.
Here's some code for you to run. We will be using interp1 to perform the interpolation and do most of the work. The function linspace creates the finer grid of points in the dummy list to facilitate the interpolation. N is the total number of points you want to plot; I've made it 500 for now, meaning that 500 points will be used for interpolation of your original data. Experiment by increasing (or decreasing) the total number of points and seeing what effect this has on the smoothness of your data.
I'll also be using the Piecewise Cubic Hermite Interpolating Polynomial, or pchip, as the method of interpolation; it is piecewise cubic like a spline, but shape-preserving, so it won't overshoot your data. Assuming that A1 and A2 are already created:
%// Specify number of interpolating points
N = 500;
%// Specify dummy list of points
D = 1 : numel(A1);
%// Generate finer grid of points
NN = linspace(1, numel(A1), N);
%// Interpolate each set of points independently
A1interp = interp1(D, A1, NN, 'pchip');
A2interp = interp1(D, A2, NN, 'pchip');
%// Plot the data
plot(A1interp, A2interp);
I now get the following:
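If you are curious how a true cubic spline compares, you can swap the method string in the same calls; note that 'spline' can overshoot between points, which 'pchip' is designed to avoid:
A1s = interp1(D, A1, NN, 'spline');
A2s = interp1(D, A2, NN, 'spline');
hold on
plot(A1s, A2s, '--'); %// overlay the spline version for comparison
hold off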
I have a Matlab figure I want to use in a paper. This figure contains multiple cdfplots.
Now the problem is that I cannot use the markers, because they become very dense in the plot.
If I want to make the markers sparser, I have to drop some samples from the cdfplot, which will result in a different cdfplot line.
How can I add enough markers while maintaining the actual line?
One method is to get the XData/YData properties from your curves (following solution (1) from @ephsmith) and set them back, reduced. Here is an example for one curve.
y = evrnd(0,3,100,1); %# random data
%# original data
subplot(1,2,1)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
%# reduced data
subplot(1,2,2)
h = cdfplot(y);
set(h,'Marker','*','MarkerSize',8,'MarkerEdgeColor','r','LineStyle','none')
xdata = get(h,'XData');
ydata = get(h,'YData');
set(h,'XData',xdata(1:5:end),'YData',ydata(1:5:end)) %# set both at once so the lengths always match
Another method is to calculate empirical CDF separately using ECDF function, then reduce the results before plotting with PLOT.
y = evrnd(0,3,100,1); %# random data
[f, x] = ecdf(y);
%# original data
subplot(1,2,1)
plot(x,f,'*')
%# reduced data
subplot(1,2,2)
plot(x(1:5:end),f(1:5:end),'r*')
Result
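If you want to keep the stairstep look of the empirical CDF while still thinning the markers, a variation on the same idea (a sketch, reusing f and x from above) is to draw the full line with stairs and overlay a reduced set of points:
stairs(x, f, 'b-') %# full stairstep line
hold on
plot(x(1:5:end), f(1:5:end), 'r*') %# sparse markers on top
hold off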
I know this is potentially unnecessary given MATLAB's built-in functions (in the Statistics Toolbox anyway) but it may be of use to other viewers who do not have access to the toolbox.
The empirical CDF is essentially the cumulative sum of the empirical PMF. The latter is attainable in MATLAB via the hist function. In order to get a nice approximation to the empirical PMF, the number of bins must be selected appropriately. In the following example, I assume that 64 bins is good enough for your data.
%# compute a histogram with 64 bins for the data points stored in y
[f,x]=hist(y,64);
%# convert the frequency points in f to proportions
f = f./sum(f);
%# compute the cumulative sum of the empirical PMF
cmf = cumsum(f);
Now you can choose how many points you'd like to plot by using the reduced data example given by yuk.
n = 20; % number of data markers on the curve
M_n = round(linspace(1, numel(x), n)); % indices of the marked points
% plot the whole line, and markers at the selected points
plot(x, cmf, 'b-', x(M_n), cmf(M_n), 'rs')
Very simple: try reducing the marker size.
x = rand(10000,1);
y = x + rand(10000,1);
plot(x,y,'b.','markersize',1);
For publishing purposes I tend to use the plot tools on the figure window. This allows you to tweak all of the plot parameters and immediately see the result.
If the problem is that you have too many data points, you can:
1). Plot using every nth sample of the data. Experiment to find an n that results in the look you want.
2). I typically fit curves to my data and add a few sparsely placed markers to plots of the fits to differentiate the curves.
Honestly, for publishing purposes I have always found that choosing different 'LineStyle' or 'LineWidth' properties for the lines gives much cleaner results than using different markers. This would also be a lot easier than trying to downsample your data, and for plots made with CDFPLOT I find that markers simply occlude the stairstep nature of the lines.
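For example, a minimal sketch of that style-based approach (y1 and y2 stand in for your own samples):
h1 = cdfplot(y1);
set(h1, 'LineStyle', '-', 'LineWidth', 1.5)
hold on
h2 = cdfplot(y2);
set(h2, 'LineStyle', '--', 'LineWidth', 1.5)
hold off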
First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +- 2 standard deviations, but I don't think that is acceptable for a non-normal distribution. My sample size is currently set to 1000 samples, which would seem like enough to determine whether it is a normal distribution or not.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining if my data is normally distributed. I mean I can currently tell just by looking at the histogram or pdf from ksdensity, but is there a way to quantitatively measure it?
Thanks!
So there are a couple of questions there. Here are some suggestions:
You are right that the mean of 1000 samples should be normally distributed (unless your data is "heavy tailed", which I'm assuming is not the case). To get a 1-alpha confidence interval for the mean (in your case alpha = 0.05) you can use the norminv function. For example, say we want a 95% CI for the mean of a data sample X; then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (normally distributed)
sig = std(X)/sqrt(N); % sample standard deviation of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing whether a data sample is normally distributed can be done in a lot of ways. One simple method is with a QQ plot. To do this, use qqplot(X) where X is your data sample. If the result is approximately a straight line, the sample is normal; if the result is not a straight line, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))
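If you want a quantitative measure to go with the visual check, the Statistics Toolbox also has formal normality tests; for example, the Lilliefors test:
[h, p] = lillietest(X); % h = 1 rejects normality at the 5% level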
You might consider, also, using bootstrapping, with the bootci function.
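For example, a minimal bootci sketch (X being your data vector):
CI = bootci(2000, @mean, X); % 95% bootstrap CI for the mean, 2000 resamples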
You may use the method proposed in [1]:
median +/- 1.7 * (1.25*R) / (1.35*sqrt(N))
where R is the interquartile range and N is the sample size.
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians differ significantly at roughly the 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
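A minimal sketch of that interval in Matlab (X being your data vector; iqr needs the Statistics Toolbox, and the constants come straight from the formula above):
R = iqr(X); % interquartile range
N = numel(X);
CI = median(X) + [-1.7, 1.7] * (1.25*R) / (1.35*sqrt(N))
boxplot(X, 'Notch', 'on') % the notches visualize the same interval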
Are you sure you need confidence intervals or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector holding independent identically distributed samples of random variables, you can get some useful information by running
y = prctile(x, [5 50 95])
This will return in [y(1), y(3)] the range where 90% of your samples occur. And in y(2) you get the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')
I have not used Matlab, but from my understanding of statistics, if your distribution cannot be assumed to be a normal distribution, then you should treat it as a Student's t distribution and calculate the confidence interval and accuracy accordingly.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
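For completeness, a sketch of that t-based interval in Matlab (tinv is in the Statistics Toolbox; X stands in for your sample):
N = numel(X);
mu = mean(X);
se = std(X)/sqrt(N); % standard error of the mean
CI = mu + tinv([0.025, 0.975], N-1)*se % 95% t-interval for the mean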