Matlab boxplot adjacent values - matlab

I found that calculating an index to specify outliers of a dataset according to how the boxplot works does not give the same results. Please find below an example where I create some data, extract the values from the boxplot (as seen in datatips in the figure window) and compare them to the values I calculated.
While the median and quartiles match up the upper and lower adjacent values do not. According to the Matlab help under 'Whisker', the adjacent values are calculated as q3 + w*(q3-q1) where q3 and q1 are the quantiles and w is the specified whisker length.
Am I calculating this wrong or is there any other mistake? I would like to be able to explain the error.
Screenshot of results table (please note the results vary due to random data)
%Boxplot test
% create random, normally distributed dataset
data = round(randn(1000,1)*10,2);
figure(10)
clf
boxplot(data,'Whisker',1.5)
clear stats tmp
% read data from boxplot, same values as can be seen in datatips in the figure window
h = findobj(gcf,'tag','Median');
tmp = get(h,'YData');
stats(1,1) = tmp(1);
h = findobj(gcf,'tag','Box');
tmp = get(h,'YData');
stats(1,2) = tmp(1);
stats(1,3) = tmp(2);
h = findobj(gcf,'tag','Upper Adjacent Value');
tmp = get(h,'YData');
stats(1,4) = tmp(1);
h = findobj(gcf,'tag','Lower Adjacent Value');
tmp = get(h,'YData');
stats(1,5) = tmp(1);
% calculated data
stats(2,1) = median(data);
stats(2,2) = quantile(data,0.25);
stats(2,3) = quantile(data,0.75);
range = stats(2,3) - stats(2,2);
stats(2,4) = stats(2,3) + 1.5*range;
stats(2,5) = stats(2,2) - 1.5*range;
% error calculation
for k=1:size(stats,2)
stats(3,k) = stats(2,k)-stats(1,k);
end %for k
% convert results to table with labels
T = array2table(stats,'VariableNames',{'Median','P25','P75','Upper','Lower'}, ...
'RowNames',{'Boxplot','Calculation','Error'});

While the calculation of the boundaries, e.g. q3 = q3 + w*(q3-q1), is correct, it is not displayed in the boxplot. What is actually displayed and marked as upper/lower adjacent value is the minimum and maximum of the values within the aforementioned boundaries.
Regarding the initial task leading to the question: For applying the same filtering of outliers as in the boxplot the calculated boundaries can be used.

Related

FInd the intersect between timestamps in Matlab

I want to find the intersect between 2 time vectors t1 and t5 which have some gaps marked by the stars as in the figure. Because intersect function in matlab just find exactly value so I have to use ismembertol. My result is the middle line, which is missing the gap information in the t5 vector. How can I achieve this? This is my code:
`
tol = 1e-08; Fs = 50;
[a,b] = ismembertol(t1,t5,tol);
tcom15 = t1(a);
t1gap = t1(find(round(diff(t1)* 86400*Fs)>1));
t5gap = t5(find(round(diff(t5)* 86400*Fs)>1));
tcom15gap = tcom15(find(round(diff(tcom15)* 86400*Fs)>1));
figure; plot(t1,2*ones(length(t1),1)); hold on
plot(t5,3*ones(length(t5),1));ylim([1 4])
plot(t1gap,2*ones(length(t1gap),1),':*','MarkerSize',5)
plot(t5gap,3*ones(length(t5gap),1),':*','MarkerSize',10)
plot(tcom15,2.5*ones(length(tcom15),1))
plot(tcom15gap,2.5*ones(length(tcom15gap),1),':*','MarkerSize',10)

How to compute confidence intervals and plot them on a bar plot

How can I plot a bar out of a
data = 1x10 cell
, where each value in the cell has a different dimension like 3x100, 3x40, 66x2 etc.
My goal is to get a bar plot, where I would have 10 group of bars and in every group three bars for each of the values. On the bar, I want it to be shown the median of the values, and I want to calculate the confidence interval and show it additionally.
On this example there are not group of bars, but my point is to show you how I want the confidence intervals shown. On the site, where I found this example they offer a solution where they have this command line
e1 = errorbar(mean(data), ci95);
but I have the problem that it can't find any ci95
So, are there any other effective ways to do it, without installing or downloading additional services?
I've found Patrick Happel's answer to not work because the figure window (and therefore the variable b) gets cleared out by subsequent calls to errorbar. Simply adding a hold on command takes care of this. To avoid confusion, here's a new answer that reproduces all of Patrick's original code, plus my small tweak:
%% Old answer
%Just to be safe, let's clear everything
clear all
data = cell(1,10);
% Random length of the data
l = randi(500, 10, 1) + 50;
% Random "width" of the data, with 3 more likely
w = randi(4, 10, 1);
w(w==4) = 3;
% random "direction" of the data
d = randi(2, 10, 1);
% sigma of the data (in fraction of mean)
sigma = rand(10,1) / 3;
% means of the data
dmean = randi(150,10,1);
dsigma = dmean.*sigma;
for c = 1 : 10
if d(c) == 1
data{c} = randn(l(c), w(c)) .* dsigma(c) + dmean(c);
else
data{c} = randn(w(c), l(c)) .* dsigma(c) + dmean(c);
end
end
%============================================
%Next thing is
% On the bar, I want it to be shown the median of the values, and I
% want to calculate the confidence interval and show it additionally.
%
%Are you really sure you want to plot the median? The median of some data
%is not connected to the variance of the data, and hus no type of error
%bars are required. I guess you want to show the mean. If you really want
%to show the median, a box plot might be a better alternative.
%
%The following code computes and plots the mean in a bar plot:
%============================================
means = zeros(numel(data),3);
stds = zeros(numel(data),3);
n = zeros(numel(data),3);
for c = 1:numel(data)
d = data{c};
if size(d,1) < size(d,2)
d = d';
end
cols = size(d,2);
means(c, 1:cols) = nanmean(d);
stds(c, 1:cols) = nanstd(d);
n(c, 1:cols) = sum(~isnan((d)));
end
b = bar(means);
%% New code
%This ensures that b continues to reference existing data in the next for
%loop, as the graphics objects can otherwise be deleted.
hold on
%% Continuing Patrick Happel's answer
%============================================
%Now, we need to compute the length of the error bars. Typical choices are
%the standard deviation of the data (already computed by the code above,
%stored in stds), the standard error or the 95% confidence interval (which
%is the 1.96fold of the standard error, assuming the underlying data
%follows a normal distribution).
%============================================
% for standard deviation use stds
% for standard error
ste = stds./sqrt(n);
% for 95% confidence interval
ci95 = 1.96 * ste;
%============================================
%Last thing is to plot the error bars. Here I chose the ci95 as you asked
%in your question, if you want to change that, simply change the variable
%in the call to errorbar:
%============================================
for c = 1:3
size(means(:, c))
size(b(c).XData)
e = errorbar(b(c).XData + b(c).XOffset, means(:,c), ci95(:, c));
e.LineStyle = 'none';
end
Since I am not sure how your data looks like, since in your question you stated that the elements of the cell contain data with different dimension like
3x100, 3x40, 66x2
I assume that your data can be arranged in columns or rows and that not all data requires three bars.
Since you did not provide a short piece of your data for us to test, I generate some artificial data:
data = cell(1,10);
% Random length of the data
l = randi(500, 10, 1) + 50;
% Random "width" of the data, with 3 more likely
w = randi(4, 10, 1);
w(w==4) = 3;
% random "direction" of the data
d = randi(2, 10, 1);
% sigma of the data (in fraction of mean)
sigma = rand(10,1) / 3;
% means of the data
dmean = randi(150,10,1);
dsigma = dmean.*sigma;
for c = 1 : 10
if d(c) == 1
data{c} = randn(l(c), w(c)) .* dsigma(c) + dmean(c);
else
data{c} = randn(w(c), l(c)) .* dsigma(c) + dmean(c);
end
end
Next thing is
On the bar, I want it to be shown the median of the values, and I want to calculate the confidence interval and show it additionally.
Are you really sure you want to plot the median? The median of some data is not connected to the variance of the data, and hus no type of error bars are required. I guess you want to show the mean. If you really want to show the median, a box plot might be a better alternative.
The following code computes and plots the mean in a bar plot:
means = zeros(numel(data),3);
stds = zeros(numel(data),3);
n = zeros(numel(data),3);
for c = 1:numel(data)
d = data{c};
if size(d,1) < size(d,2)
d = d';
end
cols = size(d,2);
means(c, 1:cols) = nanmean(d);
stds(c, 1:cols) = nanstd(d);
n(c, 1:cols) = sum(~isnan((d)));
end
b = bar(means);
Now, we need to compute the length of the error bars. Typical choices are the standard deviation of the data (already computed by the code above, stored in stds), the standard error or the 95% confidence interval (which is the 1.96fold of the standard error, assuming the underlying data follows a normal distribution).
% for standard deviation use stds
% for standard error
ste = stds./sqrt(n);
% for 95% confidence interval
ci95 = 1.96 * ste;
Last thing is to plot the error bars. Here I chose the ci95 as you asked in your question, if you want to change that, simply change the variable in the call to errorbar:
for c = 1:3
size(means(:, c))
size(b(c).XData)
e = errorbar(b(c).XData + b(c).XOffset, means(:,c), ci95(:, c));
e.LineStyle = 'none';
end

defining the X values for a code

I have this task to create a script that acts similarly to normcdf on matlab.
x=linspace(-5,5,1000); %values for x
p= 1/sqrt(2*pi) * exp((-x.^2)/2); % THE PDF for the standard normal
t=cumtrapz(x,p); % the CDF for the standard normal distribution
plot(x,t); %shows the graph of the CDF
The problem is when the t values are assigned to 1:1000 instead of -5:5 in increments. I want to know how to assign the correct x values, that is -5:5,1000 to the t values output? such as when I do t(n) I get the same result as normcdf(n).
Just to clarify: the problem is I cannot simply say t(-5) and get result =1 as I would in normcdf(1) because the cumtrapz calculated values are assigned to x=1:1000 instead of -5 to 5.
Updated answer
Ok, having read your comment; here is how to do what you want:
x = linspace(-5,5,1000);
p = 1/sqrt(2*pi) * exp((-x.^2)/2);
cdf = cumtrapz(x,p);
q = 3; % Query point
disp(normcdf(q)) % For reference
[~,I] = min(abs(x-q)); % Find closest index
disp(cdf(I)) % Show the value
Sadly, there is no matlab syntax which will do this nicely in one line, but if you abstract finding the closest index into a different function, you can do this:
cdf(findClosest(x,q))
function I = findClosest(x,q)
if q>max(x) || q<min(x)
warning('q outside the range of x');
end
[~,I] = min(abs(x-q));
end
Also; if you are certain that the exact value of the query point q exists in x, you can just do
cdf(x==q);
But beware of floating point errors though. You may think that a certain range outght to contain a certain value, but little did you know it was different by a tiny roundoff erorr. You can see that in action for example here:
x1 = linspace(0,1,1000); % Range
x2 = asin(sin(x1)); % Ought to be the same thing
plot((x1-x2)/eps); grid on; % But they differ by rougly 1 unit of machine precision
Old answer
As far as I can tell, running your code does reproduce the result of normcdf(x) well... If you want to do exactly what normcdf does them use erfc.
close all; clear; clc;
x = linspace(-5,5,1000);
cdf = normcdf(x); % Result of normcdf for comparison
%% 1 Trapezoidal integration of normal pd
p = 1/sqrt(2*pi) * exp((-x.^2)/2);
cdf1 = cumtrapz(x,p);
%% 2 But error function IS the integral of the normal pd
cdf2 = (1+erf(x/sqrt(2)))/2;
%% 3 Or, even better, use the error function complement (works better for large negative x)
cdf3 = erfc(-x/sqrt(2))/2;
fprintf('1: Mean error = %.2d\n',mean(abs(cdf1-cdf)));
fprintf('2: Mean error = %.2d\n',mean(abs(cdf2-cdf)));
fprintf('3: Mean error = %.2d\n',mean(abs(cdf3-cdf)));
plot(x,cdf1,x,cdf2,x,cdf3,x,cdf,'k--');
This gives me
1: Mean error = 7.83e-07
2: Mean error = 1.41e-17
3: Mean error = 00 <- Because that is literally what normcdf is doing
If your goal is not not to use predefined matlab funcitons, but instead to calculate the result numerically (i.e. calculate the error function) then it's an interesting challange which you can read about for example here or in this stats stackexchange post. Just as an example, the following piece of code calculates the error function by implementing eq. 2 form the first link:
nerf = #(x,n) (-1)^n*2/sqrt(pi)*x.^(2*n+1)./factorial(n)/(2*n+1);
figure(1); hold on;
temp = zeros(size(x)); p =[];
for n = 0:20
temp = temp + nerf(x/sqrt(2),n);
if~mod(n,3)
p(end+1) = plot(x,(1+temp)/2);
end
end
ylim([-1,2]);
title('\Sigma_{n=0}^{inf} ( 2/sqrt(pi) ) \times ( (-1)^n x^{2*n+1} ) \div ( n! (2*n+1) )');
p(end+1) = plot(x,cdf,'k--');
legend(p,'n = 0','\Sigma_{n} 0->3','\Sigma_{n} 0->6','\Sigma_{n} 0->9',...
'\Sigma_{n} 0->12','\Sigma_{n} 0->15','\Sigma_{n} 0->18','normcdf(x)',...
'location','southeast');
grid on; box on;
xlabel('x'); ylabel('norm. cdf approximations');
Marcin's answer suggests a way to find the nearest sample point. It is easier, IMO, to interpolate. Given x and t as defined in the question,
interp1(x,t,n)
returns the estimated value of the CDF at x==n, for whatever value of n. But note that, for values outside the computed range, it will extrapolate and produce unreliable values.
You can define an anonymous function that works like normcdf:
my_normcdf = #(n)interp1(x,t,n);
my_normcdf(-5)
Try replacing x with 0.01 when you call cumtrapz. You can either use a vector or a scalar spacing for cumtrapz (https://www.mathworks.com/help/matlab/ref/cumtrapz.html), and this might solve your problem. Also, have you checked the original x-values? Is the problem with linspace (i.e. you are not getting the correct x vector), or with cumtrapz?

Reverse-calculating original data from a known moving average

I'm trying to estimate the (unknown) original datapoints that went into calculating a (known) moving average. However, I do know some of the original datapoints, and I'm not sure how to use that information.
I am using the method given in the answers here: https://stats.stackexchange.com/questions/67907/extract-data-points-from-moving-average, but in MATLAB (my code below). This method works quite well for large numbers of data points (>1000), but less well with fewer data points, as you'd expect.
window = 3;
datapoints = 150;
data = 3*rand(1,datapoints)+50;
moving_averages = [];
for i = window:size(data,2)
moving_averages(i) = mean(data(i+1-window:i));
end
length = size(moving_averages,2)+(window-1);
a = (tril(ones(length,length),window-1) - tril(ones(length,length),-1))/window;
a = a(1:length-(window-1),:);
ai = pinv(a);
daily = mtimes(ai,moving_averages');
x = 1:size(data,2);
figure(1)
hold on
plot(x,data,'Color','b');
plot(x(window:end),moving_averages(window:end),'Linewidth',2,'Color','r');
plot(x,daily(window:end),'Color','g');
hold off
axis([0 size(x,2) min(daily(window:end))-1 max(daily(window:end))+1])
legend('original data','moving average','back-calculated')
Now, say I know a smattering of the original data points. I'm having trouble figuring how might I use that information to more accurately calculate the rest. Thank you for any assistance.
You should be able to calculate the original data exactly if you at any time can exactly determine one window's worth of data, i.e. in this case n-1 samples in a window of length n. (In your case) if you know A,B and (A+B+C)/3, you can solve now and know C. Now when you have (B+C+D)/3 (your moving average) you can exactly solve for D. Rinse and repeat. This logic works going backwards too.
Here is an example with the same idea:
% the actual vector of values
a = cumsum(rand(150,1) - 0.5);
% compute moving average
win = 3; % sliding window length
idx = hankel(1:win, win:numel(a));
m = mean(a(idx));
% coefficient matrix: m(i) = sum(a(i:i+win-1))/win
A = repmat([ones(1,win) zeros(1,numel(a)-win)], numel(a)-win+1, 1);
for i=2:size(A,1)
A(i,:) = circshift(A(i-1,:), [0 1]);
end
A = A / win;
% solve linear system
%x = A \ m(:);
x = pinv(A) * m(:);
% plot and compare
subplot(211), plot(1:numel(a),a, 1:numel(m),m)
legend({'original','moving average'})
title(sprintf('length = %d, window = %d',numel(a),win))
subplot(212), plot(1:numel(a),a, 1:numel(a),x)
legend({'original','reconstructed'})
title(sprintf('error = %f',norm(x(:)-a(:))))
You can see the reconstruction error is very small, even using the data sizes in your example (150 samples with a 3-samples moving average).

Matlab fourier descriptors what's wrong?

I am using Gonzalez frdescp function to get Fourier descriptors of a boundary. I use this code, and I get two totally different sets of numbers describing two identical but different in scale shapes.
So what is wrong?
im = imread('c:\classes\a1.png');
im = im2bw(im);
b = bwboundaries(im);
f = frdescp(b{1}); // fourier descriptors for the boundary of the first object ( my pic only contains one object anyway )
// Normalization
f = f(2:20); // getting the first 20 & deleting the dc component
f = abs(f) ;
f = f/f(1);
Why do I get different descriptors for identical - but different in scale - two circles?
The problem is that the frdescp code (I used this code, that should be the same as referred by you) is written also in order to center the Fourier descriptors.
If you want to describe your shape in a correct way, it is mandatory to mantain some descriptors that are symmetric with respect to the one representing the DC component.
The following image summarize the concept:
In order to solve your problem (and others like yours), I wrote the following two functions:
function descriptors = fourierdescriptor( boundary )
%I assume that the boundary is a N x 2 matrix
%Also, N must be an even number
np = size(boundary, 1);
s = boundary(:, 1) + i*boundary(:, 2);
descriptors = fft(s);
descriptors = [descriptors((1+(np/2)):end); descriptors(1:np/2)];
end
function significativedescriptors = getsignificativedescriptors( alldescriptors, num )
%num is the number of significative descriptors (in your example, is was 20)
%In the following, I assume that num and size(alldescriptors,1) are even numbers
dim = size(alldescriptors, 1);
if num >= dim
significativedescriptors = alldescriptors;
else
a = (dim/2 - num/2) + 1;
b = dim/2 + num/2;
significativedescriptors = alldescriptors(a : b);
end
end
Know, you can use the above functions as follows:
im = imread('test.jpg');
im = im2bw(im);
b = bwboundaries(im);
b = b{1};
%force the number of boundary points to be even
if mod(size(b,1), 2) ~= 0
b = [b; b(end, :)];
end
%define the number of significative descriptors I want to extract (it must be even)
numdescr = 20;
%Now, you can extract all fourier descriptors...
f = fourierdescriptor(b);
%...and get only the most significative:
f_sign = getsignificativedescriptors(f, numdescr);
I just went through the same problem with you.
According to this link, if you want invariant to scaling, make the comparison ratio-like, for example by dividing every Fourier coefficient by the DC-coefficient. f*1 = f1/f[0], f*[2]/f[0], and so on. Thus, you need to use the DC-coefficient where the f(1) in your code is not the actual DC-coefficient after your step "f = f(2:20); % getting the first 20 & deleting the dc component". I think the problem can be solved by keeping the value of the DC-coefficient first, the code after adjusted should be like follows:
% Normalization
DC = f(1);
f = f(2:20); % getting the first 20 & deleting the dc component
f = abs(f) ; % use magnitudes to be invariant to translation & rotation
f = f/DC; % divide the Fourier coefficients by the DC-coefficient to be invariant to scale