How to remove the outliers located outside the predication bound in Matlab? - matlab

Hey guys I have one question related to processing of Time series, I have xy data and want to remove the outliers, so i defined it by ones that located outside the the prediction bound, I applied the regress functions [B, Bint, R, Rint, stats] = regress(y, x);but iam confused how to remove that ones?
any help??

Straight from the docs
[b,bint,r,rint] = regress(y,X) returns an n-by-2 matrix rint of
intervals that can be used to diagnose outliers. If the interval
rint(i,:) for observation i does not contain zero, the corresponding
residual is larger than expected in 95% of new observations,
suggesting an outlier.
Therefore, to find the location of outliers in your data, it should be just:
n = rint(:,1)>0|rint(:,2)<0;
Then you can either remove them, plot them in a different colour, or whatever.

Related

MATLAB - Changepoint analysis or "findchangepts": How does it work?

I am using the function findchangepts and use 'linear' which detects changes in mean and slope. How does it note a change? Is it by consecutive points until the next point has a different mean and slope?
Mathworks has the following explanation:
If x is a vector with N elements, then findchangepts partitions x into two regions, x(1:ipt-1) and x(ipt:N), that minimize the sum of the residual (squared) error of each region from its local mean.
How does the function get ipt?
Thanks in advance!
I am working with a single vector with N elements.

How to apply a moving median filter on a time series of 2D scans in Matlab?

I have a huge set of data of a timelapse of 2D laser scans of waves running up and down stairs (see fig.1fig.2fig.3).
There is a lot of noise in the scans, since the water splashes a lot.
Now I want to smoothen the scans.
I have 2 questions:
How do I apply a moving median filter (as recommended by another study dealing with a similar problem)? I can only find instructions for single e.g. (x,y) or (t,y) plots but not for x and y values that vary over time. Maybe an average filter would do it as well, but I do not have a clue on that either.
The scanner is at a fixed point (222m) so all the data spikes point towards that point at the ceiling. Is it possible or necessary to include this into the smoothing process?
This is the part of the code (I hope it's enough to get it):
% Plot data as real time profile
x1=data.x;y1=data.y;
t=data.t;
% add moving median filter here?
h1=plot(x1(1,:),y1(1,:));
axis([210 235 3 9])
ht=title('Scanner data');
for i=1:1:length(t);
set(h1,'XData',x1(i,:),'YData',y1(i,:));set(ht,'String',sprintf('t = %5.2f
s',data.t(i)));pause(.01);end
The data.x values are stored in a (mxn) matrix in which the change in time is arranged vertically and the x values i.e. "laser points" of the scanner are horizontally arranged. The data.y is stored in the same way. The data.t values are stored in a (mx1) matrix.
I hope I explained everything clearly and that somebody can help me. I am already pretty desperate about it... If there is anything missing or confusing, please let me know.
If you're trying to apply a median filter in the x-y plane, then consider using medfilt2 from the Image Processing Toolbox. Note that this function only accepts 2-D inputs, so you'll have to loop over the third dimension.
Also note that medfilt2 assumes that the x and y data are uniformly spaced, so if your x and y data don't fall onto a uniformly spaced grid you may have to manually loop over indices, extract the corresponding patches, and compute the median.
If you can/want to apply an averaging filter instead of a median filter, and if you have uniformly spaced data, then you can use convn to compute a k x k moving average by doing:
y = convn(x, ones(k,k)/(k*k), 'same');
Note that you'll get some bias on the boundaries because you're technically trying to compute an average of k^2 pixels when you have less than that number of values available.
Alternatively, you can use nested calls to movmean since the averaging operation is separable:
y = movmean(movmean(x, k, 2), k, 1);
If your grid is separable, but not uniform, you can still use movmean, just use the SamplePoints name-value pair:
y = movmean(movmean(x, k, 2, 'SamplePoints', yv), k, 1, 'SamplePoints', xv);
You can also control the endpoint handling in movmean with the Endpoints name-value pair.

Kullback Leibler Divergence of 2 Histograms in MatLab

I would like a function to calculate the KL distance between two histograms in MatLab. I tried this code:
http://www.mathworks.com/matlabcentral/fileexchange/13089-kldiv
However, it says that I should have two distributions P and Q of sizes n x nbins. However, I am having trouble understanding how the author of the package wants me to arrange the histograms. I thought that providing the discretized values of the random variable together with the number of bins would suffice (I would assume the algorithm would use an arbitrary support to evaluate the expectations).
Any help is appreciated.
Thanks.
The function you link to requires that the two histograms passed be aligned and thus have the same length NBIN x N (not N X NBIN), that is, if N>1 then the number of rows in the inputs should be equal to the number of bins in the histograms. If you are just going to compare two histograms (that is if N=1) it doesn't really matter, you can pass either row or column vector versions of these as long as you are consistent and the order of bins matches.
A generic call to the function looks like this:
dists = kldiv(bins,P,Q)
The implementation allows comparison of multiple histograms to each other (that is, N>1), in which case pairs of columns (with matching column index) in each array are compared and the result is a row vector with distances for each matching pair.
Array bins should be the same size as P and Q and is used to perform a very minimal check that the inputs are of the same size, but is not used in the computation. The routine expects bins to contain the numeric labels of your bins so that it can check for repeated bin labels and warn you if repeats occur, but otherwise doesn't use the information.
You could do away with bins and compute the distance with
KL = sum(P .* (log2(P)-log2(Q)));
without using the Matlab Central versions. However the version you link to performs the abovementioned minimal checks and in addition allows computation of two alternative distances (consult the documentation).
The version linked to by eigenchris checks that no histogram bins are empty (which would make the computation blow up numerically) and if there are, removes their contribution to the sum (not sure this is entirely appropriate - consult an expert on the subject). It should probably also be aware of the exact form of the formula, specifically note the use of log2 above versus natural logarithm in the version linked to by eigenchris.

Index synchronization

I have a one column which shows my distance and another column with corresponding time like below:
Y=[1;3;6;7.5;7;10;13] and t=[1;2;3;5;8;10;11]
In case of plotting, it would be easy just using below command:
plot(t,Y)
which gives me what I want, a vector of distance according to corresponding times but what if I want to use another function like findpeak on mix of my column not just over one column like Y.
Indeed I want to find the local peak in a vector which its time sequence in not completed. I have to interpolate missing data in Y at the corresponding time by NaN. Indeed my data should be Like below: Y=[1;3;6;NaN;7.5;NaN;NaN;7;NaN;10;13] and t=[1;2;3;4;5;6;7;8;9;10;11]. Firstly, how to do that, and secondly how could I find the peaks in Y vector with NaN.
I hope I well explained the problem.
Thanks
This is two questions (which have been answered on SO before): first question is how to interpolate within a dataset to estimate missing points. Use interp1 to interpolate missing entries:
Y=[1;3;6;NaN;7.5;NaN;NaN;7;NaN;10;13];
t=[1;2;3;4;5;6;7;8;9;10;11];
% here's where it all happens ...
ind = ~isnan(Y);
Ynew = interp1(t(ind),Y(ind),t,'cubic');
figure
plot(t,Y,'o')
hold on
plot(t,Ynew,':.')
The second question is how to find peaks. This is a more open ended question which depends on what you are looking for (are local maxima ok, what kind of thresholds are you using?). You can use findpeaks for instance:
[ypeaks ipeaks]=findpeaks(Ynew);
plot(t(ipeaks),ypeaks,'*r')

Clustering an image using Gaussian mixture models

I want to use GMM(Gaussian mixture models for clustering a binary image and also want to plot the cluster centroids on the binary image itself.
I am using this as my reference:
http://in.mathworks.com/help/stats/gaussian-mixture-models.html
This is my initial code
I=im2double(imread('sil10001.pbm'));
K = I(:);
mu=mean(K);
sigma=std(K);
P=normpdf(K, mu, sigma);
Z = norminv(P,mu,sigma);
X = mvnrnd(mu,sigma,1110);
X=reshape(X,111,10);
scatter(X(:,1),X(:,2),10,'ko');
options = statset('Display','final');
gm = fitgmdist(X,2,'Options',options);
idx = cluster(gm,X);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(X(cluster1,1),X(cluster1,2),10,'r+');
hold on
scatter(X(cluster2,1),X(cluster2,2),10,'bo');
hold off
legend('Cluster 1','Cluster 2','Location','NW')
P = posterior(gm,X);
scatter(X(cluster1,1),X(cluster1,2),10,P(cluster1,1),'+')
hold on
scatter(X(cluster2,1),X(cluster2,2),10,P(cluster2,1),'o')
hold off
legend('Cluster 1','Cluster 2','Location','NW')
clrmap = jet(80); colormap(clrmap(9:72,:))
ylabel(colorbar,'Component 1 Posterior Probability')
But the problem is that I am unable to plot the cluster centroids received from GMM in the primary binary image.How do i do this?
**Now suppose i have 10 such images in a sequence And i want to store the information of their mean position in two cell array then how do i do that.This is my code foe my new question **
images=load('gait2go.mat');%load the matrix file
for i=1:10
I{i}=images.result{i};
I{i}=im2double(I{i});
%determine 'white' pixels, size of image can be [M N], [M N 3] or [M N 4]
Idims=size(I{i});
whites=true(Idims(1),Idims(2));
df=I{i};
%we add up the various color channels
for colori=1:size(df,3)
whites=whites & df(:,:,colori)>0.5;
end
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(whites);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
cInd = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
%get clusterwise means
meanx=zeros(K,1);
meany=zeros(K,1);
for i=1:K
meanx(i)=mean(datax(cInd==i));
meany(i)=mean(datay(cInd==i));
end
xc{i}=meanx(i);%cell array contaning the position of the mean for the 10
images
xb{i}=meany(i);
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to
image
axis equal;
hold on;
scatter(meany,-meanx,20,'+'); %same funky coordinates
end
I am able to get 10 images segmented but no the values of themean stored in the cell arrays xc and xb.They r only storing [] in place of the values of means
I decided to post an answer to your question (where your question was determined by a maximum-likelihood guess:P), but I wrote an extensive introduction. Please read carefully, as I think you have difficulties understanding the methods you want to use, and you have difficulties understanding why others can't help you with your usual approach of asking questions. There are several problems with your question, both code-related and conceptual. Let's start with the latter.
The problem with the problem
You say that you want to cluster your image with Gaussian mixture modelling. While I'm generally not familiar with clustering, after a look through your reference and the wonderful SO answer you cited elsewhere (and a quick 101 from #rayryeng) I think you are on the wrong track altogether.
Gaussian mixture modelling, as its name suggests, models your data set with a mixture of Gaussian (i.e. normal) distributions. The reason for the popularity of this method is that when you do measurements of all sorts of quantities, in many cases you will find that your data is mostly distributed like a normal distribution (which is actually the reason why it's called normal). The reason behind this is the central limit theorem, which implies that the sum of reasonably independent random variables tends to be normal in many cases.
Now, clustering, on the other hand, simply means separating your data set into disjoint smaller bunches based on some criteria. The main criterion is usually (some kind of) distance, so you want to find "close lumps of data" in your larger data set. You usually need to cluster your data before performing a GMM, because it's already hard enough to find the Gaussians underlying your data without having to guess the clusters too. I'm not familiar enough with the procedures involved to tell how well GMM algorithms can work if you just let them work on your raw data (but I expect that many implementations start with a clustering step anyway).
To get closer to your question: I guess you want to do some kind of image recognition. Looking at the picture, you want to get more strongly correlated lumps. This is clustering. If you look at a picture of a zoo, you'll see, say, an elephant and a snake. Both have their distinct shapes, and they are well separated from one another. If you cluster your image (and the snake is not riding the elephant, neither did it eat it), you'll find two lumps: one lump elephant-shaped, and one lump snake-shaped. Now, it wouldn't make sense to use GMM on these data sets: elephants, and especially snakes, are not shaped like multivariate Gaussian distributions. But you don't need this in the first place, if you just want to know where the distinct animals are located in your picture.
Still staying with the example, you should make sure that you cluster your data into an appropriate number of subsets. If you try to cluster your zoo picture into 3 clusters, you might get a second, spurious snake: the nose of the elephant. With an increasing number of clusters your partitioning might make less and less sense.
Your approach
Your code doesn't give you anything reasonable, and there's a very good reason for that: it doesn't make sense from the start. Look at the beginning:
I=im2double(imread('sil10001.pbm'));
K = I(:);
mu=mean(K);
sigma=std(K);
X = mvnrnd(mu,sigma,1110);
X=reshape(X,111,10);
You read your binary image, convert it to double, then stretch it out into a vector and compute the mean and deviation of that vector. You basically smear your intire image into 2 values: an average intensity and a deviation. And THEN you generate 111*10 standard normal points with these parameters, and try to do GMM on the first two sets of 111. Which are both independently normal with the same parameter. So you probably get two overlapping Gaussians around the same mean with the same deviation.
I think the examples you found online confused you. When you do GMM, you already have your data, so no pseudo-normal numbers should be involved. But when people post examples, they also try to provide reproducible inputs (well, some of them do, nudge nudge wink wink). A simple method for this is to generate a union of simple Gaussians, which can then be fed into GMM.
So, my point is, that you don't have to generate random numbers, but have to use the image data itself as input to your procedure. And you probably just want to cluster your image, instead of actually using GMM to draw potatoes over your cluster, since you want to cluster body parts in an image about a human. Most body parts are not shaped like multivariate Gaussians (with a few distinct exceptions for men and women).
What I think you should do
If you really want to cluster your image, like in the figure you added to your question, then you should use a method like k-means. But then again, you already have a program that does that, don't you? So I don't really think I can answer the question saying "How can I cluster my image with GMM?". Instead, here's an answer to "How can I cluster my image?" with k-means, but at least there will be a piece of code here.
%set infile to what your image file will be
infile='sil10001.pbm';
%read file
I=im2double(imread(infile));
%determine 'white' pixels, size of image can be [M N], [M N 3] or [M N 4]
Idims=size(I);
whites=true(Idims(1),Idims(2));
%we add up the various color channels
for colori=1:Idims(3)
whites=whites & I(:,:,colori)>0.5;
end
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(whites);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
cInd = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
%get clusterwise means
meanx=zeros(K,1);
meany=zeros(K,1);
for i=1:K
meanx(i)=mean(datax(cInd==i));
meany(i)=mean(datay(cInd==i));
end
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
axis equal;
hold on;
scatter(meany,-meanx,20,'ko'); %same funky coordinates
Here's what this does. It first reads your image as double like yours did. Then it tries to determine "white" pixels by checking that each color channel (of which can be either 1, 3 or 4) is brighter than 0.5. Then your input data points to the clustering will be the x and y "coordinates" (i.e. indices) of your white pixels.
Next it does the clustering via kmeans. This part of the code is loosely based on the already cited answer of Amro. I had to set a large maximal number of iterations, as the problem is ill-posed in the sense that there aren't 10 clear clusters in the picture. Then we compute the mean for each cluster, and plot the clusters with gscatter, and the means with scatter. Note that in order to have the picture facing in the right directions in a scatter plot you have to shift around the input coordinates. Alternatively you could define datax and datay correspondingly at the beginning.
And here's my output, run with the already processed figure you provided in your question:
I do believe you must had made a naive mistake in the plot and that's why you see just a straight line: You are plotting only the x values.
In my opinion, the second argument in the scatter command should be X(cluster1,2) or X(cluster2,2) depending on which scatter command is being used in the code.
The code can be made more simple:
%read file
I=im2double(imread('sil10340.pbm'));
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(I);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
[cInd, c] = kmeans([datax datay], K, 'EmptyAction','singleton',...
'maxiter',1000,'start','cluster');
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to
image
axis equal;
hold on;
scatter(c(:,2),-c(:,1),20,'ko'); %same funky coordinates
I don't think there is nay need for the looping as the c itself return a 10x2 double array which contains the position of the means