Identify and store outliers in MATLAB

Hello, I have spectral data collected over time. I want to store the outliers and their indices so that the user can see where the outliers are. I have searched for how to find outliers but can't seem to find a solution to my problem.
An outlier can be defined as a point more than 1.5 times the standard deviation from the mean, since this is the definition I've seen most often.
data = rand(1024,20); % spectral data over time

If you can upgrade, you can check out the new isoutlier and filloutliers functions in R2017a. Searching for outliers more than 1.5x the standard deviation would correspond to using the 'mean' method for finding the outliers, and specifying the 'ThresholdFactor' name-value pair to a value of 1.5. If you want a windowed approach, you can instead use the 'movmean' method and specify a window size.
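For example, a minimal sketch assuming R2017a or later (the window length of 50 samples in the last call is an arbitrary choice to adapt):

data = rand(1024, 20); % spectral data over time (placeholder from the question)
TF = isoutlier(data, 'mean', 'ThresholdFactor', 1.5); % logical mask, computed per column
[row, col] = find(TF); % indices of the outliers, to show the user
vals = data(TF); % the outlier values themselves
TFwin = isoutlier(data, 'movmean', 50, 'ThresholdFactor', 1.5); % windowed variant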

Related

Peak Detection Matlab

I'm trying to get all the large peak values of this signal:
As you can see, there is one large peak followed by one smaller peak, and I want to get each value of the largest peaks. I already tried [pks1,locs1] = findpeaks(y1,'MinPeakHeight',??); but I can't figure out what to put in place of the ??, given that the signal will not be the same every time (there will always be a large+smaller peak pattern, but the time intervals and amplitudes can change). I tried a lot of combinations of std(), mean() and max(), but none of them works properly.
Any ideas on how I can solve this problem?
You could try using the 'MinPeakDistance' option with a minimum distance between the two peaks slightly larger than the distance between the large peak and the following small peak. Note that without an x input, 'MinPeakDistance' is measured in samples, so pass your time vector if the distance is in seconds. For example:
[pks1,locs1] = findpeaks(y1,t,'MinPeakDistance',0.3); % t is your time vector
Edit:
If the time between peaks (and the following smaller ones) varies a lot, you'll probably have to do some post-processing: first find all the peaks, including the smaller second ones, then remove from your array of peaks every peak that is significantly lower than both of its neighbours, as sketched below.
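A minimal sketch of that post-processing, assuming the signal is in y1 and that "significantly lower" means below half of each neighbour (a ratio you would tune):

[pks, locs] = findpeaks(y1); % all peaks, large and small
keep = true(size(pks));
for k = 2:numel(pks)-1
    if pks(k) < 0.5*pks(k-1) && pks(k) < 0.5*pks(k+1)
        keep(k) = false; % significantly lower than both neighbours
    end
end
pks1 = pks(keep); % the large peaks
locs1 = locs(keep);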
You could also try fiddling with 'MinPeakProminence'.
Generally these problems require a lot of calibration for the final few percent of the algorithm's accuracy, and there's no universal cure.
I also recommend having a look at all the other options in the documentation.

Automatically truncating a curve to discard outliers in matlab

I am generating some data whose plots are shown below.
In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with many possible approaches; usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
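A rough sketch of that idea, assuming the curve samples are in a vector y and that a 3-sigma threshold on the slope (an assumption to tune empirically) separates the steep tails from the middle section:

dy = diff(y(:)); % finite-difference derivative of the curve
thresh = mean(abs(dy)) + 3*std(abs(dy)); % assumed threshold on steepness
ok = abs(dy) <= thresh; % true where the curve is well-behaved
% keep the longest run of well-behaved samples (the middle section)
d = [0; ok; 0];
runStarts = find(diff(d) == 1);
runEnds = find(diff(d) == -1) - 1;
[~, k] = max(runEnds - runStarts);
yTrunc = y(runStarts(k) : runEnds(k) + 1); % +1: dy is one element shorter than y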
If your pattern is not so easy to define but you are expecting a linear trend you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
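A hedged sketch of that second approach, using base MATLAB's polyfit in place of fit (which needs the Curve Fitting Toolbox); the linear model and the 3*std residual cutoff are assumptions:

x = (1:numel(y))';
p = polyfit(x, y(:), 1); % fit a linear trend to the whole curve
resid = y(:) - polyval(p, x); % error of each point from the fit
keep = abs(resid) < 3*std(resid); % discard points far from the trend
yTrunc = y(keep);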
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).

Select data based on a distribution in matlab

I have a set of data in a vector. If I were to plot a histogram of the data, I could see (by clever inspection) that the data is distributed as the sum of three distributions:
One normal distribution centered around x_1 with variance s_1;
One normal distribution centered around x_2 with variance s_2;
One lognormal distribution.
My data is obviously a subset of the 'real' data.
What I would like to do is take a random subset away from my data while ensuring that the resulting subset is a reasonably representative sample of the original data.
I would like to do this as easily as possible in MATLAB, but I am new to both statistics and MATLAB and am unsure where to start.
Thank you for any help :)
If you can identify each of the 3 distributions (in the sense that you can estimate their parameters), one approach could be to select a random subset of your data, estimate the parameters of each distribution on that subset, and check whether they are close enough (according to your own definition of "close") to the parameters of the original distributions. You should repeat this process several times and look at the average difference for a given random subset size.
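A minimal sketch of that procedure, assuming the data vector is named v and simplifying "parameters" to the overall mean and standard deviation (estimating all three component distributions properly would need something like fitgmdist):

nTrials = 100; % number of random subsets to try
frac = 0.2; % assumed subset size; vary this and compare
n = numel(v);
diffs = zeros(nTrials, 1);
for t = 1:nTrials
    sub = v(randperm(n, round(frac*n))); % random subset without replacement
    % crude "closeness": relative error in mean and standard deviation
    diffs(t) = abs(mean(sub) - mean(v))/abs(mean(v)) + abs(std(sub) - std(v))/std(v);
end
avgDiff = mean(diffs) % average difference for this subset size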

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, for example, salary and age. There are also some binary types (e.g. male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. This has the benefit of allowing you to continue with your (probably matrix-based) implementation on this kind of data, but a much simpler way - and one appropriate for most distance-based methods - is to just use a modified distance function.
There is an infinite number of such combinations. You will need to experiment to find which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied, though it may make sense to move this normalization into the distance function as well), plus a distance on the other attributes, scaled appropriately.
In most real application domains of distance-based algorithms, this is the most difficult part: optimizing your domain-specific distance function. You can see it as part of preprocessing: defining similarity.
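As an illustration, a hedged sketch of such a combined distance function; the name mixedDistance, the weight w, and the mismatch count on the categoricals are all assumptions to adapt to your domain:

function d = mixedDistance(xNum, yNum, xCat, yCat, w)
% xNum, yNum: normalized numeric feature vectors
% xCat, yCat: categorical feature vectors of equal length
dNum = norm(xNum - yNum); % classic Euclidean metric on the numerics
dCat = sum(xCat ~= yCat); % simple mismatch count on the categoricals
d = dNum + w * dCat; % w scales the categorical part appropriately
end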
There is much more than just Euclidean distance: various set-theoretic measures may be much more appropriate in your case, for example the Tanimoto coefficient, Jaccard similarity, Dice's coefficient, and so on. Cosine similarity might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straightforward way to convert categorical data into numeric data is to use indicator vectors. See the reference I posted in my previous comment.
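For example, a small sketch using dummyvar from the Statistics and Machine Learning Toolbox; the bank names are hypothetical:

bank = categorical({'HSBC'; 'Barclays'; 'HSBC'; 'Lloyds'}); % hypothetical categorical field
B = dummyvar(bank); % 4x3 binary indicator matrix, one column per bank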
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

A Matlab histogram application

In my application, I have a number of data points, and each is associated with a number and a strength. I am trying to figure out how to sort these data points so that I can find the most frequent data point with the highest strength; the answer will be sort of like an average between these two.
I can use hist() to generate the histogram of the data points and find which number occurs most often. However, I'm having trouble thinking of an easy way to sort the data point strengths by number. (I figure I can just multiply the hist of the numbers with the hist of the strengths to find the best bin.) I don't think hist() can do this. Is there another way? Or am I limited to binning the data point strengths manually by going through each number bin?
I may be severely misinterpreting your problem, but why don't you use a 2D histogram routine (there are many in the FEX, such as this) and find the bin - corresponding to a range of numbers and a range of strengths - with the highest incidence of data points?
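If you are on R2015b or later, a minimal sketch with base MATLAB's histcounts2 instead of a FEX routine (numbers and strengths are assumed to be equal-length vectors of your data points' numbers and strengths):

[N, xEdges, yEdges] = histcounts2(numbers, strengths);
[~, idx] = max(N(:)); % bin with the highest incidence of data points
[iNum, iStr] = ind2sub(size(N), idx);
numberRange = xEdges(iNum : iNum+1) % number range of the best bin
strengthRange = yEdges(iStr : iStr+1) % strength range of the best bin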