Matlab calculate outliers from data and time they occur - matlab

In Matlab I have a large matrix A. The first column of the matrix contains a time in seconds. The second to 13th column contain results from a calculation. For each column (except the first) I calculated the whisker by:
quantile(A,[.75])-1.5*(quantile(A,[.75])-quantile(A,[.25]))
Now I would like to know how many outliers (= values below the whisker) there are in each column, and when they occur. This will let me calculate how the outliers are spread over time.
I would prefer a loop that gives me 12 matrices, each containing two columns. The second column should contain the values of the outliers (= values of cells below the whisker) without any zeros in between, and the first column should contain the time at which each outlier occurs (chronologically).
How can I create this?
regards,
Vincent

Let
A =
0.6260 0.7690 0.1209 0.5523 0.0495
0.6609 0.5814 0.8627 0.6299 0.4896
0.7298 0.9283 0.4843 0.0320 0.1925
0.8908 0.5801 0.8449 0.6147 0.1231
0.9823 0.0170 0.2094 0.3624 0.2055
For the second column:
B = quantile(A(:,2),[.75])-1.5*(quantile(A(:,2),[.75])-quantile(A(:,2),[.25]))
Then,
index = find(A(:,2) < B)
value_outlier = A(index,2)
outlier_time = A(index,1)

No need to loop: use matrix operations and logical indexing instead.
Assuming you have a matrix A and outlier threshold thr is a 1x12 vector with the threshold for each column:
vals = A(:,2:13);
outliers = bsxfun(@lt, vals, thr); % @lt is the 'less than' function handle
% outliers is an Nx12 logical matrix with true (1) where the value < threshold
% and false (0) otherwise.
To get the time when these outliers occurred (for a given column, let's say column 2 of the data portion of the original matrix):
t = A(outliers(:,2), 1);
% ^____________ logical index of rows where outliers occurred in that column
You can also easily get the number of outliers in each column (or row) by summing:
num_outliers = sum(outliers,1);
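To get the two-column (time, value) matrices the question asks for, the logical matrix can be reused column by column. The following is only a sketch: the small matrix A, the threshold vector thr and the cell array name outlierTables are made up for illustration, and a short loop is used only to collect the results.

```matlab
% Hypothetical data: column 1 is time, columns 2..3 are results
A = [1  0.9   0.2
     2  0.1   0.8
     3  0.7   0.05
     4  0.05  0.6];
thr = [0.5 0.3];                     % assumed per-column thresholds

vals     = A(:, 2:end);
outliers = bsxfun(@lt, vals, thr);   % true where value < threshold

nCols = size(vals, 2);
outlierTables = cell(1, nCols);      % one [time, value] matrix per column
for k = 1:nCols
    rows = outliers(:, k);           % rows with an outlier in column k
    outlierTables{k} = [A(rows, 1), vals(rows, k)];
end
num_outliers = sum(outliers, 1);     % number of outliers per column
```

Because the rows are selected in their original order, the times in each matrix stay chronological as long as A is sorted by time.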

Related

Replace element in matrix in row with 1 and 0 matlab

I have an n x n matrix in MATLAB. In every row, if the value in each element is higher than a certain threshold, replace that element with a 1. Else, with a 0.
NOTICE: In every row, we compare the value of element with different threshold.
For element-wise comparison of two matrices of the same size, use the ">" operator e.g. result = data > threshold (this will return ones and zeroes depending on whether the condition is satisfied or not).
Suppose you have your data in a matrix called data and your thresholds in a column vector called thresholds (i.e. length(thresholds) == size(data, 1)). You can create an array the same size as the data matrix using repmat: thresholdsMatrix = repmat(thresholds, 1, size(data, 2)).
You can then compare this to your data:
result = data > repmat(thresholds, 1, size(data, 2))
This should give you the result you want.
[Note that in MATLAB R2016b and later you can also compare the vector to the matrix directly without repmat, i.e. result = data > thresholds (implicit expansion), but IMO this can be unclear and may lead to unexpected behaviour]
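As a small check of the row-wise comparison, here is a sketch with made-up values for data and thresholds. It also shows bsxfun with @gt as an explicit alternative to repmat; the names resultRepmat and resultBsxfun are only for illustration.

```matlab
data       = [1 5 3
              4 2 6
              7 8 9];
thresholds = [2; 3; 8];              % one threshold per row

% Expand the thresholds to the size of data, then compare element-wise
resultRepmat = data > repmat(thresholds, 1, size(data, 2));

% Same comparison without building the expanded matrix
resultBsxfun = bsxfun(@gt, data, thresholds);
```

Both results are logical matrices; wrap them in double(...) if numeric ones and zeros are needed.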

Partial sum of a vector

For a vector v (e.g. v=[1,2,3,4,5]), and two index vectors (e.g. a=[1,1,1,2,3] and b=[3,4,5,5,5], with all a(i)<b(i)), I would like to construct w with w(i)=sum(v(a(i):b(i))). Currently I compute it with a loop:
w = zeros(length(a),1);
for i = 1:length(a)
w(i)=sum(v(a(i):b(i)));
end
It is slow when length(a) is large. Can I compute w without the for loop?
Yes! The nth element of cumsum(v) is the sum of the first n elements in v, so just take that and subtract the sum of the elements that you don't want to include:
v=[1,2,3,4,5]
a=[1,1,1,2,3]
b=[3,4,5,5,5]
C=cumsum(v)
C(b)-C(a)+v(a)
%// or alternatively
C=cumsum([0 v])
C(b+1)-C(a)
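A quick way to convince yourself that the cumsum version matches the loop is to run both on the example data. This is just a sketch; wLoop and wFast are illustrative names.

```matlab
v = [1 2 3 4 5];
a = [1 1 1 2 3];
b = [3 4 5 5 5];

% Reference: explicit loop over the index pairs
wLoop = zeros(1, numel(a));
for i = 1:numel(a)
    wLoop(i) = sum(v(a(i):b(i)));
end

% Vectorized: prefix sums with a leading zero, so C(k+1) = sum(v(1:k))
C     = cumsum([0 v]);
wFast = C(b+1) - C(a);
```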
The following code works, but it is of course much less readable:
% assume v is a column vector
units = 1:length(v); units = units'; %units is a column vector
units_matrix = repmat(units, [1 length(a)]);
a_matrix = repmat(a, [length(v) 1]); % assuming a is a row vector
b_matrix = repmat(b, [length(v) 1]);
weights = (units_matrix>=a_matrix) & (units_matrix<=b_matrix);
v_matrix = repmat(v, [1 length(a)]);
w = sum(v_matrix.*weights);
Explanation:
v_matrix contains copies of v. The summation is done along the
columns, so the other required information must be prepared in
vectorized form. units_matrix contains the indexes into v along the
columns; all of its columns are identical. Each column of a_matrix
and b_matrix contains the start and end index for one partial
sum; all of their rows are identical. weights is a logical
matrix where, for each column, the indexes in units_matrix that lie
between the corresponding a and b are 1 (true), and the rest are 0.
The element-wise multiplication thus keeps the "right" values,
while every element outside the range (again, for each column) is
multiplied by zero. w is then the result of the sum function, i.e.
a row vector (every column of the "filtered" matrix is summed).

delete certain columns of matrix when number of zero elements exceeds threshold avoiding loop

I have a quite big (107 x n) matrix X. Within these n columns, each three columns belong to each other. So, the first three columns of matrix X build a block, then columns 4,5,6 and so on.
Within each block, the first 100 row elements of the first column are the important ones: X(1:100,1:3:end). Whenever the number of zeros or NaNs in this first column is greater than or equal to 20, the whole block should be deleted.
Is there a way to do this without a loop?
Thanks for any advice!
Assuming the number of columns of the input to be a multiple of 3, there could be two approaches here.
Approach #1
%// parameters
rl = 100; %// row limit
cl = 20; %// count limit
X1 = X(1:rl,1:3:end) %// Important elements from input
match_mat = isnan(X1) | X1==0 %// binary array of matches
match_blk_id = find(sum(match_mat)>=cl) %// blocks that satisfy requirements
match_colstart = (match_blk_id-1).*3+1 %// start column indices that satisfy
all_col_ind = bsxfun(@plus,match_colstart,[0:2]') %//'columns indices to be removed
X(:,all_col_ind)=[] %// final output after removing to be removed columns
Or if you prefer "compact" codes -
X1 = X(1:rl,1:3:end);
X(:,bsxfun(@plus,(find(sum(isnan(X1) | X1==0)>=cl)-1).*3+1,[0:2]'))=[];
Approach #2
X1 = X(1:rl,1:3:end)
match_mat = isnan(X1) | X1==0 %// binary array of matches
X(:,repmat(sum(match_mat)>=cl,[3 1]))=[] %// Find matching blocks, replicate to
%// next two columns and remove them from X
Note: If the number of columns of X is not a multiple of 3, pad X before using the codes - X = [X zeros(size(X,1) ,3 - mod(size(X,2),3))].
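To see the block removal in action, here is a sketch of Approach #2 on a tiny made-up matrix with two blocks. The values rl = 4 and cl = 2 are shrunk for illustration only, and badBlock and colMask are hypothetical names.

```matlab
rl = 4;   % row limit (100 in the question)
cl = 2;   % count limit (20 in the question)

X = [0   1 1   5 1 1
     NaN 1 1   6 1 1
     3   1 1   0 1 1
     4   1 1   7 1 1
     0   1 1   0 1 1
     0   1 1   0 1 1];

X1       = X(1:rl, 1:3:end);                     % first column of every block
badBlock = sum(isnan(X1) | X1 == 0, 1) >= cl;    % one flag per block
colMask  = repmat(badBlock, [3 1]);              % flag all 3 columns of a block
X(:, colMask(:)') = [];                          % drop the flagged blocks
```

Only the first rl rows are counted, so the zeros in rows 5 and 6 do not affect the decision: the first block is removed (one zero and one NaN in its first four rows) and the second block is kept.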

N-Dimensional Histogram Counts

I am currently trying to code up a function to assign probabilities to a collection of vectors using a histogram count. This is essentially a counting exercise, but requires some finesse to be able to achieve efficiently. I will illustrate with an example:
Say that I have a matrix X = [x1, x2....xM] with N rows and M columns. Here, X represents a collection of M, N-dimensional vectors. In other words, each of the columns of X is an N-dimensional vector.
As an example, we can generate such an X for M = 10000 vectors and N = 5 dimensions using:
X = randint(5,10000)
This will produce a 5 x 10000 matrix of 0s and 1s, where each column represents a 5-dimensional vector of 1s and 0s.
I would like to assign a probability to each of these vectors through a basic histogram count. The steps are simple: first find the unique columns of X; second, count the number of times each unique column occurs. The probability of a particular occurrence is then (number of times this column occurs in X) / (total number of columns in X).
Returning to the example above, I can do the first step using the unique function in MATLAB as follows:
UniqueXs = unique(X','rows')'
The code above will return UniqueXs, a matrix with N rows that only contains the unique columns of X. Note that the transposes are due to weird MATLAB input requirements.
However, I am unable to find a good way to count the number of times each of the columns in UniqueXs occurs in X. So I'm wondering if anyone has any suggestions?
Broadly speaking, I can think of two ways of achieving the counting step. The first way would be to use the find function, though I think this may be slow since find is an elementwise operation. The second way would be to call unique recursively as it can also provide the index of one of the unique columns in X. This should allow us to remove that column from X and redo unique on the resulting X and keep counting.
Ideally, I think that unique might already be doing some counting so the most efficient way would probably be to work without the built-in functions.
Here are two solutions, one assumes all values are either 0's or 1's (just like the example in your description), the other does not. Both codes should be very fast (more so the one with binary values), even on large data.
1) only zeros and ones
%# random vectors of 0's and 1's
x = randi([0 1], [5 10000]); %# RANDINT is deprecated, use RANDI instead
%# convert each column to a binary string
str = num2str(x', repmat('%d',[1 size(x,1)])); %'
%# convert binary representation to decimal number
num = (str-'0') * (2.^(size(str,2)-1:-1:0))'; %'# num = bin2dec(str);
%# count how many times each number occurs
count = accumarray(num+1,1); %# num+1 since it starts at zero
%# assign probability based on count
prob = count(num+1)./sum(count);
2) any positive integer
%# random vectors with values 0:MAX_NUM
x = randi([0 999], [5 10000]);
%# format vectors as strings (zero-filled to a constant length)
nDigits = floor(log10( max(x(:)) )) + 1; %# also correct when the maximum is an exact power of 10
frmt = repmat(['%0' num2str(nDigits) 'd'], [1 size(x,1)]);
str = cellstr(num2str(x',frmt)); %'
%# find unique strings, and convert them to group indices
[G,GN] = grp2idx(str);
%# count frequency of occurrence
count = accumarray(G,1);
%# assign probability based on count
prob = count(G)./sum(count);
Now we can see for example how many times each "unique vector" occurred:
>> table = sortrows([GN num2cell(count)])
table =
'000064850843749' [1] # original vector is: [0 64 850 843 749]
'000130170550598' [1] # and so on..
'000181606710020' [1]
'000220492735249' [1]
'000275871573376' [1]
'000525617682120' [1]
'000572482660558' [1]
'000601910301952' [1]
...
Note that in my example with random data, the vector space becomes very sparse (as you increase the maximum possible value), thus I wouldn't be surprised if all counts were equal to 1...
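For comparison, the counting step can also be sketched without string conversion or the Statistics Toolbox, by feeding the third output of unique into accumarray. This is a different technique from the two solutions above, and the tiny matrix X is made up for illustration.

```matlab
% Hypothetical data: 5 column vectors of dimension 2
X = [0 1 0 1 0
     1 1 1 0 1];

% Rows of U are the unique columns of X; idx maps each column to its row in U
[U, ~, idx] = unique(X', 'rows');
idx = idx(:);

count = accumarray(idx, 1);          % occurrences of each unique column
prob  = count(idx) / size(X, 2);     % probability assigned to each column of X
```

Here U(k,:)' is the k-th unique vector and count(k) is how often it occurs, so [U count] gives the histogram in one table.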

How can I generate a binary matrix with specific patterns?

I have a binary matrix of size m-by-n. Given below is a sample binary matrix (the real matrix is much larger):
1010001
1011011
1111000
0100100
Given p = m*n, I have 2^p possible matrix configurations. I would like to get some patterns which satisfy certain rules. For example:
I want not less than k cells in the jth column as zero
I want the sum of cell values of the ith row greater than a given number Ai
I want at least g cells in a column continuously as one
etc....
How can I get such patterns satisfying these constraints strictly without sequentially checking all the 2^p combinations?
In my case, p can be a number like 2400, giving approximately 2.96476e+722 possible combinations.
Instead of iterating over all 2^p combinations, one way you could generate such binary matrices is by performing repeated row- and column-wise operations based on the given constraints you have. As an example, I'll post some code that will generate a matrix based on the three constraints you have listed above:
A minimum number of zeroes per column
A minimum sum for each row
A minimum sequential length of ones per column
Initializations:
First start by initializing a few parameters:
nRows = 10; % Row size of matrix
nColumns = 10; % Column size of matrix
minZeroes = 5; % Constraint 1 (for columns)
minRowSum = 5; % Constraint 2 (for rows)
minLengthOnes = 3; % Constraint 3 (for columns)
Helper functions:
Next, create a couple of functions for generating column vectors that match constraints 1 and 3 from above:
function vector = make_column
vector = [false(minZeroes,1); true(nRows-minZeroes,1)]; % Create vector
[vector,maxLength] = randomize_column(vector); % Randomize order
while maxLength < minLengthOnes, % Loop while constraint 3 is not met
[vector,maxLength] = randomize_column(vector); % Randomize order
end
end
function [vector,maxLength] = randomize_column(vector)
vector = vector(randperm(nRows)); % Randomize order
edges = diff([false; vector; false]); % Find rising and falling edges
maxLength = max(find(edges == -1)-find(edges == 1)); % Find longest
% sequence of ones
end
The function make_column will first create a logical column vector with the minimum number of 0 elements and the remaining elements set to 1 (using the functions TRUE and FALSE). This vector will undergo random reordering of its elements until it contains a sequence of ones greater than or equal to the desired minimum length of ones. This is done using the randomize_column function. The vector is randomly reordered using the RANDPERM function to generate a random index order. The edges where the sequence switches between 0 and 1 are detected using the DIFF function. The indices of the edges are then used to find the length of the longest sequence of ones (using FIND and MAX).
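The edge-detection step is easier to follow on a fixed vector. The sketch below applies the same diff/find idea to a made-up column; runStarts, runEnds and runLengths are illustrative names that do not appear in the functions above.

```matlab
vector = logical([0 1 1 0 1 1 1 0 0 1])';   % hypothetical column

% Pad with false on both ends so every run of ones has a rising (+1)
% and a falling (-1) edge
edges = diff([false; vector; false]);

runStarts  = find(edges == 1);       % index of the first one in each run
runEnds    = find(edges == -1);      % index just past the last one in each run
runLengths = runEnds - runStarts;    % length of every run of ones
maxLength  = max(runLengths);        % longest sequence of ones
```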
Generate matrix columns:
With the above two functions we can now generate an initial binary matrix that will at least satisfy constraints 1 and 3:
binMat = false(nRows,nColumns); % Initialize matrix
for iColumn = 1:nColumns,
binMat(:,iColumn) = make_column; % Create each column
end
Satisfy the row sum constraint:
Of course, now we have to ensure that constraint 2 is satisfied. We can sum across each row using the SUM function:
rowSum = sum(binMat,2);
If any elements of rowSum are less than the minimum row sum we want, we will have to adjust some column values to compensate. There are a number of different ways you could go about modifying column values. I'll give one example here:
while any(rowSum < minRowSum), % Loop while constraint 2 is not met
[minValue,rowIndex] = min(rowSum); % Find row with lowest sum
zeroIndex = find(~binMat(rowIndex,:)); % Find zeroes in that row
randIndex = randi(numel(zeroIndex)); % Uniform random position
columnIndex = zeroIndex(randIndex); % Choose a zero at random
column = binMat(:,columnIndex);
while ~column(rowIndex), % Loop until zero changes to one
column = make_column; % Make new column vector
end
binMat(:,columnIndex) = column; % Update binary matrix
rowSum = sum(binMat,2); % Update row sum vector
end
This code will loop until all the row sums are greater than or equal to the minimum sum we want. First, the index of the row with the smallest sum (rowIndex) is found using MIN. Next, the indices of the zeroes in that row are found and one of them is randomly chosen as the index of a column to modify (columnIndex). Using make_column, a new column vector is continuously generated until the 0 in the given row becomes a 1. That column in the binary matrix is then updated and the new row sum is computed.
Summary:
For a relatively small 10-by-10 binary matrix, and the given constraints, the above code usually completes in no more than a few seconds. With more constraints, things will of course get more complicated. Depending on how you choose your constraints, there may be no possible solution (for example, setting minRowSum to 6 will cause the above code to never converge to a solution).
Hopefully this will give you a starting point to begin generating the sorts of matrices you want using vectorized operations.
If you have enough constraints, exploring all possible matrices could be attempted:
// Explore all possibilities starting at POSITION (0..P-1)
explore(int position)
{
// Check if one or more constraints can't be verified anymore with
// all values currently set.
invalid = ...;
if (invalid) return;
// Do we have a solution?
if (position >= p)
{
// print the matrix
return;
}
// Set one more value and continue exploring
for (int value=0;value<2;value++)
{ matrix[position] = value; explore(position+1); }
}
If the number of constraints is low, this approach will take too much time.
In this case, for the kind of constraints you gave as examples, simulated annealing may be a good solution.
You must design an energy function that is high when all constraints are met. The procedure would be something like this:
1. Generate a random matrix
2. Compute energy E0
3. Change one cell
4. Compute energy E1
5. If E1>E0, or E0-E1 is smaller than f(temperature), keep the change; otherwise reverse the move
6. Update temperature, and go to step 2 unless the stop criterion is reached
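The annealing steps listed above could be sketched in MATLAB roughly as follows. Everything concrete here is an assumption made for illustration: the matrix size, the two constraints (a minimum number of zeros per column and a minimum row sum), the energy defined as the number of satisfied constraints, the cooling schedule, and the choice f(temperature) = -T*log(rand), which corresponds to the usual Metropolis acceptance rule.

```matlab
nRows = 4; nCols = 4;
minZeroes = 1; minRowSum = 2;        % hypothetical constraint values

% Energy = number of satisfied constraints (highest when all are met)
energy = @(M) sum(sum(~M, 1) >= minZeroes) + sum(sum(M, 2) >= minRowSum);
maxEnergy = nCols + nRows;

M  = rand(nRows, nCols) > 0.5;       % generate a random matrix
E0 = energy(M);                      % compute energy E0
T  = 1.0;                            % assumed starting temperature
for iter = 1:5000                    % assumed stop criterion
    if E0 == maxEnergy, break; end   % all constraints met
    k = randi(numel(M));
    M(k) = ~M(k);                    % change one cell
    E1 = energy(M);                  % compute energy E1
    if E1 > E0 || (E0 - E1) < -T * log(rand)
        E0 = E1;                     % keep the move
    else
        M(k) = ~M(k);                % reverse the move
    end
    T = max(0.01, 0.995 * T);        % update temperature
end
```

If the loop ends with E0 equal to maxEnergy, M satisfies every constraint; otherwise the run can be repeated from a new random start or with a slower cooling schedule.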
If all the constraints relate to columns (as is the case in the question), then you can find all possible valid columns and check that each column in the matrix is in this set. (i.e. when you consider each column independently, you reduce the number of possibilities a lot.)
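As a sketch of that idea, the valid columns can be enumerated directly when the number of rows is small (there are 2^nRows candidates, so this does not scale to long columns). The constraint values and the names allCols and validCols are assumptions for illustration.

```matlab
nRows = 4;
minZeroes = 1;                        % at least this many zeros per column
minLengthOnes = 2;                    % at least this many consecutive ones

% Every row of allCols is one candidate column, written as a row of 0/1
allCols = dec2bin(0:2^nRows-1) - '0';

valid = false(size(allCols, 1), 1);
for i = 1:size(allCols, 1)
    c = allCols(i, :);
    edges = diff([0 c 0]);                         % +1 = run start, -1 = run end
    runs  = find(edges == -1) - find(edges == 1);  % lengths of runs of ones
    valid(i) = sum(c == 0) >= minZeroes && ...
               ~isempty(runs) && max(runs) >= minLengthOnes;
end
validCols = allCols(valid, :);        % set of admissible columns
```

A matrix that only has column constraints can then be built by picking any row of validCols (transposed) for each column.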
I might be way off here, but I remember doing something similar once with some genetic algorithm.
Check out pseudo boolean constraints (also called 0-1 integer programming).
This is virtually impossible if your constraint set is complex enough. You might try to use a stochastic optimizer, like simulated annealing, particle swarm optimization, or a genetic algorithm to find a feasible solution.
However, if you can generate one (non-random) solution to such a problem, then often you can generate others by random permutations made to the existing solution.