Classifying data based on a training set - matlab

I have some data that needs classifying. I've tried to use the classify function described here.
My sample is a matrix that has 1 column and 382 rows.
My training is a matrix with 1 column and 2 rows.
Grouping is causing me the issues. I've written: grouping = [a,b]; where a is one category and b is another.
This gives me the error:
Undefined function or variable 'a'.
Error in discrimtrialab (line 89)
grouping = [a,b];
Further to this, how do I classify a group, ie. not just the exact value in training?
Here is my code:
a = -0.09306:0.0001:0.00476;
b = -0.02968:0.0001:0.01484;
%training = groups (odour index)
training = [-0.09306:0.00476; -0.02968:0.01484;];
%grouping variable
group = [a,b]
%classify
[class, err] = classify(sample, training, group, 'linear');
class(a)
(note - there is some processing above this, but it is irrelevant to the question)

From the documentation:
class = classify(sample,training,group) classifies each row of the
data in sample into one of the groups in training. (See Grouped Data.)
sample and training must be matrices with the same number of columns.
group is a grouping variable for training. Its unique values define
groups; each element defines the group to which the corresponding row
of training belongs.
That is, "group" must have the same number of rows as training. From the example in the help:
load fisheriris
SL = meas(51:end,1);
SW = meas(51:end,2);
group = species(51:end);
SL & SW are 100 x 1 matrices to be used for training (two different measurements made on each of 100 samples). group is a 100 x 1 cell array of strings indicating which species each of those measurements belongs to. It could also be a char array or simply a list of numbers (1,2,3) where each number refers to a different group, but it must have 100 rows.
e.g. if your training matrix was a 100 x 1 matrix of doubles, where the first 50 were values that belonged to 'a' and the second 50 were values that belonged to 'b' your group matrix could be:
group = [repmat('a',50,1);repmat('b',50,1)];
However, if all your "groups" are just non-overlapping ranges as stated here in the comments:
What I want classify to do is work out whether or not each number in
"sample" is type A, ie, in the range -0.04416 +/- 0.0163, or type B,
with the range -0.00914 +/- 0.00742
then you don't really need classify. To extract the values from sample which are equal to a value plus or minus some tolerance:
sample1 = sample(abs(sample-value)<tol);
ETA after latest comment: "group" can be a numeric vector, so if you have a training data set which you need to group based on the ranges of some variable, then something like (this code is unchecked but the basic principle should be sound):
%presume "data" is our training data (381 x 3) and "sample" (n x 2) is the data we want to classify
group = zeros(length(data),1); %empty matrix
% first column is variable for grouping, second + third are data equivalent to the entries in "sample".
training = data(:,2:3);
% find where data(:,1) meets whatever our requirements are and label groups with numbers
group(data(:,1)<3)=1; % group "1" is wherever first column is below 3
group(data(:,1)>7)=2; % group "2" is wherever first column is above 7
group(group==0)=NaN; % set any remaining data to NaN
%now we classify "sample" based on "data" which has been split into "training" and "group" variables
class = classify(sample, training, group);

Related

How to slice and store a set of curves in multiple variables in matlab?

I have a table with 2 columns (x and y value) and 100 rows (the x values repeat in a certain interval).
Therfore I would like to perform the following task for changeable table sizes (only changes in the row size!):
I want to determine the amount of repetitions of the x values and save this information as a variable named n. Here the amount of repetition is 5 (each x value occurs 5 times in total).
I want to know the range of the x values from repetition circle and save this information as R = height(range); Here the x range is [0,20]
With the above informaton I would like the create smaller tables where only one repetition of the x values is present
How could I implement this in matlab?
Stay safe and healthy,
Greta
This approach converts the Table to an array/matrix using the table2array() function for further processing. To find the repeated pattern in the x-values the unique() function is used to retrieve the vector that is repeated multiple times. The range of the values can be calculated by using the min() and max() functions and concatenating the values in a 2 element array. The assignin() function can then be used to create a set of smaller tables that separate the y-values according to the x-value repetitions.
Table Used to Test Script:
x = repmat((1:20).',[5 1]);
y = rand(100,1);
Table = array2table([x y]);
Script:
Array = table2array(Table);
Unique_X_Values = unique(Array(:,1));
Number_Of_Repetitions = length(Array)/length(Unique_X_Values);
Range = [min(Array(:,1)) max(Array(:,1))];
Y_Reshaped = reshape(Array(:,2),[numel(Array(:,2))/Number_Of_Repetitions Number_Of_Repetitions]);
for Column_Index = 1: Number_Of_Repetitions
Variable_Name = ['Small_Tables_' num2str(Column_Index)];
assignin('base',Variable_Name,array2table(Y_Reshaped(:,Column_Index)));
eval(Variable_Name);
end
fprintf("Number of repetitions: %d\n",Number_Of_Repetitions);
fprintf("Range: [%d,%d]\n",Range(1),Range(2));

How to read the indices of my max value in any given array using a created function

I have created a function that should be able to read in any mxn matrix and gives me the maximum value of the entire matrix (not just per column) and what its indices are.
function [ BIGGEST ] = singlemax( x )
[Largest_values row]=max(x)
[biggest_number column] = max(max(x))
end
This function gives me all the information I need, however it is not very clean as it gets messy the larger the matrix.
The real problem area is printing out the row in which the maxima is located.
Largest_values =
0.7750 0.9122 0.7672 0.9500 0.6871
row =
3 2 3 2 2
biggest_number =
0.9500
column =
4
This is my print out given a random matrix as an input.With the function I have created I cannot read the indices of my max value in any given array using a created function. If I could somehow relate the maximas from each column and there corresponding row (such as making the results a matrix with the column max on top and the row index on bottom, all within the same respective columns )I could display the row of the absolute maximum.
Here's one approach:
value = max(x(:));
[rowIndex,columnIndex] = ind2sub(size(x),find(x==value));
Read the ind2sub documentation for more details.
Edited to modify so that it finds indices of all occurrences of the maximum value.

Matlab matching first column of a row as index and then averaging all columns in that row

I need help with taking the following data which is organized in a large matrix and averaging all of the values that have a matching ID (index) and outputting another matrix with just the ID and the averaged value that trail it.
File with data format:
(This is the StarData variable)
ID>>>>Values
002141865 3.867144e-03 742.000000 0.001121 16.155089 6.297494 0.001677
002141865 5.429278e-03 1940.000000 0.000477 16.583748 11.945627 0.001622
002141865 4.360715e-03 1897.000000 0.000667 16.863406 13.438383 0.001460
002141865 3.972467e-03 2127.000000 0.000459 16.103060 21.966853 0.001196
002141865 8.542932e-03 2094.000000 0.000421 17.452007 18.067214 0.002490
Do not be mislead by the examples I posted, that first number is repeated for about 15 lines then the ID changes and that goes for an entire set of different ID's, then they are repeated as a whole group again, think first block of code = [1 2 3; 1 5 9; 2 5 7; 2 4 6] then the code repeats with different values for the columns except for the index. The main difference is the values trailing the ID which I need to average out in matlab and output a clean matrix with only one of each ID fully averaged for all occurrences of that ID.
Thanks for any help given.
A modification of this answer does the job, as follows:
[value_sort ind_sort] = sort(StarData(:,1));
[~, ii, jj] = unique(value_sort);
n = diff([0; ii]);
averages = NaN(length(n),size(StarData,2)); % preallocate
averages(:,1) = StarData(ii,1);
for col = 2:size(StarData,2)
averages(:,col) = accumarray(jj,StarData(ind_sort,col))./n;
end
The result is in variable averages. Its first column contains the values used as indices, and each subsequent column contains the average for that column according to the index value.
Compatibility issues for Matlab 2013a onwards:
The function unique has changed in Matlab 2013a. For that version onwards, add 'legacy' flag to unique, i.e. replace second line by
[~, ii, jj] = unique(value_sort,'legacy')

N-Dimensional Histogram Counts

I am currently trying to code up a function to assign probabilities to a collection of vectors using a histogram count. This is essentially a counting exercise, but requires some finesse to be able to achieve efficiently. I will illustrate with an example:
Say that I have a matrix X = [x1, x2....xM] with N rows and M columns. Here, X represents a collection of M, N-dimensional vectors. IN other words, each of the columns of X is an N-dimensional vector.
As an example, we can generate such an X for M = 10000 vectors and N = 5 dimensions using:
X = randint(5,10000)
This will produce a 5 x 10000 matrix of 0s and 1s, where each column is represents a 5 dimensional vector of 1s and 0s.
I would like to assign a probability to each of these vectors through a basic histogram count. The steps are simple: first find the unique columns of X; second, count the number of times each unique column occurs. The probability of a particular occurrence is then the #of times this column was in X / total number of columns in X.
Returning to the example above, I can do the first step using the unique function in MATLAB as follows:
UniqueXs = unique(X','rows')'
The code above will return UniqueXs, a matrix with N rows that only contains the unique columns of X. Note that the transposes are due to weird MATLAB input requirements.
However, I am unable to find a good way to count the number of times each of the columns in UniqueX is in X. So I'm wondering if anyone has any suggestions?
Broadly speaking, I can think of two ways of achieving the counting step. The first way would be to use the find function, though I think this may be slow since find is an elementwise operation. The second way would be to call unique recursively as it can also provide the index of one of the unique columns in X. This should allow us to remove that column from X and redo unique on the resulting X and keep counting.
Ideally, I think that unique might already be doing some counting so the most efficient way would probably be to work without the built-in functions.
Here are two solutions, one assumes all values are either 0's or 1's (just like the example in your description), the other does not. Both codes should be very fast (more so the one with binary values), even on large data.
1) only zeros and ones
%# random vectors of 0's and 1's
x = randi([0 1], [5 10000]); %# RANDINT is deprecated, use RANDI instead
%# convert each column to a binary string
str = num2str(x', repmat('%d',[1 size(x,1)])); %'
%# convert binary representation to decimal number
num = (str-'0') * (2.^(size(s,2)-1:-1:0))'; %'# num = bin2dec(str);
%# count frequency of how many each number occurs
count = accumarray(num+1,1); %# num+1 since it starts at zero
%# assign probability based on count
prob = count(num+1)./sum(count);
2) any positive integer
%# random vectors with values 0:MAX_NUM
x = randi([0 999], [5 10000]);
%# format vectors as strings (zero-filled to a constant length)
nDigits = ceil(log10( max(x(:)) ));
frmt = repmat(['%0' num2str(nDigits) 'd'], [1 size(x,1)]);
str = cellstr(num2str(x',frmt)); %'
%# find unique strings, and convert them to group indices
[G,GN] = grp2idx(str);
%# count frequency of occurrence
count = accumarray(G,1);
%# assign probability based on count
prob = count(G)./sum(count);
Now we can see for example how many times each "unique vector" occurred:
>> table = sortrows([GN num2cell(count)])
table =
'000064850843749' [1] # original vector is: [0 64 850 843 749]
'000130170550598' [1] # and so on..
'000181606710020' [1]
'000220492735249' [1]
'000275871573376' [1]
'000525617682120' [1]
'000572482660558' [1]
'000601910301952' [1]
...
Note that in my example with random data, the vector space becomes very sparse (as you increase the maximum possible value), thus I wouldn't be surprised if all counts were equal to 1...

How can I generate a binary matrix with specific patterns?

I have a binary matrix of size m-by-n. Given below is a sample binary matrix (the real matrix is much larger):
1010001
1011011
1111000
0100100
Given p = m*n, I have 2^p possible matrix configurations. I would like to get some patterns which satisfy certain rules. For example:
I want not less than k cells in the jth column as zero
I want the sum of cell values of the ith row greater than a given number Ai
I want at least g cells in a column continuously as one
etc....
How can I get such patterns satisfying these constraints strictly without sequentially checking all the 2^p combinations?
In my case, p can be a number like 2400, giving approximately 2.96476e+722 possible combinations.
Instead of iterating over all 2^p combinations, one way you could generate such binary matrices is by performing repeated row- and column-wise operations based on the given constraints you have. As an example, I'll post some code that will generate a matrix based on the three constraints you have listed above:
A minimum number of zeroes per column
A minimum sum for each row
A minimum sequential length of ones per column
Initializations:
First start by initializing a few parameters:
nRows = 10; % Row size of matrix
nColumns = 10; % Column size of matrix
minZeroes = 5; % Constraint 1 (for columns)
minRowSum = 5; % Constraint 2 (for rows)
minLengthOnes = 3; % Constraint 3 (for columns)
Helper functions:
Next, create a couple of functions for generating column vectors that match constraints 1 and 3 from above:
function vector = make_column
vector = [false(minZeroes,1); true(nRows-minZeroes,1)]; % Create vector
[vector,maxLength] = randomize_column(vector); % Randomize order
while maxLength < minLengthOnes, % Loop while constraint 3 is not met
[vector,maxLength] = randomize_column(vector); % Randomize order
end
end
function [vector,maxLength] = randomize_column(vector)
vector = vector(randperm(nRows)); % Randomize order
edges = diff([false; vector; false]); % Find rising and falling edges
maxLength = max(find(edges == -1)-find(edges == 1)); % Find longest
% sequence of ones
end
The function make_column will first create a logical column vector with the minimum number of 0 elements and the remaining elements set to 1 (using the functions TRUE and FALSE). This vector will undergo random reordering of its elements until it contains a sequence of ones greater than or equal to the desired minimum length of ones. This is done using the randomize_column function. The vector is randomly reordered using the RANDPERM function to generate a random index order. The edges where the sequence switches between 0 and 1 are detected using the DIFF function. The indices of the edges are then used to find the length of the longest sequence of ones (using FIND and MAX).
Generate matrix columns:
With the above two functions we can now generate an initial binary matrix that will at least satisfy constraints 1 and 3:
binMat = false(nRows,nColumns); % Initialize matrix
for iColumn = 1:nColumns,
binMat(:,iColumn) = make_column; % Create each column
end
Satisfy the row sum constraint:
Of course, now we have to ensure that constraint 2 is satisfied. We can sum across each row using the SUM function:
rowSum = sum(binMat,2);
If any elements of rowSum are less than the minimum row sum we want, we will have to adjust some column values to compensate. There are a number of different ways you could go about modifying column values. I'll give one example here:
while any(rowSum < minRowSum), % Loop while constraint 2 is not met
[minValue,rowIndex] = min(rowSum); % Find row with lowest sum
zeroIndex = find(~binMat(rowIndex,:)); % Find zeroes in that row
randIndex = round(1+rand.*(numel(zeroIndex)-1));
columnIndex = zeroIndex(randIndex); % Choose a zero at random
column = binMat(:,columnIndex);
while ~column(rowIndex), % Loop until zero changes to one
column = make_column; % Make new column vector
end
binMat(:,columnIndex) = column; % Update binary matrix
rowSum = sum(binMat,2); % Update row sum vector
end
This code will loop until all the row sums are greater than or equal to the minimum sum we want. First, the index of the row with the smallest sum (rowIndex) is found using MIN. Next, the indices of the zeroes in that row are found and one of them is randomly chosen as the index of a column to modify (columnIndex). Using make_column, a new column vector is continuously generated until the 0 in the given row becomes a 1. That column in the binary matrix is then updated and the new row sum is computed.
Summary:
For a relatively small 10-by-10 binary matrix, and the given constraints, the above code usually completes in no more than a few seconds. With more constraints, things will of course get more complicated. Depending on how you choose your constraints, there may be no possible solution (for example, setting minRowSum to 6 will cause the above code to never converge to a solution).
Hopefully this will give you a starting point to begin generating the sorts of matrices you want using vectorized operations.
If you have enough constraints, exploring all possible matrices could be attempted:
// Explore all possibilities starting at POSITION (0..P-1)
explore(int position)
{
// Check if one or more constraints can't be verified anymore with
// all values currently set.
invalid = ...;
if (invalid) return;
// Do we have a solution?
if (position >= p)
{
// print the matrix
return;
}
// Set one more value and continue exploring
for (int value=0;value<2;value++)
{ matrix[position] = value; explore(position+1); }
}
If the number of constraints is low, this approach will take too much time.
In this case, for the kind of constraints you gave as examples, simulated annealing may be a good solution.
You must design an energy function, high when all constraints are met. That would be something like that:
Generate a random matrix
Compute energy E0
Change one cell
Compute energy E1
If E1>E0, or E0-E1 is smaller than f(temperature), keep it, otherwise reverse the move
Update temperature, and goto 2 unless stop criterion is reached
If all the contraints relate to columns (as is the case in the question), then you can find all possible valid columns and check that each column in the matrix is in this set. (i.e. when you consider each column independently, you reduce the number of possibilities a lot.)
I might be way off here, but I remember doing something similar once with some genetic algorithm.
Check out pseudo boolean constraints (also called 0-1 integer programming).
This is virtually impossible if your constraint set is complex enough. You might try to use a stochastic optimizer, like simulated annealing, particle swarm optimization, or a genetic algorithm to find a feasible solution.
However, if you can generate one (non-random) solution to such a problem, then often you can generate others by random permutations made to the existing solution.