Undefined F1 scores in multiclass classifications when model does not predict one class - matlab

I am trying to use F1 scores for model selection in multiclass classification.
I am calculating them class-wise and average over them:
(F1(class1)+F1(class2)+F1(class3))/3 = F1(total)
However, in some cases I get NaN values for the F1 score. Here is an example:
Let true_label = [1 1 1 2 2 2 3 3 3] and pred_label = [2 2 2 2 2 2 3 3 3].
Then the confusion matrix looks like:
C =[0 3 0; 0 3 0; 0 0 3]
Which means when I calculate the precision (to calculate the F1 score) for the first class, I obtain: 0/(0+0+0), which is not defined or NaN.
Firstly, am I making a mistake in calculating F1 scores or precisions here?
Secondly, how should I treat these cases in model selection? Ignore them or should I just set the F1 scores for this class to 0 (reducing the total F1 score for this model).
Any help would be greatly appreciated!

You need to avoid the division by zero for the precision in order to report meaningful results. You might find this answer useful, in which you explicitly report a poor outcome. Additionally, this implementation suggests an alternate way to differentiate in your reporting between good and poor outcomes.
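As a rough illustration of the "report a poor outcome" option, here is a minimal sketch that rebuilds the confusion matrix from the question and computes a macro-averaged F1. Setting an undefined per-class F1 to 0 is an assumed convention here (one common choice, not the only defensible one), and the variable names are hypothetical.

```matlab
% Example labels from the question
true_label = [1 1 1 2 2 2 3 3 3];
pred_label = [2 2 2 2 2 2 3 3 3];
nClasses = 3;

% Confusion matrix: rows are true classes, columns are predicted classes
C = accumarray([true_label(:) pred_label(:)], 1, [nClasses nClasses]);

tp = diag(C);              % correctly predicted samples per class
predicted = sum(C, 1)';    % column sums: how often each class was predicted
actual = sum(C, 2);        % row sums: how often each class truly occurs

precision = tp ./ predicted;   % 0/0 gives NaN for a class that is never predicted
recall = tp ./ actual;
f1 = 2 * precision .* recall ./ (precision + recall);

% Assumed convention: an undefined score counts as the worst outcome, 0
f1(isnan(f1)) = 0;
macroF1 = mean(f1);
```

With this convention a model that never predicts a class is penalized in the average instead of producing NaN, which keeps the scores of different models comparable during selection.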

Related

Vectorizing cell find and summing in Matlab

Would someone please show me how I can go about changing this code from an iterated to a vectorized implementation to speed up performance in Matlab? It takes approximately 8 seconds per i for i=1:20 on my machine currently.
classEachWordCount = zeros(nwords_train, nClasses);
for i=1:nClasses % (20 classes)
for j=1:nwords_train % (53975 words)
classEachWordCount(j,i) = sum(groupedXtrain{i}(groupedXtrain{i}(:,2)==j,3));
end
end
If context is helpful basically groupedXtrain is a cell of 20 matrices which represent different classes, where each class matrix has 3 columns: document#,word#,wordcount, and unequal numbers of rows (tens of thousands). I'm trying to figure out the count total of each word, for each class. So classEachWordCount should be a matrix of size 53975x20 where each row represents a different word and each column a different label. There's got to be a built-in function to assist in something like this, right?
for example groupedXtrain{1} might start off like:
doc#,word#,wordcount
1 1 3
1 2 1
1 4 3
1 5 1
1 8 2
2 2 1
2 5 4
2 6 2
As mentioned in the comments, you can use accumarray to sum the values in the third column for each unique value in the second column, one class at a time:
results = zeros(nwords_train, numel(groupedXtrain));
for k = 1:numel(groupedXtrain)
results(:,k) = accumarray(groupedXtrain{k}(:,2), groupedXtrain{k}(:,3), ...
[nwords_train 1], @sum);
end
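To see what accumarray does for a single class, here is a small sketch using the sample rows of groupedXtrain{1} shown in the question. The vocabulary size of 8 is an assumption made only to keep the example small.

```matlab
% First rows of groupedXtrain{1} from the question: doc#, word#, wordcount
X = [1 1 3
     1 2 1
     1 4 3
     1 5 1
     1 8 2
     2 2 1
     2 5 4
     2 6 2];
nwords = 8;   % assumed vocabulary size for this small example

% Sum the counts (column 3) grouped by word number (column 2).
% Words that never occur get a total of 0.
counts = accumarray(X(:,2), X(:,3), [nwords 1]);
```

Word 2 appears in both documents with a count of 1 each, so its total is 2; word 5 totals 1 + 4 = 5; words 3 and 7 never appear and stay at 0.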

Matlab matrix with fixed sum over rows

I'm trying to construct a matrix in Matlab where the sum over the rows is constant, but every combination is taken into account.
For example, take a NxM matrix where M is a fixed number and N will depend on K, the result to which all rows must sum.
For example, say K = 3 and M = 3, this will then give the matrix:
[1,1,1
2,1,0
2,0,1
1,2,0
1,0,2
0,2,1
0,1,2
3,0,0
0,3,0
0,0,3]
At the moment I do this by first creating the matrix of all possible combinations, without regard for the sum (for example, this also contains [2,2,1] and [3,3,3]), and then throwing away the rows whose sum is not equal to K.
However this is very memory inefficient (especially for larger K and M), but I couldn't think of a nice way to construct this matrix without first constructing the total matrix.
Is this possible in a nice way? Or should I use a whole bunch of for-loops?
Here is a very simple version using dynamic programming. The basic idea of dynamic programming is to build up a data structure (here S) which holds the intermediate results for smaller instances of the same problem.
M=3;
K=3;
%S(k+1,m) will hold the intermediate result for k and m
S=cell(K+1,M);
%Initialisation, for M=1 there is only a trivial solution using one number.
S(:,1)=num2cell(0:K);
for iM=2:M
for temporary_k=0:K
for new_element=0:temporary_k
h=S{temporary_k-new_element+1,iM-1};
h(:,end+1)=new_element;
S{temporary_k+1,iM}=[S{temporary_k+1,iM};h];
end
end
end
final_result=S{K+1,M}
This may be more efficient than your original approach, although it still generates (and then discards) more rows than needed.
Let M denote the number of columns, and S the desired sum. The problem can be interpreted as partitioning an interval of length S into M subintervals with non-negative integer lengths.
The idea is to generate not the subinterval lengths, but the subinterval edges; and from those compute the subinterval lengths. This can be done in the following steps:
1. The subinterval edges are M-1 integer values (not necessarily different) between 0 and S. These can be generated as a Cartesian product using for example this answer.
2. Sort the interval edges, and remove duplicate sets of edges. This is why the algorithm is not totally efficient: it produces duplicates. But hopefully the number of discarded tentative solutions will be less than in your original approach, because this does take into account the fixed sum.
3. Compute subinterval lengths from their edges. Each length is the difference between two consecutive edges, including a fixed initial edge at 0 and a final edge at S.
Code:
%// Data
S = 3; %// desired sum
M = 3; %// number of pieces
%// Step 1 (adapted from linked answer):
combs = cell(1,M-1);
[combs{end:-1:1}] = ndgrid(0:S);
combs = cat(M+1, combs{:});
combs = reshape(combs,[],M-1);
%// Step 2
combs = unique(sort(combs,2), 'rows');
%// Step 3
combs = [zeros(size(combs,1),1) combs repmat(S, size(combs,1),1)];
result = diff(combs,[],2);
The result is sorted in lexicographical order. In your example,
result =
0 0 3
0 1 2
0 2 1
0 3 0
1 0 2
1 1 1
1 2 0
2 0 1
2 1 0
3 0 0
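A further option, not taken from the answers above, is the "stars and bars" construction: choose the positions of M-1 separators among K+M-1 slots with nchoosek, and read the parts off as the gaps between separators. This is only a sketch under the assumption that K and M are small enough for nchoosek to enumerate all combinations; it produces each row exactly once, so nothing is discarded.

```matlab
K = 3;   % desired row sum
M = 3;   % number of columns

% Each row of bars holds the positions of the M-1 separators among K+M-1 slots
bars = nchoosek(1:K+M-1, M-1);
n = size(bars, 1);

% Add sentinel separators at 0 and K+M. The gap between two consecutive
% separators, minus 1, is the number of units assigned to that column.
edges = [zeros(n,1) bars repmat(K+M, n, 1)];
result = diff(edges, [], 2) - 1;
```

The number of rows is nchoosek(K+M-1, M-1), which is 10 for K = 3 and M = 3.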

Finding the most recent indices with different values

I am familiar with Matlab but am still having trouble with vectorized methods in my intuition, so I was wondering if anyone could demonstrate how they would manage this problem.
I have an array, for example A = [1 1 2 2 1 3 3 3 4 3 4 4 5].
I want to return an array B such that each element is the index of A's most 'recent' element with a different value than the previous ones.
So for our array A, B would equal [x x 2 2 4 5 5 5 8 9 10 10 12], where the x's can be any consistent value you like, because there is no previous index satisfying those characteristics.
I know how I would code it as a for-loop, and I bet the for-loop is probably faster, but can anyone vectorize this to faster than the for-loop?
Here's my for-loop:
prev=0;
B=zeros(length(A),1);
for i=2:length(A)
if A(i-1)~=A(i)
prev=i-1;
end
B(i)=prev;
end
Find the indices of the entries where the value changes:
ind = find(diff(A) ~= 0);
The values that should appear in B are therefore:
val = [0 ind];
Construct the diff of B: fill in the difference between the values that should appear at the right places:
Bd = zeros(size(A));
Bd(ind + 1) = diff(val);
Now use cumsum to construct B:
B = cumsum(Bd)
Not sure whether this results in a speed-up though.
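A quick way to check the vectorized steps is to run them next to the loop on the example array and compare. In this sketch both results are stored as row vectors, Bd is sized from A, and the names Bloop, Bvec and same are hypothetical.

```matlab
A = [1 1 2 2 1 3 3 3 4 3 4 4 5];

% Loop version from the question (stored as a row vector here)
Bloop = zeros(size(A));
prev = 0;
for i = 2:numel(A)
    if A(i-1) ~= A(i)
        prev = i-1;
    end
    Bloop(i) = prev;
end

% Vectorized version
ind = find(diff(A) ~= 0);     % last index of each run of equal values
Bd = zeros(size(A));
Bd(ind + 1) = diff([0 ind]);  % jump in B at the start of each new run
Bvec = cumsum(Bd);            % accumulate the jumps to recover B

same = isequal(Bloop, Bvec);
```

Here the placeholder x from the question is 0 in both versions.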

What does it mean to use logical indexing/masking to extract data from a matrix? (MATLAB)

I am new to matlab and I was wondering what it meant to use logical indexing/masking to extract data from a matrix.
I am trying to write a function that accepts a matrix and a user-inputted value to compute and display the total number of values in column 2 of the matrix that match with the user input.
The function itself should have no return value and will be called on later in another loop.
But besides all that hubbub, someone suggested that I use logical indexing/masking in this situation but never told me exactly what it was or how I could use it in my particular situation.
EDIT: since you updated the question, I am updating this answer a little.
Logical indexing is explained really well in this and this. In general, I doubt I can do a better job in the time available. However, I will try to connect your problem to logical indexing.
Let's declare an array A which has 2 columns. The first column is an index (1,2,3,...) and the second column is its corresponding value, a random number.
A(:,1)=1:10;
A(:,2)=randi(5,[10 1]); % creates a 10x1 array of random integers from 1 to 5 and puts it into the second column of A
userInputtedValue=3; % self-explanatory
You want to check what values in second column of A are equal to 3. Imagine as if you are making a query and MATLAB is giving you binary response, YES (1) or NO (0).
q=A(:,2)==3 % the query: which values in the second column of A equal 3?
Now, for the indices where the answer is YES, you want to extract the matching entries from the second column of A. Then do some processing.
values=A(q,2); % only those elements will be extracted: 1. which lie in the
% second column of A AND 2. where q takes value 1.
Now, if you want to count total number of values, just do:
numValues=length(values);
I hope now logical indexing is clear to you. However, do read the Mathworks posts which I have mentioned earlier.
I over simplified the code, and wrote more code than required in order to explain things. It can be achieved in a single-liner:
sum(A(:,2)==userInputtedValue)
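The same mask can also pull out whole rows, not only the matching values. This is a small sketch with a fixed matrix in place of the random one so that the result is reproducible; the names mask and matchingRows are hypothetical.

```matlab
% Small fixed matrix: column 1 is an index, column 2 is the value
A = [1 3
     2 5
     3 3
     4 1
     5 3];
userInputtedValue = 3;

mask = A(:,2) == userInputtedValue;   % logical column vector, one entry per row
matchingRows = A(mask, :);            % keeps only the rows where mask is true
numValues = nnz(mask);                % number of true entries in the mask
```

Rows 1, 3 and 5 have the value 3 in the second column, so matchingRows has three rows and numValues is 3.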
I'll give you an example that may illustrate what logical indexing is about:
array = [1 2 3 0 4 2];
array > 2
ans: [0 0 1 0 1 0]
Using logical indexing, you can filter the elements that fulfill a certain condition:
array(array>2) will give: [3 4]
you could also perform alterations to only those elements:
array(array>2) = 100;
array(array<=2) = 0;
will result in "array" equal to
[0 0 100 0 100 0]
Logical indexing means to have a logical / Boolean matrix that is the same size as the matrix that you are considering. You would use this as input into the matrix you're considering, and any locations that are true would be part of the output. Any locations that are false are not part of the output. To perform logical indexing, you would need to use logical / Boolean operators or conditions to facilitate the selection of elements in your matrix.
Let's concentrate on vectors as it's the easiest to deal with. Let's say we had the following vector:
>> A = 1:9
A =
1 2 3 4 5 6 7 8 9
Let's say I wanted to retrieve all values that are 5 or more. The logical condition for this would be A >= 5. We want to retrieve all values in A that are greater than or equal to 5. Therefore, if we did A >= 5, we get a logical vector which tells us which values in A satisfy the above condition:
>> A >= 5
ans =
0 0 0 0 1 1 1 1 1
This certainly tells us where in A the condition is satisfied. The last step would be to use this as input into A:
>> B = A(A >= 5)
B =
5 6 7 8 9
Cool! As you can see, there isn't a need for a for loop to help us select out elements that satisfy a condition. Let's go a step further. What if I want to find all even values of A? This would mean that if we divide by 2, the remainder would be zero, or mod(A,2) == 0. Let's extract out those elements:
>> C = A(mod(A,2) == 0)
C =
2 4 6 8
Nice! So let's go back to your question. Given your matrix A, let's extract out column 2.
>> col = A(:,2)
Now, we want to check to see if any of column #2 is equal to a certain value. Well we can generate a logical indexing array for that. Let's try with the value of 3:
>> ind = col == 3;
Now you'll have a logical vector that tells you which locations are equal to 3. If you want to determine how many are equal to 3, you just have to sum up the values:
>> s = sum(ind);
That's it! s contains how many values were equal to 3. Now, if you wanted to write a function that only displayed how many values were equal to some user defined input and displayed this event, you can do something like this:
function checkVal(A, val)
disp(sum(A(:,2) == val));
end
Quite simply, we extract the second column of A and see how many values are equal to val. This produces a logical array, and we simply sum up how many 1s there are. This would give you the total number of elements that are equal to val.
Troy Haskin pointed you to a very nice link that talks about logical indexing in more detail: http://www.mathworks.com/help/matlab/math/matrix-indexing.html?refresh=true#bq7eg38. Read that for more details on how to master logical indexing.
Good luck!
%% M is your Matrix
M = randi(10,4)
%% Val is the value that you are seeking to find
Val = 6
%% Col is the value of the matrix column that you wish to find it in
Col = 2
%% r is a vector that has zeros in all positions except when the Matrix value equals the user input it equals 1
r = M(:,Col)==Val
%% We can now sum all the non-zero values in r to get the number of matches
n = sum(r)
M =
4 2 2 5
3 6 7 1
4 4 1 6
5 8 7 8
Val =
6
Col =
2
r =
0
1
0
0
n =
1

Statistical error computing Matlab

I have two vectors of values and I want to compare them statistically. For simplicity assume A = [2 3 5 10 15] and B = [2.5 3.1 4.8 10 18]. I want to compute the standard deviation, the root mean square error (RMSE), the mean, and present conveniently, maybe as histogram. Can you please help me how to do it so that I understand? I know question is probably simple, but I am new into this. Many thanks!
edited:
This is how I wanted to implement RMSE.
dt = 1;
for k=1:numel(A)
err(k)=sqrt(sum(A(1,1:k)-B(1,1:k))^2/k);
t(k) = dt*k;
end
However it gives me bigger values than I expect, since e.g. 3 and 3.1 differ only in 0.1.
This is how I calculate the error between the reference value of each cycle and the corresponding estimate in that cycle:
abs_err = A-B;
Can you tell me whether I am doing this right, or what is wrong?
The way you are looping through the vectors is not element by element but rather by increasing the vector length, that is, you are comparing the following at each iteration:
        A(1,1:k)     B(1,1:k)
        --------     --------
k=1     [2]          [2.5]
 =2     [2 3]        [2.5 3.1]
 =3     [2 3 5]      [2.5 3.1 4.8]
 ....
At no point do you compare only 3 and 3.1!
Assuming A and B are vectors of identical length (and both are either column or row vectors), you want the functions std(A-B) and mean(A-B). If you look on the MATLAB File Exchange, you will find a user-contributed rmse(A-B), but you can also compute the RMSE directly as sqrt(mean((A-B).^2)). As for displaying a histogram, try hist(A-B).
In your case:
dt = 1;
for k=1:numel(A)
stdab(k) = std(A(1,1:k)-B(1,1:k));
meanab(k) = mean(A(1,1:k)-B(1,1:k));
err(k)=sqrt(mean((A(1,1:k)-B(1,1:k)).^2));
t(k) = dt*k;
end
You can also include hist(A(1,1:k)-B(1,1:k)) in the loop if you want to compute histograms for every vector pair difference A(1,1:k)-B(1,1:k).
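For the full vectors, the difference between the formula in the question and the RMSE can be seen in a short sketch. The formula in the question squares the sum of the differences, whereas the RMSE squares each difference before averaging; the variable names here are hypothetical.

```matlab
A = [2 3 5 10 15];
B = [2.5 3.1 4.8 10 18];
d = A - B;   % element-wise differences

% Formula from the question: squares the SUM of the differences,
% so positive and negative errors can cancel before squaring
errQuestion = sqrt(sum(d)^2 / numel(d));

% RMSE: square each difference first, then average, then take the root
rmseAB = sqrt(mean(d.^2));

meanAB = mean(d);   % mean difference (bias)
stdAB = std(d);     % standard deviation of the differences
```

Because of the element-wise .^ operator, every pair contributes its own squared error, so the pair 3 and 3.1 adds only 0.01 to the sum inside the mean.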