How to obtain Jaccard similarity in MATLAB

I have a table:
   x  y  z
A  2  0  3
B  0  3  0
C  0  0  4
D  1  4  0
I want to calculate the Jaccard similarity in Matlab, between the vectors A, B, C and D.
The formula is:
jaccard(x, y) = |x intersect y| / (|x| + |y| - |x intersect y|)
In this formula, |x| and |y| indicate the number of items which are not zero. For example, for |A| the number of nonzero items is 2, for |B| and |C| it is 1, and for |D| it is 2.
|x intersect y| indicates the number of common items which are not zero. |A intersect B| is 0. |A intersect D| is 1, because the value of x in both is not zero.
e.g.: jaccard(A,D) = 1/(2+2-1) = 1/3 = 0.33
How can I implement this in Matlab?

MATLAB has a built-in function that computes the Jaccard distance: pdist (it is part of the Statistics Toolbox and treats each row of its input as one observation).
Here is some code:
X = rand(2,100);
X(X>0.5) = 1;
X(X<=0.5) = 0;
JD = pdist(X,'jaccard') % jaccard distance
JI = 1 - JD; % jaccard index
EDIT
A calculation that does not require the Statistics Toolbox:
a = X(1,:);
b = X(2,:);
JD = 1 - sum(a & b)/sum(a | b)
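Applied directly to the question's table, a quick sketch of the same formula (rows of T are the vectors A, B, C, D):
T = [2 0 3; 0 3 0; 0 0 4; 1 4 0];
jac = @(a,b) sum(a~=0 & b~=0) / sum(a~=0 | b~=0); % Jaccard similarity
jac(T(1,:), T(4,:)) % jaccard(A,D) = 1/3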

Related

How to split the dataset in a stratified way (Matlab)?

I have a 30x500 data matrix and a 3x500 target matrix. This is a classification problem. I need to divide the data into training, validation and testing (80%, 10%, 10%), but I want to maintain the proportion of each class in the divided data. How can I do this in Matlab?
Edit:
The target matrix contains the labels (one hot) of the correct class (there are three classes)
|0 0 1 ... 1|
|1 0 0 ... 0|
|0 1 0 ... 0|  (3x500)
The data matrix contains 500 samples with 30 predictor variables (30x500).
|2 0 1 4 8 1 ... 2|
|4 1 5 8 7 3 ... 0|
|1 3 6 4 2 1 ... 6|
|. . . . . . ... .|
|3 5 8 4 0 0 ... 1|  (30x500)
You can impose the number of each class by computing the percentiles associated with the cumulative class probabilities, and then assigning each sample to the interval it falls into. Let's use this strategy to first create a random dataset containing exactly 40% of class 1, 20% of class 2, and 40% of class 3 (note that you don't have to do this in your case, as you already have the class matrix classmat and the data matrix datamat):
clc; clear all; close all;
% Parameters
c = 3;   % number of classes
d = 500; % number of samples
o = 30;  % number of predictor variables per sample
propd = [0.4, 0.2, 0.4]; % proportions of each class in the original data (size 1xc)
% Generation of fake data
datamat = randi([0,15],o,d); % test data matrix
propd_os = cumsum([0, propd]); propd_os(end) = 1;
randmat = rand(d,1);
classmat = zeros(c,d); % test class matrix (one hot)
prctld = prctile(randmat, 100*propd_os); prctld(1) = 0; prctld(end) = 1;
for i=1:c
    classmat(i,randmat>=prctld(i) & randmat<prctld(i+1)) = 1;
end
% Proportions of the original data
disp(['original data proportions:' sprintf(' %.3f',sum(classmat,2)/d)])
When executed, this code creates the classmat matrix and displays the proportions of each class in this matrix:
>>>
original data proportions: 0.400 0.200 0.400
I created a script for you that splits this dataset into parts respecting the proportions of the original dataset:
%% Splitting parameters
s = 3; % number of parts
props = [0.8, 0.1, 0.1]; % proportions of each split dataset (size 1xs)
props_os = cumsum([0, props]); props_os(end) = 1;
randmat = rand(1,d);
splitmat = zeros(s,d); % split matrix (one hot)
for i=1:c
    indc = classmat(i,:)==1;
    % percentiles computed within class i, so each part gets the same class mix
    prctls = prctile(randmat(indc), 100*props_os); prctls(1) = 0; prctls(end) = 1;
    for j=1:s
        inds = randmat>=prctls(j) & randmat<prctls(j+1);
        splitmat(j,indc&inds) = 1;
    end
end
% Proportions of classes in each split part
disp(['original split proportions:' sprintf(' %.3f',sum(splitmat,2)/d)])
for j=1:s
    inds = splitmat(j,:)==1;
    disp([sprintf('part %d proportions:', j) sprintf(' %.3f',sum(classmat(:,inds),2)/sum(inds))])
end
With this you obtain 3 parts containing 80%, 10% and 10% of the data. Each of them has the same class proportions as the original dataset:
>>>
original split proportions: 0.800 0.100 0.100
part 1 proportions: 0.400 0.200 0.400
part 2 proportions: 0.400 0.200 0.400
part 3 proportions: 0.400 0.200 0.400
Note that you cannot always obtain the exact proportions, because it depends on the sizes of the datasets and their divisibility by the inverses of the proportions... But I think it should do what you want. Please do not hesitate to ask questions if you have trouble with the code. At the end you obtain a one-hot splitmat matrix: splitmat(s,d) equals 1 if data point d belongs to part s.
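For example, a short sketch of how you could then extract the training part (part 1) from the matrices above:
% Pull out the training split (part 1)
train_idx = splitmat(1,:) == 1;
X_train = datamat(:, train_idx);  % 30 x (~400) training predictors
y_train = classmat(:, train_idx); % 3 x (~400) one-hot training labels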

Optimization under constraints

I have a question regarding optimization.
I have a matrix x with 3 columns and a certain number of rows (max 200). Each row represents a candidate. Column 1 contains a score (between 0 and 1), column 2 contains the kind of candidate (there are 10 kinds in total, labeled from 1 to 10), and column 3 contains the amount of each candidate. There is one thing to take into consideration: the amount can be NEGATIVE.
What I would like to do is select at most 35 of these candidates so as to maximize the sum of their scores (column 1), under the constraint that each kind makes up at most 10% of the total, computed in the following way: the percentage of kind 1 is the sum of the amounts of kind 1 divided by the sum of all amounts.
In the end, I would like a set of at most 35 candidates which satisfies the constraints and maximizes the sum of their scores.
Here is the code I have come up with so far, but I am struggling with the 10% constraint, as it seems not to be taken into account:
rng('default');
clc;
clear;
n = 100;
maxSize = 35;
%%%TOP BASKET
nbCandidates = 100;
score = rand(100,1)/10+0.9;
quantity = rand(100,1)*100000;
type = ceil(rand(100,1)*10);
typeMask = zeros(n,10);
for i=1:10
    typeMask(:,i) = type(:,1) == i;
end
fTop = -score;
intconTop = 1:n;
%Write the linear INEQUALITY constraints:
A = [ones(1,n);bsxfun(@times,typeMask,quantity)'/sum(type.*quantity)];
b = [maxSize;0.1*ones(10,1)];
%Write the linear EQUALITY constraints:
Aeq = [];
beq = [];
%Write the BOUND constraints:
lb = zeros(n,1);
ub = ones(n,1); % Enforces i1,i2,...in binary
x = intlinprog(fTop,intconTop,A,b,Aeq,beq,lb,ub);
I would be grateful for some advice on where I'm going wrong!
A linear program for your model might look something like this:
n is the number of candidates.
S[x] is candidate x's score.
A[i][x] is the amount of candidate x for kind i (A[i][x] can be positive or negative, like you said).
T[i] is the total amount of all candidates for kind i.
I[x] is 1 if element x is to be included, and 0 if element x is to be excluded.
The function f which you want to optimize is a function of S[x] and I[x]. You can think of S and I as n-dimensional vectors, so the function you want to optimize is their dot-product.
f() = DotProduct(I, S)
This is equivalent to the linear function I1 * S1 + I2 * S2 + ... + In * Sn.
We can formulate all of the constraints in this way to get a set of linear functions whose coefficients are the components of an n-dimensional vector that we can dot with I, the parameters to optimize.
For the constraint that we can only take 35 elements at most, let C1() be a function which computes the total number of elements.
Then the first constraint can be formalized as C1() <= 35 and C1() is a linear function which can be computed thusly:
Let j be an n dimensional vector with each component equal to 1: j = <1,1,...,1>.
C1() = DotProduct(I, j)
So C1() <= 35 is a linear inequality equivalent to:
I1 * 1 + I2 * 1 + ... + In * 1 <= 35
I1 + I2 + ... + In <= 35
We need to add a slack variable x1 here to turn this into an equality:
I1 + I2 + ... + In + x1 = 35
For the constraint that we can only take 10% of each kind, we will have a function C2[i]() for each kind i (you said there are 10 in all). C2[i]() computes the amount taken for kind i given the candidates we have selected:
C2[1]() <= .1 * T[1]
C2[2]() <= .1 * T[2]
...
C2[10]() <= .1 * T[10]
We compute C2[i]() like this:
Let k be an n dimensional vector equal to <A[i]1, A[i]2, ..., A[i]n>, each component is the amount of each candidate for kind i.
Then DotProduct(I, k) = I1 * A[i]1 + I2 * A[i]2 + ... + In * A[i]n, is the total amount we are taking of kind i given I, the vector which captures what elements we are including.
So C2[i]() = DotProduct(I, k)
Now that we know how to compute C2[i](), we need to add a slack variable to turn this into an equality relation:
C2[i]() + x[i + 1] = .1 * T[i]
Here x's subscript is [i + 1] because x1 is already used as a slack variable for the previous constraint.
In summary, the linear program would look like this (adding 11 slack variables x1, x2, ..., x11, one for each constraint that is an inequality):
Let:
V = <I1, I2, ..., In, x1, x2, ..., x11>  (variables)

    |S1 |
    |S2 |
    |...|
P = |Sn |  (parameters of objective function)
    |0  |
    |...|
    |0  |

    |35    |
    |.1*T1 |
C = |.1*T2 |  (right-hand sides of constraining equality relations)
    |...   |
    |.1*T10|

     |1     1     ... 1     1 0 ... 0|
     |A1,1  A1,2  ... A1,n  0 1 ... 0|
CP = |A2,1  A2,2  ... A2,n  0 0 ... 0|  (parameters of constraint functions)
     |...   ...   ... ...   0 0 ... 0|
     |A10,1 A10,2 ... A10,n 0 0 ... 1|

Maximize:
DotProduct(V, P)
Subject to:
CP * Transpose(V) = C
Hopefully this is clear, sorry for terrible formatting.
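As a sketch, this formulation could be translated to MATLAB's intlinprog with made-up data (intlinprog accepts inequality constraints directly, so the slack variables are not needed in code):
% Sketch with assumed data; A(x,i) is the amount of candidate x for kind i
% and T(i) is the total amount over all candidates for kind i.
n = 100; K = 10; rng(1);
S = rand(n,1);            % scores S[x]
A = randn(n,K)*1000;      % amounts (can be negative)
T = sum(A,1)';            % totals T[i]
Aineq = [ones(1,n); A'];  % C1() and C2[i]() as coefficient rows
bineq = [35; 0.1*T];      % C1() <= 35, C2[i]() <= .1*T[i]
I = intlinprog(-S, 1:n, Aineq, bineq, [], [], zeros(n,1), ones(n,1));
chosen = find(round(I))   % indices x with I[x] = 1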
I believe the MIP model can look like:
max   sum(i,j) Score(i,j)*x(i,j)
s.t.  sum(i,j) x(i,j) <= 35
      sum(i) Amount(i,j)*x(i,j) <= 0.1 * sum(i,j) Amount(i,j)*x(i,j)  for all j
      x(i,j) in {0,1}
Here i are the data points and j indicates the type. For simplicity I assumed here that every type has the same number of data points (i.e. Amount(i,j) and Score(i,j) are matrices). It is easy to handle the more irregular case by restricting the summations.
The 10% rule is simply applied to the sum of the amounts. I hope that is the correct interpretation. I am not sure this remains valid if we have negative sums.
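A sketch of that interpretation with intlinprog (assumed data; the selection-dependent denominator is moved to the left-hand side to keep the constraint linear, which is only safe when the selected total is positive):
% 10% rule on the *selected* totals, linearized per type j as
% sum_{i of type j} q_i*x_i - 0.1 * sum_i q_i*x_i <= 0
n = 100; rng('default');
score    = rand(n,1)/10 + 0.9;
quantity = rand(n,1)*100000;
type     = ceil(rand(n,1)*10);
Aineq = ones(1,n); bineq = 35;   % at most 35 candidates
for j = 1:10
    Aineq = [Aineq; (type'==j).*quantity' - 0.1*quantity']; %#ok<AGROW>
    bineq = [bineq; 0];                                     %#ok<AGROW>
end
x = intlinprog(-score, 1:n, Aineq, bineq, [], [], zeros(n,1), ones(n,1));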

How can I merge together two co-occurrence matrices with overlapping but not identical vocabularies?

I'm looking at word co-occurrence in a number of documents. For each set of documents, I find a vocabulary of the N most frequent words. I then make an NxN matrix for each document representing whether the words occur together in the same context window (sequence of k words). This is a sparse matrix, so if I have M documents, I have an NxNxM sparse matrix. Because Matlab cannot store sparse matrices with more than 2 dimensions, I flatten this matrix into a (NxN)xM sparse matrix.
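For example, the flattening could look like this (a small sketch with made-up sizes):
N = 5; M = 3;                  % vocabulary size, number of documents
C = sparse(N*N, M);            % flattened (N*N) x M co-occurrence store
cooc = sparse(N, N);           % one document's N x N co-occurrence matrix
cooc(1,2) = 1; cooc(2,1) = 1;  % e.g. words 1 and 2 co-occur once
C(:,1) = cooc(:);              % store document 1 as a flattened column
doc1 = reshape(C(:,1), N, N);  % recover the N x N matrix when needed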
The problem is that I have generated two of these co-occurrence matrices, for different sets of documents. Because the sets were different, the vocabularies are different. Instead of merging the sets of documents together and recalculating the co-occurrence matrix, I'd like to merge the two existing matrices together.
For example,
N = 5; % Size of vocabulary
M = 5; % Number of documents
A = ones(N*N, M); % A is a flattened (N, N, M) matrix
B = 2*ones(N*N, M); % B is a flattened (N, N, M) matrix
A_ind = {'A', 'B', 'C', 'D', 'E'}; % The vocabulary labels for A
B_ind = {'A', 'F', 'B', 'C', 'G'}; % The vocabulary labels for B
These should merge to produce a (49, 5) matrix, where each (49, 1) slice can be reshaped into a (7, 7) matrix with the following structure:
    A  B  C  D  E  F  G
   _____________________
A | 3  3  3  1  1  2  2
B | 3  3  3  1  1  2  2
C | 3  3  3  1  1  2  2
D | 1  1  1  1  1  0  0
E | 1  1  1  1  1  0  0
F | 2  2  2  0  0  2  2
G | 2  2  2  0  0  2  2
Where A and B overlap, the co-occurrence counts should be added together. Otherwise, the elements should be the counts from A or the counts from B. There will be some elements (0's in the example) where I don't have count statistics because some of the vocabulary is exclusively in A and some is exclusively in B.
The key is to use the ability of logical indices to be flattened.
A = ones(25, 5);
B = 2*ones(25,5);
A_ind = {'A', 'B', 'C', 'D', 'E'};
B_ind = {'A', 'F', 'B', 'C', 'G'};
new_ind = [A_ind, B_ind(~ismember(B_ind, A_ind))];
new_size = length(new_ind)^2;
new_array = zeros(new_size, 5);
% Find the indices that correspond to elements of A
A_overlap = double(ismember(new_ind, A_ind));
A_mask = (A_overlap'*A_overlap)==1;
% Find the indices that correspond to elements of B
B_overlap = double(ismember(new_ind, B_ind));
B_mask = (B_overlap'*B_overlap)==1;
% Flatten the logical indices to assign the elements to the new array
new_array(A_mask(:), :) = A;
new_array(B_mask(:), :) = new_array(B_mask(:), :) + B;
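You can check the result by reshaping one slice back into a 7x7 matrix, which reproduces the table from the question:
check = reshape(new_array(:,1), 7, 7) % each column is one flattened 7x7 slice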

Finding sub-matrix with minimum elementwise sum

I have a symmetric m-by-m matrix A. Each element has a value between 0 and 1. I now want to choose n rows / columns of A which form an n-by-n sub-matrix B.
The criteria for choosing these elements, is that the sum of all elements of B must be the minimum out of all possible n-by-n sub-matrices of A.
For example, suppose that A is a 4-by-4 matrix:
A = [0 0.5 1 0; 0.5 0 0.5 0; 1 0.5 1 1; 0 0 1 0.5]
And n is set to 3. Then, the best B is the one taking the first, second and fourth rows / columns of A:
B = [0 0.5 0; 0.5 0 0; 0 0 0.5]
Here the sum of these elements is 0 + 0.5 + 0 + 0.5 + 0 + 0 + 0 + 0 + 0.5 = 1.5, which is smaller than that of any other possible 3-by-3 sub-matrix (e.g. using the first, third and fourth rows / columns).
How can I do this?
This is partly a mathematics question, and partly a Matlab one. Any help with either would be great!
Do the following:
m = size(A,1);
n=3;
sub = nchoosek(1:m,n); % (numCombinations x n)
subR = permute(sub,[2,3,1]); % (n x 1 x numCombinations), row indices
subC = permute(sub,[3,2,1]); % (1 x n x numCombinations), column indices
lin = bsxfun(@plus,subR,m*(subC-1)); % (n x n x numCombinations), linear indices
allB = A(lin); % (n x n x numCombinations), all possible Bs
sumB = sum(sum(allB,1),2); % (1 x 1 x numCombinations), sum of Bs
sumB = squeeze(sumB); % (numCombinations x 1), sum of Bs
[minB,minBInd] = min(sumB);
fprintf('Indices for minimum B: %s\n',mat2str(sub(minBInd,:)))
fprintf('Minimum B: %s (Sum: %g)\n',mat2str(allB(:,:,minBInd)),minB)
This only looks for submatrices whose row indices are the same as their column indices (and not necessarily consecutive). That is how I understood the question.
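For example, running the snippet above on the question's 4-by-4 matrix with n = 3 should print the expected result:
A = [0 0.5 1 0; 0.5 0 0.5 0; 1 0.5 1 1; 0 0 1 0.5];
n = 3;
% ...run the snippet above...
% Indices for minimum B: [1 2 4]
% Minimum B: [0 0.5 0;0.5 0 0;0 0 0.5] (Sum: 1.5)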
This is a bit brute force, but should work
A = [0 0.5 1 0; 0.5 0 0.5 0; 1 0.5 1 1; 0 0 1 0.5];
sizeA = size(A,1);
size_sub=3;
idx_combs = nchoosek(1:sizeA, size_sub);
sums = zeros(size(idx_combs,1), 1); % preallocate
for ii=1:size(idx_combs,1)
    sub_temp = A(idx_combs(ii,:),:);
    sub = sub_temp(:,idx_combs(ii,:));
    sum_temp = sum(sub);
    sums(ii) = sum(sum_temp);
end
[min_set, idx] = min(sums);
sub_temp = A(idx_combs(idx,:),:);
sub = sub_temp(:,idx_combs(idx,:))
Try convolving the matrix A with a smaller matrix M. E.g. if you are interested in finding the 3x3 submatrix, let M be ones(3). This code shows how it works.
A = toeplitz(10:-1:1) % Create a Toeplitz matrix (example matrix)
m = 3; % Submatrix size
mC = ceil(m/2); % Distance to center of submatrix
M = ones(m);
Aconv = conv2(A,M); % Do the convolution.
[~,minColIdx] = min(min(Aconv(1+mC:end-mC,1+mC:end-mC))); % Find column center with smallest sum
[~,minRowIdx] = min(min(Aconv(1+mC:end-mC,minColIdx+mC),[],2)); % Find row center with smallest sum
minRowIdx = minRowIdx+mC-1 % Convoluted matrix is larger than A
minColIdx = minColIdx+mC-1 % Convoluted matrix is larger than A
range = -mC+1:mC-1
B = A(minRowIdx+range, minColIdx+range)
The idea is to imitate an FIR filter, y(n) = 1*x(n-1) + 1*x(n) + 1*x(n+1). Note that this approach only finds contiguous submatrices, and for now only the first smallest one. Notice the index adjustments, needed because MATLAB indexing starts at 1 and the convolved matrix is larger than A, and then the restoration of B right below them.

Compute all differences possibilities in a vector

Let's say I have a short vector x = [a,b,c,d,e]. What would be the best way to compute all the differences between members of the vector, as:
y = [e-d e-c e-b e-a
d-e d-c d-b d-a
c-e c-d c-b c-a
b-e b-d b-c b-a
a-e a-d a-c a-b];
Thanks in advance
To give that exact matrix, try:
x = [1;2;3;4;5]; %# note this is a column vector (matrix of rows in general)
D = squareform( pdist(x,@(p,q)q-p) );
U = triu(D);
L = tril(D);
y = flipud(fliplr( L(:,1:end-1) - U(:,2:end) ))
result in this case:
y =
1 2 3 4
-1 1 2 3
-2 -1 1 2
-3 -2 -1 1
-4 -3 -2 -1
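If you don't have the Statistics Toolbox, here is a sketch of an alternative using implicit expansion (R2016b+; on older releases use bsxfun(@minus, xr, xr.')):
x  = (1:5)';       % column vector [a;b;c;d;e]
n  = numel(x);
xr = flipud(x);    % order the elements e,d,c,b,a
D  = xr - xr.';    % D(i,j) = xr(i) - xr(j); the diagonal holds the zero self-differences
Dt = D.';
y  = reshape(Dt(~eye(n)), n-1, n).' % drop each row's self-difference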
First create a circulant matrix, then compute the differences between the first column and the remaining columns. Here is a reference for creating a circulant matrix.