How to split the dataset in a stratified way (Matlab)?

I have a 30x500 data matrix and a 3x500 target matrix. This is a classification problem. I need to divide the data into training, validation and testing (80%, 10%, 10%), but I want to maintain the proportion of each class in the divided data. How can I do this in Matlab?
Edit:
The target matrix contains the labels (one hot) of the correct class (there are three classes)
|0 0 1 ... 1|
|1 0 0 ... 0|
|0 1 0 ... 0| 3x500
The data matrix contains 500 samples with 30 predictor variables (30x500).
|2 0 1 4 8 1 ... 2|
|4 1 5 8 7 3 ... 0|
|1 3 6 4 2 1 ... 6|
|. . . . . . . . .|
|3 5 8 4 0 0 ... 1| 30x500

You can impose the number of samples in each class by computing the percentiles associated with the cumulative proportions, and then assigning each percentile interval to the corresponding class. Let's use this strategy to first create a random dataset containing exactly 40% of class 1, 20% of class 2, and 40% of class 3 (note that you don't have to do this in your case, as you already have the class matrix classmat and the data matrix datamat):
clc; clear all; close all;
% Parameters
c = 3; % number of classes
d = 500; % number of samples
o = 30; % number of predictors per sample
propd = [0.4, 0.2, 0.4]; % proportions of each class in the original data (size 1xc)
% Generation of fake data
datamat = randi([0,15],o,d); % test data matrix
propd_os = cumsum([0, propd]); propd_os(end) = 1; % cumulative proportions
randmat = rand(d,1);
classmat = zeros(c,d); % test class matrix (one hot)
% Percentiles of randmat at the cumulative proportions (prctile needs the Statistics Toolbox)
prctld = prctile(randmat, 100*propd_os); prctld(1) = 0; prctld(end) = 1;
for i=1:c
    classmat(i,randmat>=prctld(i) & randmat<prctld(i+1)) = 1;
end
% Proportions of the original data
disp(['original data proportions:' sprintf(' %.3f',sum(classmat,2)/d)])
When executed, this code creates the classmat matrix and displays the proportions of each class in this matrix:
>>>
original data proportions: 0.400 0.200 0.400
Here is a script that splits this dataset into parts that keep the class proportions of the original dataset:
%% Splitting parameters
s = 3; % number of parts
props = [0.8, 0.1, 0.1]; % proportions of each split dataset (size 1xs)
props_os = cumsum([0, props]); props_os(end) = 1; % cumulative proportions
randmat = rand(1,d);
splitmat = zeros(s,d); % split matrix (one hot)
for i=1:c
    indc = classmat(i,:)==1; % samples of class i
    % Percentiles computed within class i only, so each class is cut at the same proportions
    prctls = prctile(randmat(indc), 100*props_os); prctls(1) = 0; prctls(end) = 1;
    for j=1:s
        inds = randmat>=prctls(j) & randmat<prctls(j+1);
        splitmat(j,indc&inds) = 1;
    end
end
% Proportions of classes in each split part
disp(['original split proportions:' sprintf(' %.3f',sum(splitmat,2)/d)])
for j=1:s
    inds = splitmat(j,:)==1;
    disp([sprintf('part %d proportions:', j) sprintf(' %.3f',sum(classmat(:,inds),2)/sum(inds))])
end
This gives three parts containing 80%, 10% and 10% of the data. Each of them has the same class proportions as the original dataset:
>>>
original split proportions: 0.800 0.100 0.100
part 1 proportions: 0.400 0.200 0.400
part 2 proportions: 0.400 0.200 0.400
part 3 proportions: 0.400 0.200 0.400
Note that you cannot always obtain the exact proportions, because this depends on the class sizes and their divisibility by the inverses of the proportions. But I think it should do what you want. Please do not hesitate to ask if you have trouble with the code. At the end you obtain a one-hot splitmat matrix: splitmat(j,k) equals 1 if data point k belongs to part j.
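For the one-hot target matrix described in the question, the same kind of stratified split can also be sketched without prctile, by shuffling the sample indices of each class and cutting them at the 80% and 90% marks. This is only a minimal sketch: targets and data are randomly generated stand-ins for the 3x500 and 30x500 matrices, and part, trainX, valX, testX and the matching T variables are hypothetical names.

```matlab
rng(0);                                   % fixed seed so the sketch is repeatable
d = 500;
labels = randi(3, 1, d);                  % stand-in class labels
targets = zeros(3, d);                    % stand-in one-hot target matrix (3x500)
targets(sub2ind([3 d], labels, 1:d)) = 1;
data = randi([0 15], 30, d);              % stand-in data matrix (30x500)

[~, cls] = max(targets, [], 1);           % class index of every sample
props = [0.8 0.1 0.1];                    % train / validation / test
part = zeros(1, d);                       % 1 = train, 2 = validation, 3 = test
for c = 1:size(targets, 1)
    idx = find(cls == c);
    idx = idx(randperm(numel(idx)));      % shuffle the samples of class c
    nTrain = round(props(1) * numel(idx));
    nVal = round(props(2) * numel(idx));
    part(idx(1:nTrain)) = 1;
    part(idx(nTrain+1:nTrain+nVal)) = 2;
    part(idx(nTrain+nVal+1:end)) = 3;     % whatever is left goes to the test part
end

trainX = data(:, part == 1); trainT = targets(:, part == 1);
valX = data(:, part == 2);   valT = targets(:, part == 2);
testX = data(:, part == 3);  testT = targets(:, part == 3);
```

Because the counts are rounded per class, the parts can differ from the requested proportions by a sample or so in each class.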

Related

Generate cell with random pairs without repetitions

How to generate a sequence of random pairs without repeating pairs?
The following code already generates the pairs, but does not avoid repetitions:
for k=1:8
Comb=[randi([-15,15]) ; randi([-15,15])];
T{1,k}=Comb;
end
When running I got:
T= [-3;10] [5;2] [1;-5] [10;9] [-4;-9] [-5;-9] [3;1] [-3;10]
The pair [-3;10] is repeated, which must not happen.
PS: The entries can be positive or negative.
Is there any built-in function for this? Any suggestion to solve this?

If you have the Statistics Toolbox, you can use randsample to sample 8 numbers from 1 to 31^2 (where 31 is the population size), without replacement, and then "unpack" each obtained number into the two components of a pair:
s = -15:15; % population
M = 8; % desired number of samples
N = numel(s); % population size
y = randsample(N^2, M); % sample without replacement
result = s([ceil(y/N) mod(y-1, N)+1]); % unpack pair and index into population
Example run:
result =
14 1
-5 7
13 -8
15 4
-6 -7
-6 15
2 3
9 6

You can use ind2sub:
n = 15;
m = 8;
[x y]=ind2sub([n n],randperm(n*n,m));
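As written, the ind2sub snippet draws values from 1 to 15. To cover the -15..15 range from the question, the subscripts can be used as indices into a population vector. A minimal sketch under that assumption (s, pairs and T are illustrative names, and the two-input form of randperm is assumed to be available):

```matlab
s = -15:15;                                   % population
N = numel(s);                                 % population size (31)
M = 8;                                        % number of pairs wanted
% M distinct linear indices into an N-by-N grid, unpacked into row/column subscripts
[r, c] = ind2sub([N N], randperm(N^2, M));
pairs = [s(r); s(c)].';                       % M-by-2, one pair per row
T = num2cell(pairs.', 1);                     % 1-by-M cell of 2-by-1 columns, as in the question
```

Since the linear indices are distinct, no (first, second) pair can repeat, although individual values can.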

Two possibilities:
1.
M = nchoosek(1:15, 2);
T = datasample(M, 8, 'replace', false);
2.
T = zeros(8,2);
k = 1;
while (k <= 8)
t = randi(15, [1,2]);
b1 = (T(:,1) == t(1));
b2 = (T(:,2) == t(2));
if ~any(b1 & b2)
T(k,:) = t;
k = k + 1;
end
end
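The first method is probably faster, but it takes more memory and may not be practical for very large ranges (e.g. if the maximum were 50000 instead of 15), in which case you have to go with the second. Note that nchoosek(1:15, 2) only lists pairs of two different values in increasing order, whereas the second method also allows pairs such as [3 3] or [5 2].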

Matlab get all possible combinations less than a value

I have a matrix as follows:
id value
=============
1 0.5
2 0.5
3 0.8
4 0.3
5 0.2
From this array, I wish to find all the possible combinations that have a sum less than or equal to 1. That is,
result
======
1 2
1 4 5
2 4 5
3 5
1 5
1 4
2 4
2 5
...
In order to get the above result, my idea has been to initially compute all the possibilities of finding sum of elements in the array, like so:
for ii = 1 : length(a) % compute number of possibilities
no_of_possibilities = no_of_possibilities + nchoosek(length(a),ii);
end
Once this is done, then loop through all possible combinations.
I would like to know if there's an easier way of doing this.

data = [0.5, 0.5, 0.8, 0.3, 0.2];
required = cell(1, length(data));
subsets = cell(1, length(data));
for k = 2:length(data)-1 % removes trivial cases (all numbers or one number at a time)
% generate all possible k-pairs (if k = 3, then all possible triplets
% will be generated)
combination = nchoosek(1:length(data), k);
% for every combination generated, this function sums the corresponding
% values and then decides whether the sum is less than or equal to 1 or
% not
findRequired = @(x) sum(data(1, combination(x, :))) <= 1;
% generate a logical vector for all possible combinations like [0 1 0]
% which denotes that the 2nd combination satisfies the condition while
% the others do not
required{k} = arrayfun(findRequired, 1:size(combination, 1));
% access the corresponding combinations from the entire set
subsets{k} = combination(required{k}, :);
end
This produces the following subsets:
1 2
1 4
1 5
2 4
2 5
3 5
4 5
1 4 5
2 4 5

This is not an easier way, but it is a faster one, because it discards every combination that contains a subset already known to fail the condition.
% A is the two-column [id value] matrix from the question
bitNo = size(A,1); % number of bits (one per row of A)
setNo = 2 ^ bitNo - 1; % number of sets
subsets = logical(dec2bin(0:setNo, bitNo) - '0'); % all subsets
subsets = subsets(2:end,:); % all subsets minus empty set!
subsetCounter = 1;
resultCounter = 1;
result = {};
% loop until every remaining subset, including the last one, has been checked
while subsetCounter <= size(subsets,1)
    if(sum(A(subsets(subsetCounter,:).',2)) <= 1)
        result{resultCounter} = A(subsets(subsetCounter,:).',1).';
        resultCounter = resultCounter + 1;
        subsetCounter = subsetCounter + 1;
    else
        % remove all bad cases related to the current subset
        % (the current subset and all of its supersets)
        subsets = subsets(sum((subsets & subsets(subsetCounter,:)) - subsets(subsetCounter,:),2) ~= 0,:);
    end
end
The code first generates all subsets and then checks the condition for each one. If a subset fails the condition, all of its supersets are removed from subsets. This is done with sum((subsets & subsets(i,:)) - subsets(i,:),2) ~= 0, which keeps only the rows that do not contain every element of the failed subset. That way, many bad cases are never examined. Theoretically, however, this code is still Θ(2^n).
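For small inputs such as the five values in the question, a plain enumeration without any pruning may already be enough. The following is a minimal sketch, not the method of the answer above: every non-empty subset is encoded as a row of a 0/1 mask matrix, and all sums come from one matrix product. The tolerance tol is an assumption added here to guard against floating-point rounding, and subsets with a single element are skipped, following the convention of the first answer.

```matlab
data = [0.5, 0.5, 0.8, 0.3, 0.2];
n = numel(data);
masks = dec2bin(1:2^n-1, n) - '0';      % one row per non-empty subset; column k is id k
sums = masks * data(:);                 % sum of the selected values for every subset
sizes = sum(masks, 2);                  % number of elements in every subset
tol = 1e-12;                            % assumed tolerance for floating-point sums
keep = sums <= 1 + tol & sizes >= 2;
% convert each kept mask row into the list of ids it selects
result = cellfun(@find, num2cell(masks(keep, :) == 1, 2), 'UniformOutput', false);
```

The mask matrix has 2^n - 1 rows, so this only makes sense while n stays small.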

Here is potential solution, using inefficient steps, but borrowing efficient code from various SO answers. Credit goes to those original peeps.
data = [0.5, 0.5, 0.8, 0.3, 0.2];
First get all combinations of indices, not necessarily using all values.
combs = bsxfun(@minus, nchoosek(1:numel(data)+numel(data)-1,numel(data)), 0:numel(data)-1);
Then get rid of repeated indices in each combination, regardless of index order
[ii, ~, vv] = find(sort(combs,2));
uniq = accumarray(ii(:), vv(:), [], @(x){unique(x.')});
Next get unique combinations, regardless of index order... NOTE: You can do this step much more efficiently by restructuring the steps, but it'll do.
B = cellfun(@mat2str,uniq,'uniformoutput',false);
[~,ia] = unique(B);
uniq=uniq(ia);
Now sum all values in data based on cell array (uniq) of index combinations
idx = cumsum(cellfun('length', uniq));
x = diff(bsxfun(@ge, [0; idx(:)], 1:max(idx)));
x = sum(bsxfun(@times, x', 1:numel(uniq)), 2); %'// Produce subscripts
y = data([uniq{:}]); % // Obtain values
sums_data = accumarray(x, y);
And finally only keep the index combinations that sum to <= 1
allCombLessThanVal = uniq(sums_data<=1)

Eliminate/Remove duplicates from array Matlab

How can I remove every number that has a duplicate from an array?
For example:
b =[ 1 1 2 3 3 5 6]
becomes
b =[ 2 5 6]

Use the unique function to extract the unique values, then compute a histogram of the data over those values and keep the ones whose count is 1.
a =[ 1 1 2 3 3 5 6];
u = unique(a)
idx = hist(a, u) ==1;
b = u(idx)
Result:
2 5 6
For multi-column input, this can be done:
a = [1 2; 1 2;1 3;2 1; 1 3; 3 5 ; 3 6; 5 9; 6 10] ;
[u ,~, uid] = unique(a,'rows');
idx = hist(uid,1:size(u,1))==1;
b= u(idx,:)
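The counting step does not have to go through hist. A minimal sketch of the same idea using the third output of unique together with accumarray (the variable names j and counts are illustrative):

```matlab
a = [1 1 2 3 3 5 6];
[u, ~, j] = unique(a);            % j(k) is the position of a(k) within u
counts = accumarray(j(:), 1);     % how many times each unique value occurs
b = u(counts.' == 1);             % keep the values that occur exactly once
```

For the example vector this leaves b = [2 5 6], the same result as above.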

You can first sort your elements and afterwards remove all elements which have the same value as one of its neighbors as follows:
A_sorted = sort(A); % sort elements
A_diff = diff(A_sorted)~=0; % check whether each element differs from the next one
A_unique = [A_diff true] & [true A_diff]; % check if element is different from previous and next one
A = A_sorted(A_unique); % obtain the unique elements.
Benchmark
I will benchmark my solution with the other provided solutions, i.e.:
using diff (my solution)
using hist (rahnema1)
using sum (Jean Logeart)
using unique (my alternative solution)
I will use two cases:
small problem (yours): A = [1 1 2 3 3 5 6];
larger problem
rng('default');
A= round(rand(1, 1000) * 300);
Result:
                | Small      | Large      % Comments
----------------|------------|------------%----------------
using `diff`    | 6.4080e-06 | 6.2228e-05 % Fastest method for large problems
using `unique`  | 6.1228e-05 | 2.1923e-04 % Good performance
using `sum`     | 5.4352e-06 | 0.0020     % Only fast for small problems, preserves the original order
using `hist`    | 8.4408e-05 | 1.5691e-04 % Good performance
My solution (using diff) is the fastest method for somewhat larger problems. The solution of Jean Logeart using sum is faster for small problems, but the slowest method for larger problems, while mine is almost equally fast for the small problem.
Conclusion: In general, my proposed solution using diff is the fastest method.
timeit(@() usingDiff(A))
timeit(@() usingUnique(A))
timeit(@() usingSum(A))
timeit(@() usingHist(A))
function A = usingDiff (A)
    A_sorted = sort(A);
    A_unique = [diff(A_sorted)~=0 true] & [true diff(A_sorted)~=0];
    A = A_sorted(A_unique);
end
function A = usingUnique (A)
    [~, ia1] = unique(A, 'first');
    [~, ia2] = unique(A, 'last');
    A = A(ia1(ia1 == ia2));
end
function A = usingSum (A)
    A = A(sum(A==A') == 1);
end
function A = usingHist (A)
    u = unique(A);
    A = u(hist(A, u) ==1);
end

How should I average groups of rows in a matrix to produce a new, smaller matrix?

I have a very large matrix (216 rows, 31286 cols) of doubles. For reasons specific to the data, I want to average every 9 rows to produce one new row. So, the new matrix will have 216/9=24 rows.
I am a Matlab beginner so I was wondering if this solution I came up with can be improved upon. Basically, it loops over every group, sums up the rows, and then divides the new row by 9. Here's a simplified version of what I wrote:
matrix_avg = []
for group = 1:216/9
new_row = zeros(1, 31286);
idx_low = (group - 1) * 9 + 1;
idx_high = idx_low + 9 - 1;
% Add the 9 rows to new_row
for j = idx_low:idx_high
new_row = new_row + M(j,:);
end
% Compute the mean
new_row = new_row ./ 9
matrix_avg = [matrix_avg; new_row];
end

You can reshape your big matrix from 216 x 31286 to 9 x (216/9 * 31286).
Then you can use mean, which operates on each column. Since your matrix only has 9 rows per column, this takes the 9-row average.
Then you can just reshape your matrix back.
% generate big matrix
M = rand([216 31286]);
n = 9 % want 9-row average.
% reshape
tmp = reshape(M, [n prod(size(M))/n]);
% mean column-wise (and only 9 rows per col)
tmp = mean(tmp);
% reshape back
matrix_avg = reshape(tmp, [ size(M,1)/n size(M,2) ]);
In a one-liner (but why would you?):
matrix_avg = reshape(mean(reshape(M,[n prod(size(M))/n])), [size(M,1)/n size(M,2)]);
Note - this will have problems if the number of rows in M isn't exactly divisible by 9, but so will your original code.
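To see why the reshape trick groups the right rows, it helps to run it on a matrix small enough to check by hand. A minimal sketch with a 6x4 matrix and groups of 3 rows (the small sizes and the name expected are just for illustration):

```matlab
M = reshape(1:24, 6, 4);                     % column k holds 6*(k-1)+1 ... 6*k
n = 3;                                       % average every 3 rows
tmp = mean(reshape(M, n, []), 1);            % each column of the reshaped matrix is one group of n rows
matrix_avg = reshape(tmp, size(M,1)/n, size(M,2));
% reference computed group by group
expected = [mean(M(1:3,:), 1); mean(M(4:6,:), 1)];
disp(isequal(matrix_avg, expected))          % displays 1
```

MATLAB stores matrices column by column, so each column of M is cut into consecutive chunks of n elements, which is exactly the row grouping wanted. Passing the dimension explicitly to mean keeps the code correct even when n is 1.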

I measured the 4 solutions and here are the results:
reshape: Elapsed time is 0.017242 seconds.
blockproc [9 31286]: Elapsed time is 0.242044 seconds.
blockproc [9 1]: Elapsed time is 44.477094 seconds.
accumarray: Elapsed time is 103.274071 seconds.
This is the code I used:
M = rand(216,31286);
fprintf('reshape: ');
tic;
n = 9;
matrix_avg1 = reshape(mean(reshape(M,[n prod(size(M))/n])), [size(M,1)/n size(M,2)]);
toc
fprintf('blockproc [9 31286]: ');
tic;
fun = @(block_struct) mean(block_struct.data);
matrix_avg2 = blockproc(M,[9 31286],fun);
toc
fprintf('blockproc [9 1]: ');
tic;
fun = @(block_struct) mean(block_struct.data);
matrix_avg3 = blockproc(M,[9 1],fun);
toc
fprintf('accumarray: ');
tic;
[nR,nC] = size(M);
n2average = 9;
[xx,yy] = ndgrid(1:nR,1:nC);
x = ceil(xx/n2average); %# makes xx 1 1 1 1 2 2 2 2 etc
matrix_avg4 = accumarray([xx(:),yy(:)],M(:),[],@mean);
toc

Here's an alternative based on accumarray. You create an array with row and column indices into matrix_avg that tells you which element in matrix_avg a given element in M contributes to, then you use accumarray to average the elements that contribute to the same element in matrix_avg. This solution works even if the number of rows in M is not divisible by 9.
M = rand(216,31286);
[nR,nC] = size(M);
n2average = 9;
[xx,yy] = ndgrid(1:nR,1:nC);
x = ceil(xx/n2average); %# group index of every row: 1 1 ... 1 2 2 ... 2 etc
matrix_avg = accumarray([x(:),yy(:)],M(:),[],@mean);
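The claim about row counts that are not divisible by the group size can be checked on a tiny case. A minimal sketch, assuming the grouped row index is the one passed to accumarray as the first subscript (g is an illustrative name):

```matlab
M = reshape(1:10, 5, 2);            % 5 rows, not divisible by 2
n2average = 2;
[nR, nC] = size(M);
[xx, yy] = ndgrid(1:nR, 1:nC);
g = ceil(xx / n2average);           % rows 1-2 -> group 1, rows 3-4 -> group 2, row 5 -> group 3
matrix_avg = accumarray([g(:), yy(:)], M(:), [], @mean);
```

The last group simply averages the single leftover row, so matrix_avg is 3x2 here.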

MATLAB function matrix parameter

I've seen a blog post about computing the K-nearest neighbor as follows:
function test_targets = knn(train_patterns, train_targets, test_patterns, K)
% Contact budi santosa at budi_s@ie.its.ac.id
% for bug reports.
% Implementation of the nearest neighbor algorithm
% Inputs:
% train_patterns - Train patterns (obs x dim) D x N
% train_targets - Train targets 1 x N (classes)
% test_patterns - Test patterns D x M (M testing)
% K - number of nearest neighbors
%
% Outputs
% test_targets - Predicted targets
L = length(train_targets);
Uc = unique(train_targets);
if (L < K),
    error('more neighbors than training points')
end
N = size(test_patterns, 1);
test_targets = zeros(N,1);
for i = 1:N,
    jar=(train_patterns - repmat(test_patterns(i,:),L,1)).^2;
    dist = sum(jar,2);% distance from each test point to the training data
    [m, indices] = sort(dist);% sort the distances, smallest first
    yt=train_targets(indices(1:K));% take the K smallest distances and check their labels
    n = hist(yt, Uc);% count how many of them fall in each class (bins given by Uc)
    [m, best] = max(n);% find the most frequent class among the K nearest neighbors
    test_targets(i) = Uc(best);
end
My problem is that I keep getting the following MATLAB message:
??? Error using ==> minus
Matrix dimensions must agree.
I have 2 matrices:
A is NxD A =
670.00 1630.00 2380.00 1
721.00 1680.00 2400.00 1
750.00 1710.00 2440.00 1
660.00 1800.00 2150.00 1
660.00 1800.00 2150.00 1
680.00 1958.00 2542.00 1
440.00 1120.00 2210.00 2
400.00 1070.00 2280.00 2
B is MxD B =
750.00 1710.00 2440.00 1
680.00 1910.00 2440.00 1
500.00 1000.00 2325.00 2
500.00 1000.00 2325.00 2
As you can see, the 4th column says the class of the example. I am using the function like:
train_patterns = A(:,:) %HOW TO PASS A??, A(:,1:3)? A(1:size(B,1),:) ?? which????
train_targets = A(:,4) %pass the column 4 as vector of classes
test_patterns = B(:,1:3) %pass only the 3 columns
Knn = 3
So the output must be a vector 1 x M with the prediction of all B examples. How can I accomplish this?

You need to transpose A and B to go from NxD to DxN (using the ' operator).
Thus:
train_patterns = A(:,1:3)'; %'# 3-by-N
train_targets = A(:,4)'; %'# 1-by-N
test_patterns = B(:,1:3)'; %'# 3-by-M (last column will be used by you for checking)
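One thing worth checking against the function body as posted: it loops over size(test_patterns, 1), takes test_patterns(i,:) and tiles it L times along the rows before subtracting it from train_patterns. Read that way, the body seems to expect one observation per row (N-by-D and M-by-D) despite the header comment, and the original error would come from passing all four columns of A against three columns of B. The following is a minimal sketch under that reading, with the loop inlined so that it runs on its own. The hist call is swapped for an explicit count per class, and predicted is an illustrative name.

```matlab
A = [670 1630 2380 1; 721 1680 2400 1; 750 1710 2440 1; 660 1800 2150 1; ...
     660 1800 2150 1; 680 1958 2542 1; 440 1120 2210 2; 400 1070 2280 2];
B = [750 1710 2440 1; 680 1910 2440 1; 500 1000 2325 2; 500 1000 2325 2];

train_patterns = A(:, 1:3);          % N-by-D, features only
train_targets = A(:, 4);             % N-by-1 class labels
test_patterns = B(:, 1:3);           % M-by-D, features only
K = 3;

L = length(train_targets);
Uc = unique(train_targets);
M = size(test_patterns, 1);
predicted = zeros(M, 1);
for i = 1:M
    % squared Euclidean distance from test point i to every training point
    d2 = sum((train_patterns - repmat(test_patterns(i,:), L, 1)).^2, 2);
    [~, order] = sort(d2);
    votes = train_targets(order(1:K));             % labels of the K nearest neighbors
    counts = arrayfun(@(c) sum(votes == c), Uc);   % votes per class
    [~, best] = max(counts);
    predicted(i) = Uc(best);
end
```

On this data the predictions can be compared directly with the fourth column of B.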