I have a large dataset as below. From the data, I want to randomly sample based on 'id'. Since the data has 5 ids, I would like to sample 5 ids with replacement and produce a new dataset with observations of sampled ids.
id value var1 var2 …
1 1
1 2
1 3
1 4
2 5
2 6
2 7
3 8
3 9
3 10
4 11
4 12
4 13
5 14
5 15
5 16
Let's suppose, I randomly draw 5 values from 1 to 5 (because there are 5 unique ids) and the result is (2 4 3 2 1). Then, I would like to have this data
id value var1 var2 …
2 5
2 6
2 7
4 11
4 12
4 13
3 8
3 9
3 10
2 5
2 6
2 7
1 1
1 2
1 3
1 4
Here is a sample code for ids varying from 1 through 5.
% data = [1 1; 1 2; 1 3; 1 4; 2 5; 2 6; 2 7; 3 8; 3 9; 3 10; 4 11; 4 12; 4 13;...
% 5 14; 5 15; 5 16];
data = rand(10000000,10);
data(:,1) = randi([1,5], length(data),1);
% Get all the indices from the 1st column;
indxCell = cell(5,1);
for i=1:5
tmpIndx = find(data(:,1) == i);
indxCell{i} = tmpIndx;
end
% Rearrange the indices
randIndx = randperm(5);
randIndxCell = indxCell(randIndx, 1);
% Generate a vector of indices by rearranging the 1st column of data matrix.
numDataPts = length(data);
newIndices = zeros(numDataPts,1);
endIndx = 1;
for i=1:5
startIndx = endIndx;
endIndx = startIndx + length(randIndxCell{i});
newIndices(startIndx:endIndx-1, 1) = randIndxCell{i};
end
newData = data(newIndices,:);
For more unique ids, you could modify the code.
Edits: Modified the data size and also rewrote the 2nd for-loop.
Related
Suppose I have a list of length 2k, say {1,2,...,2k}. The number of possible ways of grouping the 2k numbers into k (unordered) pairs is n(k) = 1*3* ... *(2k-1). So for k=2, we have the following three different ways of forming 2 pairs
(1 2)(3 4)
(1 3)(2 4)
(1 4)(2 3)
How can I use Matlab to create the above list, i.e., create a matrix of n(k)*(2k) such that each row contains a different way of grouping the list of 2k numbers into k pairs.
clear
k = 3;
set = 1: 2*k;
p = perms(set); % get all possible permutations
% sort each two column
[~, col] = size(p);
for i = 1: 2: col
p(:, i:i+1) = sort(p(:,i:i+1), 2);
end
p = unique(p, 'rows'); % remove the same row
% sort each row
[row, col] = size(p);
for i = 1: row
temp = reshape(p(i,:), 2, col/2)';
temp = sortrows(temp, 1);
p(i,:) = reshape(temp', 1, col);
end
pairs = unique(p, 'rows'); % remove the same row
pairs =
1 2 3 4 5 6
1 2 3 5 4 6
1 2 3 6 4 5
1 3 2 4 5 6
1 3 2 5 4 6
1 3 2 6 4 5
1 4 2 3 5 6
1 4 2 5 3 6
1 4 2 6 3 5
1 5 2 3 4 6
1 5 2 4 3 6
1 5 2 6 3 4
1 6 2 3 4 5
1 6 2 4 3 5
1 6 2 5 3 4
As someone think my former answer is not useful, i post this.
I have the following brute force way of enumerating the pairs. Not particularly efficient. It can also cause memory problem when k>9. In that case, I can just enumerate but not create Z and store the result in it.
function Z = pair2(k)
count = [2*k-1:-2:3];
tcount = prod(count);
Z = zeros(tcount,2*k);
x = [ones(1,k-2) 0];
z = zeros(1,2*k);
for i=1:tcount
for j=k-1:-1:1
if x(j)<count(j)
x(j) = x(j)+1;
break
end
x(j) = 1;
end
y = [1:2*k];
for j=1:k-1
z(2*j-1) = y(1);
z(2*j) = y(x(j)+1);
y([1 x(j)+1]) = [];
end
z(2*k-1:2*k) = y;
Z(i,:) = z;
end
k = 3;
set = 1: 2*k;
combos = combntns(set, k);
[len, ~] = size(combos);
pairs = [combos(1:len/2,:) flip(combos(len/2+1:end,:))];
pairs =
1 2 3 4 5 6
1 2 4 3 5 6
1 2 5 3 4 6
1 2 6 3 4 5
1 3 4 2 5 6
1 3 5 2 4 6
1 3 6 2 4 5
1 4 5 2 3 6
1 4 6 2 3 5
1 5 6 2 3 4
You can also use nchoosek instead of combntns. See more at combntns or nchoosek
a = [1 1 1 1 2 2 2 2 3 3 3 3; 1 2 3 4 5 6 7 8 9 10 11 12]';
What is the quickest way to sum the values in column 2 that correspond to the first and last occurrence of each number in column 1?
The desired output:
1 5
2 13
3 21
EDIT: The result should be the same if the numbers in column 1 are ordered differently.
a = [2 2 2 2 1 1 1 1 3 3 3 3; 1 2 3 4 5 6 7 8 9 10 11 12]';
2 5
1 13
3 21
You can use accumarray as follows. Not sure how fast it is, especially because it uses a custom anonymous function:
[u, ~, v] = unique(a(:,1), 'stable');
s = accumarray(v, a(:,2), [], #(x) x(1)+x(end));
result = [u s];
If the values in the first column of a are always in contiguous groups, the following approach can be used as well:
ind_diff = find(diff(a(:,1))~=0);
ind_first = [1; ind_diff+1];
ind_last = [ind_diff; size(a,1)];
s = a(ind_first,2) + a(ind_last,2);
result = [unique(a(:,1), 'stable') s];
As the title says, I want to find all rows in a Matlab matrix that in certain columns the values in the row are equal with the values in the previous row, or in general, equal in some row in the matrix. For example I have a matrix
1 2 3 4
1 2 8 10
4 5 7 9
2 3 6 4
1 2 4 7
and I want to find the following rows:
1 2 3 4
1 2 3 10
1 2 4 7
How do I do something like that and how do I do it generally for all the possible pairs in columns 1 and 2, and have equal values in previous rows, that exist in the matrix?
Here's a start to see if we're headed in the right direction:
>> M = [1 2 3 4;
1 2 8 10;
4 5 7 9;
2 3 6 4;
1 2 4 7];
>> N = M; %// copy M into a new matrix so we can modify it
>> idx = ismember(N(:,1:2), N(1,1:2), 'rows')
idx =
1
1
0
0
1
>> N(idx, :)
ans =
1 2 3 4
1 2 8 10
1 2 4 7
Then you can remove those rows from the original matrix and repeat.
>> N = N(~idx,:)
N =
4 5 7 9
2 3 6 4
this will give you the results
data1 =[1 2 3 4
1 2 8 10
4 5 7 9
2 3 6 4
1 2 4 7];
data2 = [1 2 3 4
1 2 3 10
1 2 4 7];
[exists,position] = ismember(data1,data2, 'rows')
where the exists vector tells you wheter the row is on the other matrix and position gives you the position...
a less elegant and simpler version would be
array_data1 = reshape (data1',[],1);
array_data2 = reshape (data2',[],1);
matchmatrix = zeros(size(data2,1),size(data1,1));
for irow1 = 1: size(data2,1)
for irow2 = 1: size(data1,1)
matchmatrix(irow1,irow2) = min(data2(irow1,:) == data1(irow2,:))~= 0;
end
end
the matchmatrix is to read as a connectivity matrix where value of 1 indicates which row of data1 matches with which row of data2
How to generate random matrix without repetition in rows and cols with specific range
example (3x3): range 1 to 3
2 1 3
3 2 1
1 3 2
example (4x4): range 1 to 4
4 1 3 2
1 3 2 4
3 2 4 1
2 4 1 3
A way of approaching this problem is to generate a circular matrix and shuffle it.
mat_size = 4
A = gallery('circul', 1:mat_size); % circular matrix
B = A( randperm(length(A)) , randperm(length(A)) ); % shuffle rows and columns with randperm
It gives
A =
1 2 3 4
4 1 2 3
3 4 1 2
2 3 4 1
B =
3 4 1 2
2 3 4 1
4 1 2 3
1 2 3 4
This method should be fast. An 11 size problem is computed in 0.047021 seconds.
This algorithm will do the trick, assuming you want to contain all elements between 1 and n
%// Elements to be contained, but no zero allowed
a = [1 2 3 4];
%// all possible permutations and its size
n = numel(a);
%// initialization
output = zeros(1,n);
ii = 1;
while ii <= n;
%// random permuation of input vector
b = a(randperm(n));
%// concatenate with already found values
temp = [output; b];
%// check if the row chosen in this iteration already exists
if ~any( arrayfun(#(x) numel(unique(temp(:,x))) < ii+1, 1:n) )
%// if not, append
output = temp;
%// increase counter
ii = ii+1;
end
end
output = output(2:end,:) %// delete first row with zeros
It definitely won't be the fastest implementation. I would be curios to see others.
The computation time increases exponentially. But everything up to 7x7 is bearable.
I wrote another code (interesting to compare timings and, if possible, to make it parallel). Also had problem with perms (needed to restart Matlab to be able to generate for 11 elements, I have x64 and 16GB of memory). Than I decided to keep characters instead of the numbers, reducing the memory occupied by the matrix. It, of course, generates all permutations, and I shuffle them in the beginning, selecting in the loop in a new random order. It runs faster this way and 'eats' less memory. Time for 11 x 11 (of course it differs from run to run) is shown in results.
clear all;
t = cputime;
sze = 11;
variations = perms(char(1 : sze)); % permutations
varN = length(variations);
variations = variations(randperm(varN)', :); % shuffle
sudoku = zeros(sze, sze);
sudoku(1, :) = variations(1, :); % set the first row
indx = 2;
for ii = 2 : varN
% take a random index
rowVal = variations(ii, :);
% check that row numbers do not present in table at
% corresponding columns
if (~isempty(find(repmat(rowVal, sze, 1) - sudoku == 0, 1)))
continue;
end;
sudoku(indx, :) = rowVal;
disp(['Found row ' num2str(indx)]);
indx = indx + 1;
if indx > sze, break; end;
end;
disp(cputime - t);
disp(sudoku);
Result
252.9712 seconds
7 11 3 9 6 2 4 1 8 10 5
1 9 6 3 10 7 11 5 2 4 8
9 6 11 8 2 10 1 7 4 5 3
4 10 7 11 1 8 5 2 6 3 9
2 5 9 1 3 6 8 4 10 7 11
10 3 5 6 7 4 2 9 11 8 1
6 4 2 10 8 5 3 11 9 1 7
3 8 10 4 11 1 7 6 5 9 2
11 1 8 5 4 9 6 3 7 2 10
5 2 4 7 9 3 10 8 1 11 6
8 7 1 2 5 11 9 10 3 6 4
Here's a memory-efficient approach. The time it takes is random, but not very large. All possible output matrices are equally likely.
This works by randomly filling the matrix until no more positions are available or until the whole matrix has been filled. The code is commented so it should be obvious how it works.
For size 11 this takes of the order of a few thousands or tens of thousands attempts. On my old laptop that means a (random) running time from a few seconds to tens of seconds.
It could perhaps be sped up using uint8 values instead of double. I don't think that brings a large gain, though.
The code:
clear all
n = 11; %// matrix size
[ ii jj ] = ndgrid(1:n); %// rows and columns of S
ii = ii(:);
jj = jj(:);
success = 0; %// ...for now
attempt = 0; %// attempt count (not really needed)
while ~success
attempt = attempt + 1;
S = NaN(n, n); %// initiallize result. NaN means position not filled yet
t = 1; %// number t is being placed within S ...
u = 1; %// ... for the u-th time
mask = true(1, numel(ii)); %// initiallize mask of available positions
while any(mask) %// while there are available positions
available = find(mask); %// find available positions
r = randi(numel(available), 1); %// pick one available position
itu = ii(available(r)); %// row of t, u-th time
jtu = jj(available(r)); %// col of t, u-th time
S(itu, jtu) = t; %// store t at that position
remove = (ii==itu) | (jj==jtu);
mask(remove) = false; %// update mask of positions available for t
u = u+1; %// next u
if u > n %// we are done with number t
t = t+1; %// let's go with new t
u = 1; %// initiallize u
mask = isnan(S(:)); %// initiallize mask for this t
end
if t > n %// we are done with all numbers
success = 1; %// exit outer loop (inner will be exited too)
end
end
end
disp(attempt) %// display number of attempts
disp(S) %// show result
An example result:
10 11 8 9 7 2 3 4 1 6 5
8 4 2 1 10 11 6 5 7 9 3
2 3 5 6 11 8 1 10 4 7 9
9 8 7 4 6 10 11 3 5 1 2
3 5 9 8 2 1 4 7 6 11 10
11 9 4 5 3 6 2 1 8 10 7
1 2 6 3 8 7 5 9 10 4 11
7 1 11 10 5 4 9 8 2 3 6
4 7 1 2 9 3 10 6 11 5 8
6 10 3 11 1 5 7 2 9 8 4
5 6 10 7 4 9 8 11 3 2 1
i want to control the creation of random numbers in this matrix :
Mp = floor(1+(10*rand(2,20)));
mp1 = sort(Mp,2);
i want to modify this code in order to have an output like this :
1 1 2 2 3 3 3 4 5 5 6 7 7 8 9 9 10 10 10 10
1 2 3 3 3 3 3 3 4 5 6 6 6 6 7 8 9 9 9 10
i have to fill each row with all the numbers going from 1 to 10 in an increasing order and the second matrix that counts the occurences of each number should be like this :
1 2 1 2 1 2 3 1 1 2 1 1 2 1 1 2 1 2 3 4
1 1 1 2 3 4 5 6 1 1 1 2 3 4 1 1 1 2 3 1
and the most tricky matrix that i'v been looking for since the last week is the third matrix that should skim through each row of the first matrix and returns the numbers of occurences of each number and the position of the last occcurence.here is an example of how the code should work. this example show the intended result after running through the first row of the first matrix.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (positions)
1 2
2 2
3 3
4 1
5 2
6 1
7 2
8 1
9 2
10 4
(numbers)
this example show the intended result after running through the second row of the first matrix.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 (positions)
1 1 2
2 1 2
3 3 6
4 1 1
5 3
6 1 4
7 2 1
8 1 1
9 2 3
10 4
(numbers)
so the wanted matrix must be filled up with zeros from the beginning and each time after running through each row of the first matrix, we add the new result to the previous one...
I believe the following code does everything you asked for. If I didn't understand, you need to get a lot clearer in how you pose your question...
Note - I hard coded some values / sizes. In "real code" you would never do that, obviously.
% the bit of code that generates and sorts the initial matrix:
Mp = floor(1+(10*rand(2,20)));
mp1 = sort(Mp, 2);
clc
disp(mp1)
occCount = zeros(size(mp1));
for ii = 1:size(mp1,1)
for jj = 1:size(mp1,2)
if (jj == 1)
occCount(ii,jj) = 1;
else
if (mp1(ii,jj) == mp1(ii,jj-1))
occCount(ii,jj) = occCount(ii, jj-1) + 1;
else
occCount(ii,jj) = 1;
end
end
end
end
% this is the second matrix you asked for
disp(occCount)
% now the third:
big = zeros(10, 20);
for ii = 1:size(mp1,1)
for jj = 1:10
f = find(mp1(ii,:) == jj); % index of all of them
if numel(f) > 0
last = f(end);
n = numel(f);
big(jj, last) = big(jj, last) + n;
end
end
end
disp(big)
Please see if this is indeed what you had in mind.
The following code solves both the second and third matrix generation problems with a single loop. For clarity, the second matrix M2 is the 2-by-20 array in the example containing the cumulative occurrence count. The third matrix M3 is the sparse matrix of size 10-by-20 in the example that encodes the number and position of the last occurrence of each unique value. The code only loops over the rows, using accumarray to do most of the work. It is generalized to any size and content of mp1, as long as the rows are sorted first.
% data
mp1 = [1 1 2 2 3 3 3 4 5 5 6 7 7 8 9 9 10 10 10 10;
1 2 3 3 3 3 3 3 4 5 6 6 6 6 7 8 9 9 9 10]; % the example first matrix
nuniq = max(mp1(:));
% accumulate
M2 = zeros(size(mp1));
M3 = zeros(nuniq,size(mp1,2));
for ir=1:size(mp1,1),
cumSums = accumarray(mp1(ir,:)',1:size(mp1,2),[],#numel,[],true)';
segments = arrayfun(#(x)1:x,nonzeros(cumSums),'uni',false);
M2(ir,:) = [segments{:}];
countCoords = accumarray(mp1(ir,:)',1:size(mp1,2),[],#max,[],true);
[ii,jj] = find(countCoords);
nzinds = sub2ind(size(M3),ii,nonzeros(countCoords));
M3(nzinds) = M3(nzinds) + nonzeros(cumSums);
end
I won't print the outputs because they are a bit big for the answer, and the code is runnable as is.
NOTE: For new test data, I suggest using the commands Mp = randi(10,[2,20]); mp1 = sort(Mp,2);. Or based on your request to user2875617 and his response, ensure all numbers with mp1 = sort([repmat(1:10,2,1) randi(10,[2,10])],2); but that isn't really random...
EDIT: Error in code fixed.
I am editing the previous answer to check if it is fast when mp1 is large, and apparently it is:
N = 20000; M = 200; P = 100;
mp1 = sort([repmat(1:P, M, 1), ceil(P*rand(M,N-P))], 2);
tic
% Initialise output matrices
out1 = zeros(M, N); out2 = zeros(P, N);
for gg = 1:M
% Frequencies of each row
freqs(:, 1) = mp1(gg, [find(diff(mp1(gg, :))), end]);
freqs(:, 2) = histc(mp1(gg, :), freqs(:, 1));
cumfreqs = cumsum(freqs(:, 2));
k = 1;
for hh = 1:numel(freqs(:, 1))
out1(gg, k:cumfreqs(hh)) = 1:freqs(hh, 2);
out2(freqs(hh, 1), cumfreqs(hh)) = out2(freqs(hh, 1), cumfreqs(hh)) + freqs(hh, 2);
k = cumfreqs(hh) + 1;
end
end
toc