Repeat random, unique sampling of k values n times - matlab

In Matlab, I would like to generate a matrix with 4 random, unique samples (out of 10) 7 times.
In order to avoid a for-loop, I thought I could just repeat my data and use datasample from Statistics and Machine Learning Toolbox on the first dimension. But it always chooses the same 4 values from each column, so this is kind of useless.
Consider the following MWE:
randomData = [50.29; 47.72; 48.38; 48.02; 44.23; 47.17; 48.19; 49.11; 50.44; 53.40];
numOfReps = 7;
numOfSamples = 4;
randomDataRepMatrix = randomData*ones(1, numOfReps);
s = RandStream('mlfg6331_64');
y = datasample(s, randomDataRepMatrix, numOfSamples, 'Replace', false);
Even without the RandStream part, I get the same results...
Any idea? Or do I need to use the for-loop after all?

I don't think datasample or randsample can produce several sets of samples in one go. Here's a "manual" way to do it (not necessarily faster than using datasample with a loop):
[~, ind] = sort(rand(numel(randomData), numOfReps)); % each column is a permutation
ind = ind(1:numOfSamples,:); % keep only the first values in each column
y = randomData(ind); % index into data
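For comparison, the loop-based version this replaces might look like the following sketch (one datasample call per repetition; not part of the original answer, just for reference):
y = zeros(numOfSamples, numOfReps);
for k = 1:numOfReps
    y(:, k) = datasample(randomData, numOfSamples, 'Replace', false);
end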


Efficient generation of permutation matrices in MATLAB

I'm trying to generate a 100-by-5 matrix where every row consists of 5 random numbers drawn from 1..100 without repetition within the row.
So far I've only been able to do it iteratively with a for-loop. Is there a way to do it more efficiently (using fewer lines of code), without loops?
N = 100;
T = zeros(N, 5);
for i = 1:N
    T(i, :) = randperm(100, 5);
end
Let
N = 100; % desired number of rows
K = 5; % desired number of columns
M = 100; % size of population to sample from
Here's an approach that's probably fast but memory-expensive, as it generates an intermediate N-by-M matrix and then discards M-K columns:
[~, result] = sort(rand(N, M), 2);
result = result(:, 1:K);
There is very little downside to using a loop here, at least in this minimal example. Indeed, it may well be the best-performing solution for MATLAB's execution engine. But perhaps you don't like assigning the temporary variable i or there are other advantages to vectorization in your non-minimal implementation. Consider this carefully before blindly implementing a solution.
You need to call randperm N times, but each call has no dependency on its position in the output. Without a loop index you will need something else to regulate the number of calls, but this can be just N empty cells cell(N,1). You can use this cell array to evaluate a function that calls randperm but ignores the contents (or, rather, lack of contents) of the cells, and then reassemble the function outputs into one matrix with cell2mat:
T = cell2mat(cellfun(@(~) {randperm(100,5)}, cell(N,1)));

Draw non full matrix of random numbers

I am doing a Monte-Carlo simulation, where each repetition requires the sum or product of a random number of random variables. My problem is how to do this efficiently as the entire simulation should be as vectorized as possible.
For example, say we want to take the sum of 5, 10 and 3 random numbers, represented by the vector len = [5;10;3]. Then what I am currently doing is drawing a full matrix of random numbers:
A = randn(length(len),max(len));
Creating a mask of the unneeded numbers:
lenlen = repmat(len,1,max(len));
idx = repmat(1:max(len),length(len),1);
mask = idx>lenlen;
and then I can "pad" the matrix. Since I am interested in the sum, the padding has to be zero (for the product case, the padding would have to be 1):
A(mask)=0;
To obtain:
A =
1.7708 -1.4609 -1.5637 -0.0340 0.9796 0 0 0 0 0
1.8034 -1.5467 0.3938 0.8777 0.6813 1.0594 -0.3469 1.7472 -0.4697 -0.3635
1.5937 -0.1170 1.5629 0 0 0 0 0 0 0
After which I can sum them together:
B = sum(A,2);
However, I find it rather wasteful that I have to draw too many random numbers and then throw them away. In the real case, I need hundreds of thousands of repetitions, and the vector len might vary a lot, i.e. it can easily happen that I have to draw two or three times as many random numbers as are actually needed.
You can generate the exact amount of random numbers required, create a grouping variable with repelem, and compute the sum of each group using accumarray:
len = [5; 10; 3];
B = accumarray(repelem(1:numel(len), len).', randn(sum(len),1));
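To see how this works, here is the intermediate grouping vector for len = [5;10;3] (shown for illustration): group 1 labels the first 5 numbers, group 2 the next 10, and group 3 the last 3.
g = repelem(1:numel(len), len).'; % [1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3].'
% accumarray(g, x) then returns, for each group index i, sum(x(g == i))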
You could just use arrayfun or a loop. You say "efficient" and "vectorized" in the same breath, but they are not necessarily the same thing: since the new(ish) JIT compiler, loops are pretty fast in MATLAB. arrayfun is basically a loop in disguise, but it means you can create B like so:
len = [5;10;3];
B = arrayfun( @(x) sum( randn(x,1) ), len );
For each element in len, this creates a vector of length len(i) and takes the sum. The output is an array with one value for each value in len.
This will certainly be a lot more memory friendly for large values, and for widely differing values within len. It may therefore be quicker; your mileage may vary, but it cuts out a lot of the operations you're doing.
You mention wanting to take the product sometimes, in which case use prod in place of sum.
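For instance:
B = arrayfun( @(x) prod( randn(x,1) ), len );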
Edit: rough and ready benchmark to compare arrayfun and a loop...
len = randi([1e3, 1e7], 100, 1);
tic;
B = arrayfun( @(x) sum( randn(x,1) ), len );
toc % ~8.77 seconds
tic;
out = zeros(size(len));
for ii = 1:numel(len)
    out(ii) = sum(randn(len(ii),1));
end
toc % ~8.80 seconds
The "advantage" of the loop over arrayfun is you can pre-generate all of the random numbers in one go, then index. This isn't necesarryily quicker because you're addressing much bigger chunks of memory, and the call to randn is the main bottleneck anyway!
tic;
out = zeros(size(len));
rnd = randn(sum(len),1);
idx = [0; cumsum(len)]; % note: cumsum is very quick (~0.001sec here) so negligible
for ii = 1:numel(len)
    out(ii) = sum(rnd(idx(ii)+1:idx(ii+1)),1);
end
toc % ~10.2 sec! Slower because of massive call to randn and the indexing into large array.
As stated at the top, arrayfun and looping are basically the same under the hood, so no reason to expect a big time difference.
The sum of multiple random numbers drawn from a specific distribution is also a random number, with a (different) specific distribution. Therefore you can just cut out the middleman and draw directly from the latter distribution.
In your case you are summing 5, 10 and 3 numbers drawn from an N(0,1) distribution. The sum of k independent N(0,1) variables is N(0,k), so the resulting distributions are N(0,5), N(0,10) and N(0,3). You can draw from such a non-standard normal distribution in Matlab by scaling a standard normal draw by its standard deviation sqrt(k). As such, we can in this case generate those numbers with randn(3,1).*sqrt([5; 10; 3]).
In case you would want 1000 triples, you could then use
randn(3,1000).*sqrt([5; 10; 3])
or, before Matlab R2016b (which introduced implicit expansion),
bsxfun(@times, randn(3,1000), sqrt([5; 10; 3]))
which is of course very fast.
Different distributions have different summation rules, but as long as you are not summing numbers drawn from different distributions, the rules are usually simple and quickly found with Google.
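If you want to sanity-check the shortcut, a quick comparison of sample variances could look like the sketch below (illustration only; the direct loop is deliberately naive):
% Sanity check: the sum of k i.i.d. N(0,1) draws has variance k
len = [5; 10; 3];
nTrials = 1e4;
direct = zeros(numel(len), nTrials);
for t = 1:nTrials
    direct(:, t) = arrayfun(@(k) sum(randn(k,1)), len);
end
shortcut = randn(numel(len), nTrials) .* sqrt(len); % implicit expansion, R2016b+
disp([var(direct, 0, 2), var(shortcut, 0, 2)])      % both columns approach [5; 10; 3]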
You can do this using a combination of cumsum and diff. The plan is:
Create all the random numbers in a single call to randn up front
Then, use cumsum to produce a vector of cumulative summations
Use cumsum on the list of number-of-samples-per-result to work out where to read out the results
We also need diff to correct for the prior summations.
Note that this method might lose accuracy if you weren't using randn for the random samples, as cumsum would then build up arithmetic rounding errors.
% We want 100 sums of random numbers
numSamples = 100;
% Here's where we define how many random samples contribute to each sum
numRandsPerSample = randi(5, 1, numSamples);
% Let's make all the random numbers in one call
allRands = randn(1, sum(numRandsPerSample));
% Use CUMSUM to build up a cumulative sum of the whole of allRands. We also
% need a leading 0 for the first sum.
allRandsCS = [0, cumsum(allRands)];
% Use CUMSUM again to pick out the places we need to pick from
% allRandsCS
endIdxs = 1 + [0, cumsum(numRandsPerSample)];
% Use DIFF to subtract the prior sums from the result.
result = diff(allRandsCS(endIdxs))

Assign labels based on given examples for a large dataset efficiently

I have a matrix X (100000 x 10) and a vector Y (100000 x 1). The entries of X are categorical and take values 1 to 5, and the labels in Y are categorical too (11 to 20).
The rows of X are repetitive; only ~25% of them are unique. I want Y to hold the statistical mode of all the labels for each particular unique row.
Then there comes another dataset P (90000 x 10), and I want to predict its labels Q based on the previous exercise.
What I tried is finding the unique rows of X using unique in MATLAB, and then assigning the statistical mode of the corresponding labels to each unique row. For P, I can use ismember and carry out the same.
The issue is the size of the dataset: it takes 1.5-2 hours to complete the process. Is a vectorized version possible in MATLAB?
Here is my code:
[X_unique, ~, ic] = unique(X, 'rows', 'stable');
labels = zeros(length(X_unique), 1);
for i = 1:length(X_unique)
    labels(i) = mode(Y(ic==i));
end
Q = zeros(length(P), 1);
for j = 1:length(X_unique)
    Q(all(repmat(X_unique(j,:), length(P), 1) == P, 2)) = labels(j);
end
You will be able to accelerate your first loop a great deal if you replace it entirely with:
labels = accumarray(ic, Y, [], @(y) mode(y));
The second loop can be accelerated by using all(bsxfun(@eq, X_unique(i,:), P), 2) inside Q(...). This is a good vectorized approach, assuming your arrays are not extremely large w.r.t. the available memory on your machine. In addition, to save more time, you could apply the unique trick you used on X to P as well, and run all the comparisons on a much smaller array:
[P_unique, ~, IC_P] = unique(P, 'rows', 'stable');
EDIT:
Compute Q_unique in the following way:
Q_unique = zeros(length(P_unique), 1);
for i = 1:length(X_unique)
    Q_unique(all(bsxfun(@eq, X_unique(i,:), P_unique), 2)) = labels(i);
end
and convert back to Q_full to match the original P input:
Q_full = Q_unique(IC_P);
END EDIT
Finally, if memory is an issue, then in addition to everything above you might want to use a semi-vectorized approach inside your second loop:
for i = 1:length(X_unique)
    idx = true(length(P), 1);
    for j = 1:size(X_unique,2)
        idx = idx & (X_unique(i,j) == P(:,j));
    end
    Q(idx) = labels(i);
    % Q(all(bsxfun(@eq, X_unique(i,:), P), 2)) = labels(i);
end
This would take about 3x longer compared with bsxfun, but if memory is limited then you have to pay with speed.
ANOTHER EDIT
Depending on your version of Matlab, you could also use containers.Map to your advantage by mapping textual representations of the numeric sequences to the calculated labels. See example below.
% find unique members of X to work with a smaller array
[X_unique, ~, IC_X] = unique(X, 'rows', 'stable');
% compute labels
labels = accumarray(IC_X, Y, [], @(y) mode(y));
% convert X to cellstr -- textual representation of the number sequence
X_cellstr = cellstr(char(X_unique+48)); % 48 is ASCII for 0
% map each X to its label
X_map = containers.Map(X_cellstr, labels);
% find unique members of P to work with a smaller array
[P_unique, ~, IC_P] = unique(P, 'rows', 'stable');
% convert P to cellstr -- textual representation of the number sequence
P_cellstr = cellstr(char(P_unique+48)); % 48 is ASCII for 0
% --- EDIT --- avoiding error on missing keys in X_map --------------------
% find which P's exist in map
isInMapP = X_map.isKey(P_cellstr);
% pre-allocate Q_unique to the size of P_unique (can be any value you want)
Q_unique = nan(size(P_cellstr)); % NaN is safe to use since not a label
% find the labels for each P_unique that exists in X_map
Q_unique(isInMapP) = cell2mat(X_map.values(P_cellstr(isInMapP)));
% --- END EDIT ------------------------------------------------------------
% convert back to full Q array to match original P
Q_full = Q_unique(IC_P);
This takes about 15 seconds to run on my laptop. Most of which is consumed by computation of mode.

Can someone help vectorise this matlab loop?

I am trying to learn how to vectorise MATLAB loops, so I'm just doing a few small examples.
Here is the standard loop I am trying to vectorise:
function output = moving_avg(input, N)
  output = [];
  for n = N:length(input) % iterate over y vector
    summation = 0;
    for ii = n-(N-1):n % iterate over x vector N times
      summation += input(ii);
    endfor
    output(n) = summation/N;
  endfor
endfunction
I have been able to vectorise one loop, but can't work out what to do with the second loop. Here is where I have got to so far:
function output = moving_avg(input, N)
  output = [];
  for n = N:length(input) % iterate over y vector
    output(n) = mean(input(n-(N-1):n));
  endfor
endfunction
Can someone help me simplify it further?
EDIT:
The input is just a one-dimensional vector of probably at most 100 data points. N is a single integer, less than the size of the input (typically around 5).
I don't actually intend to use it for any particular application; it was just a simple nested loop that I thought would be good for learning about vectorisation.
Seems like you are performing a convolution operation there. So, just use conv -
output = zeros(size(input1));
output(N:end) = conv(input1, ones(1,N), 'valid')./N;
Please note that I have replaced the variable name input with input1, as input is already used as the name of a built-in function in MATLAB, so it's a good practice to avoid such conflicts.
Generic case: For a more general scenario, you can look into bsxfun to create such groups, and then apply whatever operation you intend to perform at the final stage. Here's how such code would look for the sliding/moving average operation -
%// Create groups of indices for each sliding interval of length N
idx = bsxfun(@plus, (1:N).', 0:numel(input1)-N)
%// Index into input1 with those indices to get grouped elements from it along columns
input1_indexed = input1(idx)
%// Finally, choose the operation you intend to perform and apply along the
%// columns. In this case, you are doing average, so use mean(...,1).
output = mean(input1_indexed,1)
%// Also pre-append with zeros if intended to match up with the expected output
MATLAB as a language does this type of operation poorly: you will always require an outer O(N) loop or operation involving at minimum O(K) copies, and it will not be worth the performance cost to vectorize further, because MATLAB is a heavyweight language. Instead, consider using the filter function, where these things are typically implemented in C, which makes this type of operation nearly free.
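For the moving average in question, a filter-based version might look like this (a sketch; filter uses zero initial conditions, so the first N-1 outputs are partial sums, which we reset to zero to match the original function):
b = ones(1, N) / N;       % FIR averaging kernel
y = filter(b, 1, input1); % y(n) = mean(input1(n-N+1:n)) for n >= N
y(1:N-1) = 0;             % the original leaves these entries at zero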
For a sliding average, you can use cumsum to minimize the number of operations:
x = randi(10,1,10); %// example input
N = 3; %// window length
y = cumsum(x); %// compute cumulative sum of x
z = zeros(size(x)); %// initialize result to zeros
z(N:end) = (y(N:end)-[0 y(1:end-N)])/N; %// compute order N difference of cumulative sum

How to compare two unsorted lists in Matlab?

I have two lists of 2-dimensional points given as M x 2 - and N x 2 - matrices, respectively, with M and N possibly being very large.
What is the fastest way to determine how many of them are equal?
I am not sure whether you want to count repeated entries, but if not, you could use intersect or a quite intuitive algorithm based on sorting (see below). I would not recommend a nested-loop version...
function test_compareVecs()
%% create some random data
N = 31415;
M1 = 100000;
M2 = 200000;
vec = rand(N,2);
v1 = [rand(M1-N,2); vec];
v2 = [rand(M2-N,2); vec];
v1 = v1(randperm(M1),:);
v2 = v2(randperm(M2),:);
%% intersect
disp('intersect:');
tic
s = size(intersect(v1,v2,'rows'),1);
toc;
s
%% alternative approach
disp('alternative approach:');
tic;
s = compareVecs(v1,v2);
toc;
s
end
function s = compareVecs(v1,v2)
%% create help vector
help_vec = [[v1,zeros(size(v1,1),1)]; ...
[v2,ones(size(v2,1),1)]];
%% sort by first column
% note: for some reason "sortrows(help_vec,1)" is slower
hash_vec = help_vec(:,1); % dummy hash
[~,sidx] = sort(hash_vec);
help_vec = help_vec(sidx,:);
%% diff + compare
help_vec = diff(help_vec);
s = sum(help_vec(:,1) == 0 & ...
help_vec(:,2) == 0 & ...
help_vec(:,3) ~= 0);
end
Result
intersect:
Elapsed time is 0.145717 seconds.
s = 31415
alternative approach:
Elapsed time is 0.048084 seconds.
s = 31415
Compute all pair-wise distances with pdist2 and then count pairs with zero distance. If the coordinates are float values, you may want to use a tolerance instead of comparing against zero:
%// Data:
M = 10;
N = 8;
listM = randi(10,M,2)-1;
listN = randi(10,N,2)-1;
tol = 1e-6;
%// Distance matrix:
d = pdist2(listM, listN);
%// Count:
count = sum(d(:)<tol);
This should work irrespective of the order of the points in each list, or their lengths. It is a hash-table/dictionary solution that should be fast, with memory demand linear in the lengths of the lists. Please note that the syntax below may not be perfect, but a quick reference to the main data structures mentioned should make corrections trivial (see the sketch after the steps).
(1) Populate a dictionary-like containers.Map, in a way that the key is a unique function of the points, e.g. [num2str(M(i,1)) '-' num2str(M(i,2))].
(2) Then, go over all elements of the second list, create the key just as in (1), and check if it exists. If it does, set map(key)=1; otherwise set it to 0. In the end, all the keys corresponding to common points will have 1s stored, and the rest will be zeros.
(3) Finalize by summing over the values of the map (something like sum(cell2mat(map.values))), which should give you the total number of unique intersections among the two sets, irrespective of the order these points appear in each list.
OBS: if you don't want to count just unique intersections but all repeated points, then in (2), rather than setting map(key)=1, add 1 to map(key). The rest is the same.
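A minimal sketch of these steps (keyOf is a hypothetical key function; the exact key format is up to you, and for float coordinates a rounding scheme may be needed):
listM = randi(10, 10, 2) - 1; % example data
listN = randi(10, 8, 2) - 1;
keyOf = @(p) sprintf('%.6g-%.6g', p(1), p(2));
map = containers.Map('KeyType', 'char', 'ValueType', 'double');
for i = 1:size(listM, 1)      % (1) populate from the first list
    map(keyOf(listM(i, :))) = 0;
end
for i = 1:size(listN, 1)      % (2) flag keys also seen in the second list
    k = keyOf(listN(i, :));
    if map.isKey(k)
        map(k) = 1;
    end
end
count = sum(cell2mat(map.values)); % (3) number of unique intersections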