How to select randomly and fairly data in matlab - matlab

How can I select randomly and fairly some data from a dataset in matlab?
When we use the randperm function to select data, they are random and fair?

As you already suggested, selecting k uniformly random chosen rows out of n can be done with randperm, assuming you don't want duplication.
Example:
dataSet = rand(1000,4);
idx = randperm(size(dataSet,1),10)
dataSet(idx,:)

If you have the Statistics Toolbox, you can use randsample:
sample = randsample(data,k);
takes k values sampled uniformly at random, without replacement, from the values in the vector data. See above link for other options.
Equivalent code with randperm:
ind = randperm(numel(data));
sample = data(ind(1:k));
Yes, either of these approaches gives random samples, and yes, they are fair. I assume that by "fair" you mean "uniform": each entry of data is picked with the same probability.

anything that uses uniform distribution is "fair". because the output is supposed to be distributed randomly in an specific range. for example, rand function in matlab.

Related

Clustering with multiple metrics in Matlab

I have a data set that contains both categorical and numerical features for each row. I would like to select a different similarity metric for each feature (column) and preform hierarchical clustering on the data. Is there a way to do that in Matlab?
Yes, this is actually fairly straightforward: linkage, which creates the tree, takes as input a dissimilarity matrix. Consequently, in the example workflow below
Y = pdist(X,'cityblock');
Z = linkage(Y,'average');
T = cluster(Z,'cutoff')
you simply replace the call to pdist with a call to your own function that calculates the pairwise dissimilarity between rows, everything else stays the same.

Generating similar but different vectors

I want to generate 100 vectors each of size 1x7. I have the following code currently, but when I plot it, it seems to be too linearly spaced. Is there a way to achieve a similar result only rougher?
P = randi([7 12],100,7)'/10.* repmat(randn(1,7),100,1)';
You may use different distribution for the randomizing part. randi is using uniformly distribution. You can use rng function to control random number generation. There are different generators like :
'twister' Mersenne Twister
'combRecursive' Combined Multiple Recursive
'multFibonacci' Multiplicative Lagged Fibonacci
as an example:
rng('shuffle')
rng(1);
A = rand(2,2);
rng(2);
B = rand(2,2);
it produces different numbers each time.
check this link for more info.

Parametric random number generation with MATLAB

I was wondering if it is possible to generate a random distribution that is a function of a certain parameter. In other words, using MATLAB I type rand(1,5) I have a uniformly random distribution of 5 numbers between 0 and 1. It is possible to have this result as a function of a certain parameter? Do you know any algorithm about that? I just need that in an interval don't need a 2D representation.
I think you want to do this:
http://en.wikipedia.org/wiki/Inverse_transform_sampling
In MATLAB, it's quite straightforward, you simply specify the function.
n = 10000; % number of random draws
r = rand(n, 1); % generate uniform random numbers
f = #norminv; % specify transforming function
tr = f(r); % transformed numbers, now normally distributed
hist(tr, 30) % plot histogram
This example is a bit contrived, since we could simply have used randn. But the method holds generally.
If you have the Statistics toolbox, and you want to sample from one of the popular distributions, take a look at the random number generators that are available to you, link.

Is there a statistical difference between generating many random vectors vs a single random matrix

Is there a statistical difference between generating a series of paths for a montecarlo simulation using the following two methods (note that by path I mean a vector of 350 points, normally distributed):
A)
for path = 1:300000
Zn(path, :) = randn(1, 350);
end
or the far more efficient B)
Zn = randn(300000, 350);
I just want to be sure there is no funny added correlation or dependence between the rows in method B that isn't present in method A. Like maybe method B distributes normally over 2 dimensions where A is over 1 dimension, so maybe that makes the two statistically different?
If there is a difference then I need to know the same for uniform distributions (i.e. rand instead of randn)
Just to add to the answer of #natan (+1), run the following code:
%# Store the seed
Rng1 = rng;
%# Get a matrix of random numbers
X = rand(3, 3);
%# Restore the seed
rng(Rng1);
%# Get a matrix of random numbers one vector at a time
Y = nan(3, 3);
for n = 1:3
Y(:, n) = rand(3, 1);
end
%# Test for differences
if any(any(X - Y ~= 0)); disp('Error'); end;
You'll note that there is no difference between X and Y. That is, there is no difference between building a matrix in one step, and building a matrix from a sequence of vectors.
However, there is a difference between my code and yours. Note I am populating the matrix by columns, not rows, since when rand is used to construct a matrix in one step, it populates by column. By the way, I'm not sure if you realize, but as a general rule you should always try and perform vector operations on the columns of matrices, not the rows. I explained why in a response to a question on SO the other day; see here for more...
Regarding the question of independence/dependence, one needs to be careful with the language one uses. The sequence of numbers generated by rand are perfectly dependent. For the vast majority of statistical tests, they will appear to be independent - nonetheless, in theory, one could construct a statistical test that would demonstrate the dependency between a sequence of numbers generated by rand.
Final thought, if you have a copy of Greene's "Econometric Analysis", he gives a neat discussion of random number generation in section 17.2.
As far as the base R's random number generator is concerned, also, there doesn't appear to be any difference between generating a sequence of random numbers at once or doing it one-by one. Thus, #Colin T Bowers' (+1) suggested behavior above also holds in R. Below is an R version of Colin's code:
#set seed
set.seed(1234)
# generate a sequence of 10,000 random numbers at once
X<-rnorm(10000)
# reset the seed
set.seed(1234)
# create a vector of 10,000 zeros
Y<-rep(0,times=10000)
# generate a sequence of 10,000 random numbers, one at a time
for (i in 1:10000){
Y[i]<-rnorm(1)
}
# Test for differences
if(any(X-Y!=0)){print("Error")}

Matlab - Stratified Sampling of Multidimensional Data

I want to divide a corpus into training & testing sets in a stratified fashion.
The observation data points are arranged in a Matrix A as
A=[16,3,0;12,6,4;19,2,1;.........;17,0,2;13,3,2]
Each column of the matrix represent a distinct feature.
In Matlab, the cvpartition(A,'holdout',p) function requires A to be a vector. How can I perform the same action with A as a Matrix i.e. resulting sets have roughly the same distribution of each feature as in the original corpus.
By using a matrix A rather than grouped data, you are making the assumption that a random partition of your data will return a test and train set with the same column distributions.
In general, the assumption you are making in your question is that there is a partition of A such that each of the marginal distributions of A (1 per column) has the same distribution across all three variables. There is no guarantee that this is true. Check whether the columns of your matrix are correlated. If they are not, simply partition on 1 and use the row indices to define a test matrix:
cv = cvpartition(A(:, 1), 'holdout', p);
text_mat = A(cv.test, :);
If they are correlated, you may need to go back and reconsider what you are trying to do.