MATLAB global stream: any correlation between generated sets of numbers?

I'm just looking for some clarification on creating sets of random numbers in MATLAB and how this relates to the 'global stream'.
I know that I can set the global stream for reproducibility of my results should I run the code again:
s = RandStream('mt19937ar','Seed',7);
RandStream.setGlobalStream(s);
A = rand(1,10);
Every time I run this, A is the same. For example,
s = RandStream('mt19937ar','Seed',7);
RandStream.setGlobalStream(s);
B = rand(1,10);
I should find that isequal(A,B) is true.
Now my question pertains to the following,
s = RandStream('mt19937ar','Seed',7);
RandStream.setGlobalStream(s);
A = rand(1,10);
B = rand(1,10);
If I run this, then A and B are different sets of numbers. Can I take them to be independent sets, or is there some correlation between them? If I wanted to ensure stronger independence between A and B, should I create a new and different global stream after creating A but before creating B? For example,
sA = RandStream('mt19937ar','Seed',7);
RandStream.setGlobalStream(sA);
A = rand(1,10);
sB = RandStream('mt19937ar','Seed',3);
RandStream.setGlobalStream(sB);
B = rand(1,10);

MATLAB generates random numbers from a known but complex function. All pseudorandom number generators are based on deterministic algorithms, and all will fail a sufficiently specific statistical test for randomness.
When you change the seed (which you can also do with rng(your_desired_seed_number)), the generator just uses another part of the same function, which is not unrelated to the previous random number sequence (at least I think so; it is really a mathematical question).
So I suggest using different generators to get random numbers that are as independent as possible:
rng(5,'twister'); % you could also use RandStream instead of rng
A=rand(1,10);
rng(3,'combRecursive');
B=rand(1,10);
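If the goal is two streams that are designed to be used side by side, one option the RandStream class offers is to create several streams of one generator type in a single call. The sketch below assumes the 'mrg32k3a' generator (one of the types that supports multiple streams) and uses the illustrative names sA and sB. It draws from each stream explicitly instead of replacing the global stream.

```matlab
% Create two streams of the same generator type in one call.
% 'mrg32k3a' is assumed here because it supports multiple streams.
[sA, sB] = RandStream.create('mrg32k3a', 'NumStreams', 2, 'Seed', 7);

% Draw from each stream directly; the global stream is left untouched.
A = rand(sA, 1, 10);
B = rand(sB, 1, 10);
```

Passing the stream as the first argument to rand keeps the two sets of draws separate without any calls to RandStream.setGlobalStream.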

Related

K-means Stopping Criteria in Matlab?

I'm implementing the k-means algorithm in MATLAB without using the built-in kmeans function. The stopping criterion is that the new centroids don't change between iterations, but I cannot implement it in MATLAB. Can anybody help?
Thanks
Setting "no change" as a stopping criterion is a bad idea. There are a few main reasons you shouldn't use a zero-change condition:
Even for a well-behaved function, the difference between zero change and a very small change (say 1e-5) could be 1000+ iterations, so you waste time trying to get the values to be exactly the same, especially because computers usually keep far more digits than we are interested in. If you only need 1 digit of accuracy, why wait for the computer to find an answer within 1e-31?
Computers have floating-point errors everywhere. Try some easily reversible matrix operations like a = rand(3,3); b = a*a*inv(a); a-b. Theoretically this should be 0, but you will see it isn't. These errors alone could prevent your program from ever stopping.
Dithering. Let's say we have a 1-D k-means problem with 3 numbers and we want to split them into 2 groups. In one iteration the grouping can be a,b vs c, in the next a vs b,c, in the next a,b vs c again, and so on. This is of course a simplified example, but there can be instances where a few data points dither between clusters, and you end up with a never-ending algorithm. Since those few points keep being reassigned, the change will never be 0.
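The floating-point point above can be seen with two scalars. This is a minimal sketch; the tolerance 1e-12 is an arbitrary choice for illustration, not a recommended value.

```matlab
a = 0.1 + 0.2;
b = 0.3;

exactly_equal = isequal(a, b);        % false: the two doubles differ in the last bits
close_enough  = abs(a - b) < 1e-12;   % true: they agree to within the tolerance
```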
The solution is to use a delta threshold: subtract the current values from the previous ones, and if the difference is less than a threshold you are done. This on its own is powerful, but as with any loop, you need a backup escape plan, and that is a max_iterations variable. Look at MATLAB's documentation for kmeans; even it has a MaxIter option (default is 100), so even if your k-means doesn't converge, at least it won't run endlessly. Something like this might work:
% problem specific
max_iter = 100;
% choose a small number appropriate to your problem
thresh = 1e-3;
% ensures the loop runs the first time
delta_mu = thresh + 1;
num_iter = 0;
% curr_mu must already hold your initial centroids at this point
% do your kmeans in the loop
while (delta_mu > thresh && num_iter < max_iter)
    % save these right away
    old_mu = curr_mu;
    % calculate new means and variances, this is the standard kmeans iteration,
    % then store the values in a variable called curr_mu
    curr_mu = newly_calculate_values;
    % use the 2-norm to find the delta as a single number, no matter what
    % the original dimensionality of mu was. If old_mu - curr_mu is
    % 0 the norm is still 0, so it behaves well as a distance measure.
    delta_mu = norm(old_mu - curr_mu,2);
    num_iter = num_iter + 1;
end
Edit: in case you don't know, the 2-norm is essentially the Euclidean distance.
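To make the stopping rule concrete, here is a self-contained sketch that fills in the placeholder steps of the loop. The toy data X, the choice k = 2, the fixed seed, and the way empty clusters are handled (the old centroid is kept) are all assumptions made for illustration, not part of the original answer.

```matlab
rng(0);                                   % fixed seed so the run is repeatable
X = [randn(50,2); randn(50,2) + 5];       % toy data: two blobs (assumed)
k = 2;
max_iter = 100;
thresh = 1e-3;

curr_mu = X(randperm(size(X,1), k), :);   % initial centroids taken from the data
delta_mu = thresh + 1;                    % ensures the loop runs the first time
num_iter = 0;

while delta_mu > thresh && num_iter < max_iter
    old_mu = curr_mu;

    % assignment step: squared distance of every point to every centroid
    d = zeros(size(X,1), k);
    for j = 1:k
        d(:,j) = sum(bsxfun(@minus, X, curr_mu(j,:)).^2, 2);
    end
    [~, label] = min(d, [], 2);

    % update step: keep the old centroid if a cluster received no points
    for j = 1:k
        members = (label == j);
        if any(members)
            curr_mu(j,:) = mean(X(members,:), 1);
        end
    end

    % Euclidean distance between old and new centroids, as one number
    delta_mu = norm(old_mu(:) - curr_mu(:));
    num_iter = num_iter + 1;
end
```

The centroids are flattened with (:) before calling norm so that the result is the Euclidean length of the change rather than a matrix norm.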

K-Means centroids getting marginalized to having no data points [Matlab]

So I have a sort of strange problem. I have a dataset with 240 points and I'm trying to use k-means to cluster it into 100 clusters. I'm using Matlab but I don't have access to the statistics toolbox, so I had to write my own k-means function. It's pretty simple, so that shouldn't be too hard, right? Well, it seems something is wrong with my code:
function result=Kmeans(X,c)
[N,n]=size(X);
index=randperm(N);
ctrs = X(index(1:c),:);
old_label = zeros(1,N);
label = ones(1,N);
iter = 0;
while ~isequal(old_label, label)
old_label = label;
label = assign_labels(X, ctrs);
for i = 1:c
ctrs(i,:) = mean(X(label == i,:));
if sum(isnan(ctrs(i,:))) ~= 0
ctrs(i,:) = zeros(1,n);
end
end
iter = iter + 1;
end
result = ctrs;
function label = assign_labels(X, ctrs)
[N,~]=size(X);
[c,~]=size(ctrs);
dist = zeros(N,c);
for i = 1:c
dist(:,i) = sum((X - repmat(ctrs(i,:),[N,1])).^2,2);
end
[~,label] = min(dist,[],2);
It seems what happens is that when I go to recompute the centroids, some centroids have no datapoints assigned to them, so I'm not really sure what to do with that. After doing some research on this, I found that this can happen if you supply arbitrary initial centroids, but in this case the initial centroids are taken from the datapoints themselves, so this doesn't really make sense. I've tried re-assigning these centroids to random datapoints, but that causes the code to not converge (or at least after letting it run all night, the code never converged). Basically they get re-assigned, but that causes other centroids to get marginalized, and repeat. I'm not really sure what's wrong with my code, but I ran this same dataset through R's k-means function for k=100 for 1000 iterations and it managed to converge. Does anyone know what I'm messing up here? Thank you.
Let's step through your code one piece at a time and discuss what you're doing with respect to what I know about the k-means algorithm.
function result=Kmeans(X,c)
[N,n]=size(X);
index=randperm(N);
ctrs = X(index(1:c),:);
old_label = zeros(1,N);
label = ones(1,N);
This looks like a function that takes in a data matrix of size N x n, where N is the number of points in your dataset and n is the dimension of each point. The function also takes in c, the desired number of output clusters. index is a random permutation of 1 to N, and you use its first c entries to pick the data points that initialize your cluster centres.
iter = 0;
while ~isequal(old_label, label)
old_label = label;
label = assign_labels(X, ctrs);
for i = 1:c
ctrs(i,:) = mean(X(label == i,:));
if sum(isnan(ctrs(i,:))) ~= 0
ctrs(i,:) = zeros(1,n);
end
end
iter = iter + 1;
end
result = ctrs;
For k-means, we basically keep iterating until the cluster membership of each point from the previous iteration matches with the current iteration, which is what you have going with your while loop. Now, label determines the cluster membership of each point in your dataset. Now, for each cluster that exists, you determine what the mean data point is, then assign this mean data point as the new cluster centre for each cluster. For some reason, should you experience any NaN for any dimension of your cluster centre, you set your new cluster centre to all zeroes instead. This looks very abnormal to me, and I'll provide a suggestion later. Edit: Now I understand why you did this. This is because should you have any clusters that are empty, you would simply make this cluster centre all zeroes as you wouldn't be able to find the mean of empty clusters. This can be solved with my suggestion for duplicate initial clusters towards the end of this post.
function label = assign_labels(X, ctrs)
[N,~]=size(X);
[c,~]=size(ctrs);
dist = zeros(N,c);
for i = 1:c
dist(:,i) = sum((X - repmat(ctrs(i,:),[N,1])).^2,2);
end
[~,label] = min(dist,[],2);
This function takes in a dataset X and the current cluster centres for this iteration, and it should return a label list of where each point belongs to each cluster. This also looks correct because for each column of dist, you are calculating the distance between each point to each cluster, where those distances are in the ith column for the ith cluster. One optimization trick that I would use is to avoid using repmat here and use bsxfun which handles the replication internally. Therefore, do this instead:
function label = assign_labels(X, ctrs)
[N,~]=size(X);
[c,~]=size(ctrs);
dist = zeros(N,c);
for i = 1:c
dist(:,i) = sum(bsxfun(@minus, X, ctrs(i,:)).^2, 2);
end
[~,label] = min(dist,[],2);
Now, this all looks correct. I also ran some tests myself and it all seems to work out, provided that the initial cluster centres are unique. One small problem with k-means is that we implicitly assume that all cluster centres are unique. Should they not be unique, then you'll run into a problem where two clusters (or more) have the exact same initial cluster centres.... so which cluster should the data point be assigned to? When you're doing the min in your assign_labels function, should you have two identical cluster centres, the cluster label that the point gets assigned to will be the minimum of these two numbers. This is why you will have a cluster with no points in it, as all of the points that should have been assigned to this cluster get assigned to the other.
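The tie-breaking effect described above can be reproduced on a tiny example. The three points and the duplicated centre below are made up for illustration. min returns the index of the first minimum, so every point goes to cluster 1 and cluster 2 stays empty, which is where the NaN centroid comes from.

```matlab
X = [0 0; 1 1; 2 2];          % three toy points (assumed)
ctrs = [1 1; 1 1];            % two identical cluster centres

dist = zeros(size(X,1), size(ctrs,1));
for i = 1:size(ctrs,1)
    dist(:,i) = sum(bsxfun(@minus, X, ctrs(i,:)).^2, 2);
end
[~, label] = min(dist, [], 2);     % ties go to the first column

empty_second = ~any(label == 2);   % true: no point was assigned to cluster 2
mu2 = mean(X(label == 2, :));      % mean over zero rows gives NaN
```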
As such, you may have two (or more) initial cluster centres that are the same upon randomization. Even though the permutation of the indices to select are unique, the actual data points themselves may not be unique upon selection. One thing that I can impose is to loop over the permutation until you get a unique set of initial clusters without repeats. As such, try doing this at the beginning of your code instead.
[N,n]=size(X);
index=randperm(N);
ctrs = X(index(1:c),:);
while size(unique(ctrs, 'rows'), 1) ~= c
index=randperm(N);
ctrs = X(index(1:c),:);
end
old_label = zeros(1,N);
label = ones(1,N);
iter = 0;
%// While loop appears here
This will ensure that you have a unique set of initial clusters before you continue on in your code. Now, going back to your NaN stuff inside the for loop. I honestly don't see how any dimension could result in NaN after you compute the mean if your data doesn't have any NaN to begin with. I would suggest you get rid of this in your code as (to me) it doesn't look very useful. Edit: You can now remove the NaN check as the initial cluster centres should now be unique.
This should hopefully fix your problems you're experiencing. Good luck!
"Losing" a cluster is not half as special as one may think, due to the nature of k-means.
Consider duplicates. Let's assume that all your first k points are identical; what would happen in your code? There is a reason you need to handle this case carefully. The simplest solution would be to leave the centroid as it was before, and live with degenerate clusters.
Given that you only have 240 points but want to use k=100, don't expect very good results. Most objects will be on their own. Choosing a k that is much too large is probably a reason why you see this degeneration effect a lot. Let's assume that, out of these 240, fewer than 100 are unique. Then you cannot have 100 non-empty clusters. Plus, I would consider this kind of result "overfitting" anyway.
If you don't have the toolboxes you need in MATLAB, maybe you should move on to free software. Octave, R, Weka, ELKI, ... there is plenty of software, some of which is much more powerful when it comes to clustering than pure MATLAB (in particular, if you don't have the toolboxes).
Also benchmark. You will be surprised by the performance differences.

Generating Data Set in Matlab

I wanted to ask how to generate a data set in Matlab. I need it to test Feature Selection Algorithms on high dimensional data... The data set should be synthetic, multivariate and contain INTERACTING features.
Synthetic data sets like the MONK's problems are available at http://archive.ics.uci.edu/ml/datasets/MONK%27s+Problems. Unfortunately, I have no clue how to visualize, generate and modify the data according to my needs. The goal is to run an algorithm which detects interacting features.
I will be very thankful for a kind reply.
I'm not sure this is what you are looking for, but if I needed to do this, I would start by generating anonymous functions and generic variable names that I could apply randomly within a dataset.
For example, you could generate a dataset:
dataset = rand(100,6);
and create a few functions which include interdependencies
interact = @(x) x.*x;        % element-wise, so it works on a block of rows
interact2 = @(x) x.*(x-1);
then create a random logical distribution
y = round(rand(100,1)); %(100 rows of random 0's or 1's)
go through the dataset and use the interact function on only rows where y is true
dataset(y == 1,:) = interact(dataset(y==1,:));
repeat the above with the other interaction functions you define if you desire. it would probably be useful to do this so that you can avoid row dependencies (see below) so generating a few datasets could be in order, i.e.
dataset2 = dataset; % start from a copy so dataset2 keeps all 100 rows
dataset2(y==1,:) = interact2(dataset(y==1,:));
A similar approach might be taken with variables (in the example set it shows some categorical variables).
myVariable = repmat('data', 100, 1);
listofvariables = genvarname(cellstr(myVariable));
y = round(rand(100,1)); % logical index for the data
randomly select a generic variable to repeat
applyvar = randi(100); % random index from 1 to 100 (round(rand*100) could return 0)
selectedVariable = listofvariables(applyvar);
replace indices of the variable list with your repeated variable
listofvariables(y == 1) = selectedVariable;
put together the dataset(s) in some order of your choosing
[cellstr(num2str(dataset(:,1))) listofvariables cellstr(num2str(dataset(:,2))) cellstr(num2str(dataset2(:,2)))]
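As a different and simpler sketch of "interacting" features (an assumption about what the question is after, not the method described above), the label below depends on features 1 and 2 only through their combination. Taken alone, each of the two features is uninformative about the label in distribution, while together they determine it. The sizes, thresholds and names are arbitrary.

```matlab
rng(1);                          % fixed seed, chosen arbitrarily
N = 200;
X = rand(N, 6);                  % six features, uniform on (0,1)

% The label is an XOR of two thresholded features, so it depends on
% features 1 and 2 jointly; features 3 to 6 are pure noise.
y = xor(X(:,1) > 0.5, X(:,2) > 0.5);

data = [X, double(y)];           % last column holds the label
```

A feature selection method that scores features one at a time would be expected to struggle with this set, which makes it a possible test case for methods that look for interactions.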

Creating an uncertain model based on a family of multiple inputs multiple outputs models (MIMOs)

Objective
Currently I am trying to create an uncertain system based on a family of statespace models using ucover. For this I am basing my script on the documentation "Modeling a Family of Responses as an Uncertain System" which shows the technique for creating an uncertain system based on a single-input-single-output system (SISO) explicitly but makes it clear that this is fully useable for MIMO systems as well.
Technical details
Specifically it is stated with the documentation of ucover that it supports MIMO systems:
USYS = ucover(PARRAY,PNOM,ORD1,ORD2,UTYPE) returns an uncertain
system USYS with nominal value PNOM and whose range of behaviors
includes all LTI responses in the LTI array PARRAY. PNOM and PARRAY
can be SS, TF, ZPK, or FRD models. USYS is of class UFRD if PNOM
is an FRD model and of class USS otherwise.
ORD1 and ORD2 specify the order (number of states) of each diagonal
entry of W1 and W2. If PNOM has NU inputs and NY outputs, ORD1 and ORD2
should be vectors of length:
UTYPE        ORD1       ORD2
InputMult    NU-by-1    NU-by-1
OutputMult   NY-by-1    NY-by-1
Additive     NY-by-1    NU-by-1
In my case I am using 2 inputs and 2 outputs, so both ORD1 and ORD2 should be 2-by-1. I am using 8 as the number of states for W1 and W2 (just because; I will try adjusting that once this issue is sorted).
The Attempt
Based on the SISO example I have attempted to create a MIMO example, this is shown below
noInputs=2;
noOutputs=2;
noOfStates=4;
Anom=rand(noOfStates,noOfStates);
Bnom=rand(noOfStates,noInputs);
Cnom=rand(noOutputs,noOfStates);
Dnom=rand(noOutputs,noInputs);
Pnom=ss(Anom, Bnom, Cnom, Dnom);
p1 = Pnom*tf(1,[.06 1]); % extra lag
p2 = Pnom*tf([-.02 1],[.02 1]); % time delay
p3 = Pnom*tf(50^2,[1 2*.1*50 50^2]);
Parray = stack(1,p1,p2,p3);
Parrayg = frd(Parray,logspace(-1,3,60));
[P,Info] = ucover(Parrayg,Pnom,[8 8]',[8 8]','InputMult');
Wt = Info.W1;
bodemag((Pnom-Parray)/Pnom,'b--',Wt,'r'); grid
title('Relative Gaps vs. Magnitude of Wt')
The problem
Unlike the image in the documentation my uncertain model (when put through a bode plot) only shows a response on the lead diagonal. See the screenshot for what I mean:
Where blue is the individual models and red is the uncertain model
Question
How can I create an uncertain system based on a family of MIMO statespace models that correctly covers responses between all inputs and outputs?
If you use [8,8]' as your uncertainty order structure for ord1 and ord2, MATLAB will try to fit two diagonal blocks for each of them.
However, MATLAB only supports diagonal weighting functions (due to some complications with the nonconvex search), and what you are plotting is the diagonal weighting that multiplies the 2x2 full-block LTI dynamic uncertainty. W1 affects the rows and W2 affects the columns of the uncertainty.
Hence you should check samples of that uncertainty multiplied by the weights and then the plant, and compare them with the stack of models. Notice that your off-diagonal entries are practically zero (<1e-10), hence almost decoupled. But the W1, W2 search works with the H-infinity norm, so you don't get to see perfect covering in each block of the Bode plot. It combines the rows/columns of the required minimum uncertainty amount (see the examples in the help file). That's why you see one plot per weight in the demos.
If you would like to model the uncertainty affecting each block separately, then you need to form a new augmented LFT in which the uncertainty is four 1x1 (scalar) LTI dynamic uncertainties on the diagonal; then you can have four entries in ord1 and ord2.
Since this is a MIMO system, you shouldn't compare things element-by-element. You are using the input-multiplicative form, so the uncertain system being created is of the form
Pnom*(I + W1*Delta*W2), where Delta is any stable (2-by-2, in this case) system, with ||Delta|| <= 1. So, to verify that the produced uncertain model "covers" your array of system, you should think of the equation
Parray = Pnom*(I + W1*Delta*W2)
and solve for Delta. Plot it (with SIGMA, say), and you will see that it is less than 1 in magnitude for all frequencies. The MATLAB code would be
sigma(inv(W1)*inv(Pnom)*(Parrayg-Pnom)*inv(W2))
Now, using the syntax you specified, you are using weights W1 and W2 of the following form:
W1 = [W1_11 0;
0 W1_22]
and
W2 = [W2_11 0;
0 W2_22]
where you've specified 8th-order fits for all nonzero entries. Certainly for your example, this is overkill (although on a richer problem, it might be fine).
I would try much simpler, like
ucover(Parrayg,Pnom,3,[],'InputMult')
That syntax will make an uncertain model of the form
Pnom*(I + w1*Delta)
where w1 is a scalar, 3rd order system. You could still see the covering by plotting SIGMA(Delta), namely
sigma((1/w1)*inv(Pnom)*(Parrayg-Pnom))
I hope that helps.
In order to create discrete or continuous time uncertain systems you can use uss associated with ureal.
Quick example
Define an uncertain propeller radius
% Propeller radius (m)
rp = ureal('rp',13.4e-2,'Range',[0.08 0.16]);
Define uncertain continuous time system
tenzo_unc = uss(A,Bw,Clocal,D,'statename',states,'inputname',inputs,'outputname',outputsLocal);
Simulate step response:
N = 5;
% Take some samples of the uncertain system and compute bounds on the uncertainties
for i=1:1:N
sys{i} = usample(tenzo_unc);
step(sys{i})
hold on
cprintf('text','.');
end
Complete example
Quadcopter uncertain linearized model control with LQR. Code is available here
Step response
Closed Loop Step response
https://gist.github.com/GiovanniBalestrieri/f90a20780eb2496e730c8b74cf49dd0f
NB:
If you don't have the utility cprintf, include this script in your folder and use it.

Random number - Choose seed

For a project I need pseudorandom numbers with a normal distribution.
For this I usually write:
nn_u = complex((normrnd(0,1.0,size(H_u))),(normrnd(0,1.0,size(H_u))));
nn_v = complex((normrnd(0,1.0,size(H_u))),(normrnd(0,1.0,size(H_u))));
nn_w = complex((normrnd(0,1.0,size(H_u))),(normrnd(0,1.0,size(H_u))));
% where size(H_u) is [4096 1]
This way I don't have any real access to the seed. I expect that the code above uses six seeds, one for each of the six calls to normrnd.
What I'd like to do now is generate six independent realizations, just as above, from a single seed that I can pick from the range [1,999].
To achieve this I was thinking of proceeding this way:
n = 4096;
nn_tmp = normrnd(0,1,[n*6,1]);
nn_u = complex(nn_tmp(1:n,1),nn_tmp(n+1:2*n,1));
nn_v = complex(nn_tmp(2*n+1:3*n,1),nn_tmp(3*n+1:4*n,1));
nn_w = complex(nn_tmp(4*n+1:5*n,1),nn_tmp(5*n+1:6*n,1));
But this way, I don't have any direct access to the seed; I don't even know if the kind of operation I'd do has any strong theoretical validation.
Any support would be welcome.
I think you can use rng to seed and then use randn instead of normrnd for your problem
So something like
SEED = 120; %for example
rng(SEED, 'twister');
nn_u = complex(randn(size(H_u)),randn(size(H_u)));
nn_v = complex(randn(size(H_u)),randn(size(H_u)));
nn_w = complex(randn(size(H_u)),randn(size(H_u)));
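A quick way to check that a single seed controls all the draws is to reseed and regenerate. In this sketch n = 4096 stands in for size(H_u), and the names first_u and second_u are illustrative only.

```matlab
SEED = 120;                       % any value you choose, e.g. from 1 to 999
n = 4096;                         % stands in for size(H_u) = [4096 1]

rng(SEED, 'twister');
first_u  = complex(randn(n,1), randn(n,1));

rng(SEED, 'twister');             % same seed again
second_u = complex(randn(n,1), randn(n,1));

same_draws = isequal(first_u, second_u);   % true: the sequence is reproduced
```

Because all calls to randn after rng draw from the same global stream, one call to rng at the top is enough to make every later vector reproducible.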