match data sample matlab - matlab

This may sound confusing, but I will try to make it as clear as I can. I have a dataset called fulldata, which is 494021x6.
I run svds (singular value decomposition) on it like so:
%% dimensionality reduction
columns = 6
[U,S,V]=svds(fulldata,columns);
I then randomly select 1000 rows from the fulldata:
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';
%# pick columns in a set order (2,4,5,3,6,1)
indY = [2 4 5 3 6 1];
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
I then apply normalization to these 1000 randomly selected rows:
% apply normalization method to every cell
maxData = max(max(data));
minData = min(min(data));
data = ((data-minData)./(maxData));
I then output a data sample from the original fulldata set that matches the 1000 selected rows:
% output matching data
dataSample = fulldata(indX, :)
Note that when I picked the random rows I also kept indX, the row numbers that were selected from fulldata.
So dataSample holds the 1000 random rows taken from the original fulldata, and indX holds the corresponding row numbers in fulldata.
The problem I run into comes when I use K-Means to cluster the 1000 random rows and output the data of each cluster like so:
%% generate sample data
K = 6;
numObservarations = size(data, 1);
dimensions = 3;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
% output the contents of each cluster
K1 = data(clustIDX==1,:)
K2 = data(clustIDX==2,:)
K3 = data(clustIDX==3,:)
K4 = data(clustIDX==4,:)
K5 = data(clustIDX==5,:)
K6 = data(clustIDX==6,:)
How can I match K1, K2, ... K6 to the corresponding indX row numbers?
I was hoping to have extra files like K1-indX, each just a list of the row numbers from indX that match the cluster data in K1, K2, etc. Or, preferably, append the indX row number to the K1, K2, ... output as column 7.
For instance:
K1 cluster data | Belongs to fulldata row number
0.4 0.5 0.6 0.4 | 456456 etc

An example to illustrate:
%# lets use an example data of size 150x4
load fisheriris
fulldata = meas;
%# pick 100 rows at random
rIdx = randperm(size(fulldata,1));
rIdx = rIdx(1:100)';
data = fulldata(rIdx,:);
%# cluster the subset data
K = 3;
clustIDX = kmeans(data, K);
%# divide the data according to which cluster instances were assigned to
groupedIdx = cell(K,1);
groupedData = cell(K,1);
for i=1:K
%# instances
groupedData{i} = data(clustIDX==i,:);
%# corresponding row indices into the original fulldata
groupedIdx{i} = rIdx(clustIDX==i);
end
%# check: these two should be equal
groupedData{1}(1,:)
fulldata(groupedIdx{1}(1),:)

Unless I am mis-interpreting something above, you already have (in indX) the fulldata row numbers... All you need to do to see, for example, the rows from fulldata in cluster 1 is:
fulldata(indX(clustIDX == 1), :)
kmeans does not re-order the data, so each row 1:1000 of clustIDX still corresponds to the same row 1:1000 of data / datasample that you started with.
Said another way, clustIDX is going to be a vector of length 1000 where each element is the (integer) cluster assignment for that row. Thus you can use this for logical indexing anywhere you have 1000 rows in an order corresponding to the sample data you used for clustering.
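For the "column 7" layout the question prefers, here is a minimal sketch. It assumes indX is a column vector (as in the question) and uses a small random fulldata and a random clustIDX as stand-ins for the real data and the kmeans output, so it runs without the Statistics Toolbox. The names labelled and clusterRows are made up for illustration.

```matlab
% Stand-ins for the question's variables (hypothetical sizes and values)
fulldata = rand(50, 6);                  % small substitute for the 494021x6 data
rows = 10;
indX = randperm(size(fulldata, 1));
indX = indX(1:rows)';                    % sampled row numbers, as a column
data = fulldata(indX, :);                % sampled rows (the SVD step is skipped here)
K = 3;
clustIDX = randi(K, rows, 1);            % substitute for the kmeans assignments

% Append the fulldata row number as an extra last column
labelled = [data, indX];

% Split by cluster; the last column of each block is the fulldata row number
clusterRows = cell(K, 1);
for k = 1:K
    clusterRows{k} = labelled(clustIDX == k, :);
end
```

With the real variables, clusterRows{1} would play the role of K1, with the matching fulldata row number in its last column.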

Related

Plotting sampling result frequency in histogram

I am just starting to learn Matlab.
Case:
Given 3 elements, say 1, 2, and 3, I want to sample 2 elements at random. I want to repeat this 100 times to estimate the probability of each outcome pair.
How can I plot the result as a histogram so that I can see the frequency of each pair? So far, I can do the sampling:
for i=1:100
datasample(1:3,2,'Replace',true)
end
So the possible outcomes are (1,1), (1,2), (2,1), (2,3), etc.
How can I plot the frequency of the outcome using histogram?
Thanks in advance
n = 100;
% generate random data
arr = zeros(n, 2);
for i = 1:n
arr(i, :) = randi([1,3],1,2);
end
% frequency
[ii, jj, kk] = unique(arr, 'rows', 'stable');
f = histc(kk, 1:numel(jj));
result = [ii f];
% plot
cuts = strcat(num2str(result(:,1)), '-',num2str(result(:,2)));
bar(result(:,3))
grid on
xlabel('combination')
ylabel('frequency')
set(gca,'xtick',1:size(result,1),'xticklabel',cellstr(cuts));
set(gca,'XTickLabelRotation',45);
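An alternative sketch, not taken from the answer above: since each draw is a pair of small positive integers, accumarray can tally the pairs directly into a 3x3 table, which also keeps pairs that never occurred (with a count of 0). The names draws, counts and labels are only illustrative.

```matlab
n = 100;
draws = randi([1 3], n, 2);              % n pairs sampled with replacement

% counts(a,b) = number of times the pair (a,b) was drawn
counts = accumarray(draws, 1, [3 3]);

% Labels in the same column-major order as counts(:)
[a, b] = ndgrid(1:3, 1:3);
labels = arrayfun(@(p, q) sprintf('%d-%d', p, q), a(:), b(:), ...
    'UniformOutput', false);

bar(counts(:))
set(gca, 'XTick', 1:9, 'XTickLabel', labels)
xlabel('combination')
ylabel('frequency')
```

Dividing counts by n gives the empirical probability of each pair.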

Matlab matrix accessing columns

Can someone explain what is happening here?
I know that Y(:,1) is the first column of values in Y.
But what does Y(p,1) mean?
Matlab noob
Y = load('testFile.txt'); %Load file
p = randperm(length(Y)); %List of random numbers between 0 and Y size
Y(:,1) = Y(p,1);
Y(:,2) = Y(p,2);
Y(:,3) = Y(p,4);
Y = load('testFile.txt'); %Load file
% create a mapping to shuffle the rows of Y.
% This assumes more rows than columns in Y, otherwise the code doesn't make sense.
% This is because length(Y) returns the length of the largest dimension.
% A more robust implementation would use size(Y,1) rather than length(Y).
p = randperm(length(Y)); %Random permutation of the integers 1..length(Y), used as row indices
% Rearrange the rows of column 1 based on the shuffled order
Y(:,1) = Y(p,1);
% Rearrange the rows of column 2 based on the shuffled order
Y(:,2) = Y(p,2);
% Set the third column to the shuffled fourth column.
Y(:,3) = Y(p,4);
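To see what indexing with a vector of row numbers does, here is a small sketch with a fixed p instead of a random one. The matrix and the names firstCol and shuffled are made up for illustration.

```matlab
Y = [10 20; 30 40; 50 60];   % 3 rows, 2 columns
p = [3 1 2];                 % a fixed "permutation" for illustration

firstCol = Y(p, 1);          % rows 3, 1, 2 of column 1
shuffled = Y(p, :);          % whole rows reordered, columns stay aligned
```

Here firstCol is [50; 10; 30] and shuffled is [50 60; 10 20; 30 40]. Because the code above uses the same p for every column, the rows stay matched across columns.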

Count the number of unique values for each column of a submatrix in a fast manner

I have a matrix X with tens of rows and thousands of columns. All elements are categorical and have been converted to an index matrix. For example, the ith column X(:,i) = [-1,-1,0,2,1,2]' is converted to X2(:,i) = ic, where [x,ia,ic] = unique(X(:,i)), so that accumarray can be used conveniently. I randomly select a submatrix from the matrix and count the number of unique values in each column of the submatrix, and I repeat this procedure 10,000 times. I know several methods for counting the unique values in a column; the fastest way I have found so far is shown below:
mx = max(X);
for iter = 1:numperm
for j = 1:ny
ky = yrand(:,iter)==uy(j);
% select submatrix from X where all rows correspond to rows in y that y equals to uy(j)
Xk = X(ky,:);
% specify the sites where to put the number of each unique value
mxj = mx*(j-1);
mxi = mxj+1;
mxk = max(Xk)+mxj;
% iteration to count number of unique values in each column of the submatrix
for i = 1:c
pxs(mxi(i):mxk(i),i) = accumarray(Xk(:,i),1);
end
end
end
This is a random permutation test for calculating the information gain between a data matrix X of size n by c and a categorical variable y, in which y is randomly permuted. In the code above, all the randomly permuted copies of y are stored in the matrix yrand, and the number of permutations is numperm. The unique values of y are stored in uy and their number is ny. In each iteration of 1:numperm, the submatrix Xk is selected according to each unique element of y, and the unique elements in each column of this submatrix are counted and stored in the matrix pxs.
The most time-consuming part of the code above is the loop over i = 1:c when c is large.
Is it possible to apply accumarray in a matrix manner to avoid the for loop? How else can I improve the code above?
-------
As requested, a simplified test function including above codes is provided as
%% test
function test(x,y)
[r,c] = size(x);
x2 = x;
numperm = 1000;
% convert the original matrix to index matrix for suitable and fast use of accumarray function
for i = 1:c
[~,~,ic] = unique(x(:,i));
x2(:,i) = ic;
end
% get 'numperm' rand permutations of y
yrand(r, numperm) = 0;
for i = 1:numperm
yrand(:,i) = y(randperm(r));
end
% get statistic of y
uy = unique(y);
nuy = numel(uy);
% main iterations
mx = max(x2);
pxs(max(mx),c) = 0;
for iter = 1:numperm
for j = 1:nuy
ky = yrand(:,iter)==uy(j);
xk = x2(ky,:);
mxj = mx*(j-1);
mxk = max(xk)+mxj;
mxi = mxj+1;
for i = 1:c
pxs(mxi(i):mxk(i),i) = accumarray(xk(:,i),1);
end
end
end
And a test data
x = round(randn(60,3000));
y = [ones(30,1);ones(30,1)*-1];
Test the function
tic; test(x,y); toc
This returns "Elapsed time is 15.391628 seconds." on my computer. The test function uses 1000 permutations, so if I perform 10,000 permutations and do some additional computations (negligible compared to the code above), I expect it to take more than 150 s. I wonder whether the code can be improved. Intuitively, performing accumarray in a matrix manner could save a lot of time. Can I do that?
The approach suggested by @rahnema1 significantly sped up the calculation, so I am posting my answer here, as also requested by @Dev-iL.
%% test
function test(x,y)
[r,c] = size(x);
x2 = x;
numperm = 1000;
% convert the original matrix to index matrix for suitable and fast use of accumarray function
for i = 1:c
[~,~,ic] = unique(x(:,i));
x2(:,i) = ic;
end
% get 'numperm' rand permutations of y
yrand(r, numperm) = 0;
for i = 1:numperm
yrand(:,i) = y(randperm(r));
end
% get statistic of y
uy = unique(y);
nuy = numel(uy);
% main iterations
mx = max(max(x2));
% preallocation
pxs(mx*nuy,c) = 0;
% set the edges of the bin for function histc
binrg = (1:mx)';
% preallocation of the range of matrix into which the results will be stored
mxr = mx*(0:nuy);
for iter = 1:numperm
yt = yrand(:,iter);
for j = 1:nuy
pxs(mxr(j)+1:mxr(j+1),:) = histc(x2(yt==uy(j),:),binrg);
end
end
Test results:
>> x = round(randn(60,3000));
>> y = [ones(30,1);ones(30,1)*-1];
>> tic; test(x,y); toc
Elapsed time is 15.632962 seconds.
>> tic; test(x,y); toc % using the way suggested by rahnema1, i.e., revised function posted above
Elapsed time is 2.900463 seconds.
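On the original question of whether accumarray can be applied "in a matrix manner": one possible sketch, separate from the histc approach above and not benchmarked here, is to pass two-column subscripts so that every (value, column) pair is counted in a single call. The toy matrix and the names colIdx, counts and nUnique are assumptions for illustration.

```matlab
xk = [1 2; 2 2; 1 3; 3 1];               % toy index matrix, entries in 1..mx
[r, c] = size(xk);
mx = max(xk(:));

% Pair every entry with its column number, then count (value, column) pairs
colIdx = repmat(1:c, r, 1);
counts = accumarray([xk(:), colIdx(:)], 1, [mx, c]);

% Number of distinct values present in each column
nUnique = sum(counts > 0, 1);
```

For this toy input counts is [2 1; 1 2; 1 1] and nUnique is [3 3]. Whether this beats histc for a 60x3000 matrix would need timing on the real data.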

Plot a matrix in graph with two axis in matlab

I need to plot an NxN matrix 'M' that is mostly zeros, showing only the entries where M(x,y) is different from 0.
t_max = 10; % set the maximum number of iterations
n = 10; % dimension n*n
d = 1; % the probability of changing place
x = randi([1 n]); % random row
y = randi([1 n]); % random column
grid = zeros(10); % set an empty grid n*n
grid(x,y) = 1; % put an agent in a random place
for t=1:t_max
newgrid = randomwalk1(grid,d); % call the function random walk for one agent
end
I tried image(m), but the result is not satisfactory because I also need to keep track of the element that is different from 0, and hold on doesn't work in this case.
You are looking for the spy() function. Just type spy(m) and see what happens.
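A minimal sketch of that suggestion, assuming a 10x10 matrix with a single nonzero entry standing in for the agent. It also uses find to recover the position, in case the coordinates need to be tracked from step to step. The names rowIdx and colIdx are illustrative.

```matlab
n = 10;
m = zeros(n);
m(3, 7) = 1;                 % a single nonzero entry, like the agent

[rowIdx, colIdx] = find(m);  % coordinates of all nonzero entries

figure
spy(m)                       % marks nonzero entries, row axis pointing down
title(sprintf('agent at (%d,%d)', rowIdx, colIdx))
```

Calling spy(newgrid) inside the loop, followed by drawnow, would redraw the position at each step.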

Show rows on clustered kmeans data

Hi, I was wondering: when clustered data is shown in a figure window, is there a way to show which row each data point belongs to when you hover over it?
I was hoping that if I select or hover over a point, I could tell which row it belongs to.
Here is the code:
%% dimensionality reduction
columns = 6
[U,S,V]=svds(fulldata,columns);
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows);
%# pick random columns
indY = randperm( size(fulldata,2) );
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
%% apply normalization method to every cell
data = data./repmat(sqrt(sum(data.^2)),size(data,1),1);
%% generate sample data
K = 6;
numObservarations = 1000;
dimensions = 6;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
%d = sum((x-y).^2).^0.5
D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find for all instances the cluster closest to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
Alternatively, is there a way to output the clustered data, normalized and re-organized into its original format, with an extra final column giving the row of the original "fulldata" that each point belongs to?
You could use the data cursors feature which displays a tooltip when you select a point from the plot. You can use a modified update function to display all sorts of information about the point selected.
Here is a working example:
function customCusrorModeDemo()
%# data
D = load('fisheriris');
data = D.meas;
[clustIdx,labels] = grp2idx(D.species);
K = numel(labels);
clr = hsv(K);
%# instance indices grouped according to class
ind = accumarray(clustIdx, 1:size(data,1), [K 1], @(x){x});
%# plot
%#gscatter(data(:,1), data(:,2), clustIdx, clr)
hLine = zeros(K,1);
for k=1:K
hLine(k) = line(data(ind{k},1), data(ind{k},2), data(ind{k},3), ...
'LineStyle','none', 'Color',clr(k,:), ...
'Marker','.', 'MarkerSize',15);
end
xlabel('SL'), ylabel('SW'), zlabel('PL')
legend(hLine, labels)
view(3), box on, grid on
%# data cursor
hDCM = datacursormode(gcf);
set(hDCM, 'UpdateFcn',@updateFcn, 'DisplayStyle','window')
set(hDCM, 'Enable','on')
%# callback function
function txt = updateFcn(~,evt)
hObj = get(evt,'Target'); %# line object handle
idx = get(evt,'DataIndex'); %# index of nearest point
%# class index of data point
cIdx = find(hLine==hObj, 1, 'first');
%# instance index (index into the entire data matrix)
idx = ind{cIdx}(idx);
%# output text
txt = {
sprintf('SL: %g', data(idx,1)) ;
sprintf('SW: %g', data(idx,2)) ;
sprintf('PL: %g', data(idx,3)) ;
sprintf('PW: %g', data(idx,4)) ;
sprintf('Index: %d', idx) ;
sprintf('Class: %s', labels{clustIdx(idx)}) ;
};
end
end
This works in both 2D and 3D views, and with different display styles.