Irregular plot of k-means clustering, outlier removal - MATLAB
Hi, I'm working on clustering network data from the 1999 DARPA data set. Unfortunately I'm not really getting well-clustered data compared to some of the literature, despite using the same techniques and methods.
My data comes out like this:
As you can see, it is not very clustered. This is due to a lot of outliers (noise) in the dataset. I have looked at some outlier removal techniques, but nothing I have tried so far really cleans the data. One of the methods I have tried:
%% When an outlier is considered to be more than three standard deviations away from the mean, determine the number of outliers in each column of the count matrix:
mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 3*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];
On the first run, it removed 48 rows from the 1000 normalized rows that were randomly selected from the full dataset.
This is the full script I used on the data:
%% load data
%# read the list of features
fid = fopen('kddcup.names','rt');
C = textscan(fid, '%s %s', 'Delimiter',':', 'HeaderLines',1);
fclose(fid);
%# determine type of features
C{2} = regexprep(C{2}, '.$',''); %# remove "." at the end
attribNom = [ismember(C{2},'symbolic');true]; %# nominal features
%# build format string used to read/parse the actual data
frmt = cell(1,numel(C{1}));
frmt( ismember(C{2},'continuous') ) = {'%f'}; %# numeric features: read as number
frmt( ismember(C{2},'symbolic') ) = {'%s'}; %# nominal features: read as string
frmt = [frmt{:}];
frmt = [frmt '%s']; %# add the class attribute
%# read dataset
fid = fopen('kddcup.data_10_percent_corrected','rt');
C = textscan(fid, frmt, 'Delimiter',',');
fclose(fid);
%# convert nominal attributes to numeric
ind = find(attribNom);
G = cell(numel(ind),1);
for i=1:numel(ind)
[C{ind(i)},G{i}] = grp2idx( C{ind(i)} );
end
%# all numeric dataset
fulldata = cell2mat(C);
%% dimensionality reduction
columns = 6;
[U,S,V]=svds(fulldata,columns);
%% randomly select dataset
rows = 1000;
columns = 6;
%# pick random rows
indX = randperm( size(fulldata,1) );
indX = indX(1:rows)';
%# pick random columns (U from svds has exactly `columns` columns, so this just permutes them)
indY = randperm( size(U,2) );
indY = indY(1:columns);
%# filter data
data = U(indX,indY);
% apply min-max normalization to every cell
maxData = max(max(data));
minData = min(min(data));
data = (data - minData) ./ (maxData - minData);
% output matching data
dataSample = fulldata(indX, :)
%% When an outlier is considered to be more than 2.5 standard deviations away from the mean, determine the number of outliers in each column:
mu = mean(data)
sigma = std(data)
[n,p] = size(data);
% Create a matrix of mean values by replicating the mu vector for n rows
MeanMat = repmat(mu,n,1);
% Create a matrix of standard deviation values by replicating the sigma vector for n rows
SigmaMat = repmat(sigma,n,1);
% Create a matrix of zeros and ones, where ones indicate the location of outliers
outliers = abs(data - MeanMat) > 2.5*SigmaMat;
% Calculate the number of outliers in each column
nout = sum(outliers)
% To remove an entire row of data containing the outlier
data(any(outliers,2),:) = [];
%% generate sample data
K = 6;
numObservations = size(data, 1);
dimensions = 3;
%% cluster
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 5, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
grid on
view([90 0]);
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
These are two distinct clusters from the output:
As you can see, the data looks cleaner and more clustered than the original. However, I still think a better method can be used.
For instance, looking at the overall clustering, I still have a lot of noise (outliers) in the dataset, as can be seen here:
I need the outlier rows put into a separate dataset for later classification (only removed from the clustering).
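For reference, a minimal way to set the flagged rows aside instead of discarding them (a sketch using the outliers mask computed in the script, not code from the original post):
outlierRows = any(outliers, 2);        % logical index of rows containing any outlier
outlierData = data(outlierRows, :);    % kept separately for later classification
data(outlierRows, :) = [];             % removed from the clustering input only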
Here is a link to the DARPA dataset. Please note that the 10% data set has had a significant reduction in columns: the majority of columns, which have 0s or 1s running throughout, have been removed (42 columns down to 6 columns):
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
EDIT
Columns kept in the dataset are:
src_bytes: continuous.
dst_bytes: continuous.
count: continuous.
srv_count: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.
RE-EDIT:
Based on discussions with Anony-Mousse and his answer, there may be a way of reducing noise in the clustering using K-medoids (http://en.wikipedia.org/wiki/K-medoids). I'm hoping there isn't much of a change needed in the code I currently have, but as of yet I do not know how to implement it to test whether this will significantly reduce the noise. So, provided that someone can show me a working example, this will be accepted as an answer.
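For reference, a minimal sketch of how the kmeans call in the script above might be swapped for k-medoids, assuming the Statistics and Machine Learning Toolbox kmedoids function (R2014b or later) is available; this is untested on this dataset and only illustrates the call:
K = 6;
opts = statset('MaxIter', 100, 'Display', 'iter');
[clustIDX, medoids] = kmedoids(data, K, 'Options', opts, ...
    'Distance', 'cityblock', 'Replicates', 3);   % L1 distance is more robust to outliers than squared Euclidean
scatter3(data(:,1), data(:,2), data(:,3), 5, clustIDX, 'filled')
hold on
scatter3(medoids(:,1), medoids(:,2), medoids(:,3), 100, (1:K)', 'filled')
hold off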
Note that using this dataset is discouraged:
That dataset has known errors: see "KDD Cup '99 dataset (Network Intrusion) considered harmful".
Consider using a different algorithm. k-means is not really appropriate for mixed-type data, where many attributes are discrete and have very different scales. k-means needs to be able to compute sensible means, and for a binary attribute "0.5" is not a sensible mean; it should be either 0 or 1.
Plus, k-means doesn't like outliers too much.
When plotting, make sure to scale the axes equally, or the result will look incorrect. Your x-axis has a length of around 0.9, your y-axis only 0.2 - no wonder the clusters look squashed.
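For example (a minimal sketch, not from the original answer), after the scatter3 calls in the question's script the axes could be forced to a common scale with:
axis equal        % same data units along x, y and z, so clusters are not visually squashed
daspect([1 1 1])  % or set the data aspect ratio explicitly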
Overall, maybe the data set just doesn't have k-means-style clusters? You definitely should try a density-based method (because these can deal with outliers), such as DBSCAN. But judging from the visualizations you added, I'd say it has at most 4-5 clusters, and they are not really interesting. They probably can be captured with a number of thresholds in some dimensions.
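As a rough sketch of what that could look like, assuming MATLAB R2019a or later, where the Statistics and Machine Learning Toolbox provides a dbscan function (the epsilon and minpts values below are placeholders to tune, not recommendations for this dataset):
epsilon = 0.05;   % neighborhood radius (placeholder)
minpts  = 10;     % minimum neighbors for a core point (placeholder)
labels  = dbscan(data, epsilon, minpts);   % dbscan labels noise points as -1
noiseRows = (labels == -1);                % outliers, which can be set aside for later classification
scatter3(data(~noiseRows,1), data(~noiseRows,2), data(~noiseRows,3), 5, labels(~noiseRows), 'filled')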
Here is a visualization of the data set after z-normalization, visualized in parallel coordinates, with 5000 samples. Bright green is normal.
You can clearly see special properties of the data set. All of the attacks are clearly different in attributes 3 and 4 (count and srv_count), and most are also very concentrated in dst_host_count and dst_host_srv_count.
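A rough sketch of how such a view can be reproduced in MATLAB (assuming the Statistics and Machine Learning Toolbox; X and labels below are placeholders for the six numeric attributes and the class column, which are not named this way in the question's script):
Xz = zscore(X);                         % z-normalize each attribute (zero mean, unit variance)
parallelcoords(Xz, 'Group', labels);    % one line per connection, colored by class
xlabel('attribute'); ylabel('z-score');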
I've run OPTICS on this data set, too. It found a number of clusters, most of them in the wine-colored attack pattern. But they're not really interesting. If you have 10 different hosts ping-flooding, they will form 10 clusters.
You can see very well that OPTICS managed to cluster a number of these attacks. It missed all the orange stuff (it is quite spread out; maybe it would have been found if I had set minpts lower), but it even discovered *structure* within the wine-colored attack, breaking it into a number of separate events.
To really make sense of this data set, you should start with feature extraction, for example by merging such ping flood connection attempts into an aggregate event.
Also note that this is an unrealistic scenario.
There are well-known patterns involved in attacks, in particular port scans. These are best detected with a specialized port scan detector, not with learning.
The simulated data has a lot of completely pointless "attacks" in it. For example, the Smurf attack from the '90s makes up >50% of the data set, and Syn flood is another 20%, while normal traffic is <20%!
For these kind of attacks, there are well-known signatures.
Many modern attacks (SQL injection, for example) flow with usual HTTP traffic and will not show up as anomalous in the raw traffic pattern.
Just don't use this data for classification or outlier detection. Just don't.
Quoting the KDNuggets link above:
As a result, we strongly recommend that
(1) all researchers stop using the KDD Cup '99 dataset,
(2) The KDD Cup and UCI websites include a warning on the KDD Cup '99 dataset webpage informing researchers that there are known problems with the dataset, and
(3) peer reviewers for conferences and journals ding papers (or even outright reject them, as is common in the network security community) with results drawn solely from the KDD Cup '99 dataset.
This is neither real nor realistic data. Go get something else.
First things first: you're asking for a lot here. For future reference: try to break up your problem in smaller chunks, and post several questions. This increases your chances of getting answers (and doesn't cost you 400 reputation!).
Luckily for you, I understand your predicament, and just love this sort of question!
Apart from this dataset's possible issues with k-means, this question is still generic enough to apply also to other datasets (and thus Googlers ending up here looking for a similar thing), so let's go ahead and get this solved.
My suggestion is we edit this answer until you get reasonably satisfactory results.
Number of clusters
Step 1 of any clustering problem: how many clusters to choose? There are a few methods I know of with which you can select the proper number of clusters. There is a nice wiki page about this, containing all of the methods below (and a few more).
Visual inspection
It might seem silly, but if you have well-separated data, a simple plot can tell you already (approximately) how many clusters you'll need, just by looking.
Pros:
quick
simple
works well on well-separated clusters in relatively small datasets
Cons:
and dirty (the flip side of "quick": results are only approximate)
requires user interaction
it's easy to miss smaller clusters
data with less well-separated clusters, or very many of them, is hard to handle with this method
it is all rather subjective -- the next person might select a different number than you did.
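As a quick illustration of the visual-inspection route (a sketch, assuming data is the n-by-d numeric matrix from the question):
plotmatrix(data);    % pairwise scatter plots: look for visible lumps
figure
scatter3(data(:,1), data(:,2), data(:,3), 5, 'filled')   % or eyeball the first three dimensions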
silhouettes plot
As indicated in one of your other questions, making a silhouettes plot will help you make a better decision about the proper number of clusters in your data.
Pros:
relatively simple
reduces subjectivity by using statistical measures
intuitive way to represent quality of the choice
Cons:
requires user interaction
In the limit, if you take as many clusters as there are datapoints, a silhouettes plot will tell you that that is the best choice
it is still rather subjective, not based on statistical means
can be computationally expensive
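Concretely, a minimal sketch of silhouette-based selection, looping over candidate values of k and keeping the one with the highest mean silhouette (data is the matrix from the question; the range of k is arbitrary):
kRange = 2:10;
meanSilh = zeros(size(kRange));
for j = 1:numel(kRange)
    idx = kmeans(data, kRange(j), 'Replicates', 3, 'EmptyAction', 'singleton');
    meanSilh(j) = mean(silhouette(data, idx));
end
[~, best] = max(meanSilh);
bestK = kRange(best)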
elbow method
As with the silhouettes plot approach, you run kmeans repeatedly, each time with a larger number of clusters, and you see how much of the total variance in the data is explained by the clusters chosen by this kmeans run. There will be a number of clusters where the amount of explained variance suddenly increases a lot less than in any of the previous choices of the number of clusters (the "elbow"). The elbow is then, statistically speaking, the best choice for the number of clusters.
Pros:
no user interaction required -- the elbow can be selected automatically
statistically more sound than any of the aforementioned methods
Cons:
somewhat complicated
still subjective, since the definition of the "elbow" depends on subjectively chosen parameters
can be computationally expensive
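A minimal sketch of the elbow computation described above, using the within-cluster sums of squares returned by kmeans (the automatic elbow pick at the end is a simple heuristic, not a definitive rule):
kRange = 1:10;
totalWithin = zeros(size(kRange));
for j = 1:numel(kRange)
    [~, ~, sumd] = kmeans(data, kRange(j), 'Replicates', 3, 'EmptyAction', 'singleton');
    totalWithin(j) = sum(sumd);                 % total within-cluster sum of squares for this k
end
totalVar = sum(sum(bsxfun(@minus, data, mean(data)).^2));
explained = 1 - totalWithin/totalVar;           % fraction of variance explained for each k
figure, plot(kRange, explained, '-o'), xlabel('k'), ylabel('variance explained')
[~, elbow] = max(-diff(diff(explained)));       % crude elbow: largest drop in marginal gain
bestK = kRange(elbow + 1)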
Outliers
Once you have chosen the number of clusters with any of the methods above, it is time to do outlier detection to see if the quality of your clusters improves.
I would start by a two-step-iterative approach, using the elbow method. In pseudo-Matlab:
data = your initial dataset
dataMod = your initial dataset
MAX = the number of clusters chosen by visual inspection
while (forever)
for N = MAX-5 : MAX+5
if (N < 1), continue, end
perform k-means with N clusters on dataMod
if (variance explained shows a jump)
break
end
if (you are satisfied)
break
end
for i = 1:N
extract all points from cluster i
find the centroid (let k-means do that)
calculate the standard deviation of distances to the centroid
mark points further than 3 sigma as possible outliers
end
dataMod = data with marked points removed
end
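For concreteness, a minimal MATLAB sketch of one pass of the inner marking step (per-cluster 3-sigma on distances to the centroid; dataMod is the current working copy and N the current cluster count -- a sketch of the idea, not a drop-in for the whole loop above):
[cIdx, ~, ~, D] = kmeans(dataMod, N, 'Replicates', 3, 'EmptyAction', 'singleton');
isOutlier = false(size(dataMod, 1), 1);
for i = 1:N
    inClust = find(cIdx == i);
    dists = sqrt(D(inClust, i));             % Euclidean distances of cluster members to their centroid
    thresh = mean(dists) + 3*std(dists);     % 3-sigma rule on within-cluster distances
    isOutlier(inClust(dists > thresh)) = true;
end
dataMod(isOutlier, :) = [];                  % drop the marked points before the next pass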
The tough part is obviously determining whether you are satisfied.
This is the key to the algorithm's effectiveness. The rough structure of
this part
if (you are satisfied)
break
end
would be something like this
if (situation has improved)
data = dataMod
elseif (situation is same or worse)
dataMod = data
break
end
The situation has improved when there are fewer outliers, or when the variance explained for ALL choices of N is better than during the previous iteration of the while loop. This is also something to fiddle with.
Anyway, I wouldn't call this much more than a first attempt.
If anyone sees incompleteness, flaws, or loopholes here, please comment or edit.
Related
K-Means on temporal dataset
I have a temporal dataset (1000000 x 70) consisting of info about the activities of 20 subjects. I need to apply subsampling to the dataset as it has more than a million rows. How do I ideally select a set of observations for each subject from it? Later, I need to apply PCA and K-means on it. Kindly help me with the steps to be followed. I'm working in MATLAB.
I'm not really clear on what you're looking for. If you just want to subsample a matrix in MATLAB, here is a way to do it:
myData; % 70 x 1000000 data
nbDataPts = size(myData, 2); % Get the number of points in the data
subsampleRatio = 0.1; % Ratio of data you want to keep
nbSamples = round(subsampleRatio * nbDataPts); % How many points to keep
sampleIdx = round(linspace(1, nbDataPts, nbSamples)); % Evenly spaced indices of the points to keep
sampledData = myData(:, sampleIdx); % Sample the data
Then, if you want to apply PCA and K-means, I suggest you take a look at the relevant documentation: PCA, K-means. Try to work with it, and open a new question if a specific problem arises.
Principal component analysis in matlab?
I have a training set of size (size(X_Training) = 122 x 125937). 122 is the number of features and 125937 is the sample size. From my limited understanding, PCA is useful when you want to reduce the dimension of the features, meaning I should reduce 122 to a smaller number. But when I use, in MATLAB:
X_new = pca(X_Training)
I get a matrix of size 125973x121. I am really confused, because this changes not only the features but also the sample size? This is a big problem for me, because I still have the target vector Y_Training that I want to use for my neural network. Any help? Did I misinterpret the results? I only want to reduce the number of features.
Firstly, the documentation of the PCA function is useful: https://www.mathworks.com/help/stats/pca.html. It mentions that the rows are the samples while the columns are the features. This means you need to transpose your matrix first. Secondly, you need to specify the number of dimensions to reduce to a priori. The PCA function does not do that for you automatically. Therefore, in addition to extracting the principal coefficients for each component, you also need to extract the scores. Once you have this, you simply subset into the scores and perform the reprojection into the reduced space. In other words:
n_components = 10; % Change to however you see fit.
[coeff, score] = pca(X_training.');
X_reduce = score(:, 1:n_components);
X_reduce will be the dimensionality-reduced feature set, with the total number of columns being the total number of reduced features. Also notice that the number of training examples does not change, as we expect. If you want to make sure that the features are along the rows instead of the columns after we reduce the number of features, transpose this output matrix as well before you proceed. Finally, if you want to automatically determine the number of features to reduce to, one method is to calculate the variance explained of each feature, then accumulate the values from the first feature up to the point where we exceed some threshold. Usually 95% is used. Therefore, you need to provide additional output variables to capture these:
[coeff, score, latent, tsquared, explained, mu] = pca(X_training.');
I'll let you go through the documentation to understand the other variables, but the one you're looking at is the explained variable. What you should do is find the point where the total variance explained exceeds 95%:
[~, n_components] = max(cumsum(explained) >= 95);
Finally, if you want to perform a reconstruction and see how well the reconstruction into the original feature space performs from the reduced features, you need to perform a reprojection into the original space:
X_reconstruct = bsxfun(@plus, score(:, 1:n_components) * coeff(:, 1:n_components).', mu);
mu are the means of each feature as a row vector. Therefore you need to add this vector across all examples, so broadcasting is required, and that's why bsxfun is used. If you're using MATLAB R2016b or later, this is now implicitly done when you use the addition operation:
X_reconstruct = score(:, 1:n_components) * coeff(:, 1:n_components).' + mu;
Clustering an image using Gaussian mixture models
I want to use GMM (Gaussian mixture models) for clustering a binary image, and I also want to plot the cluster centroids on the binary image itself. I am using this as my reference: http://in.mathworks.com/help/stats/gaussian-mixture-models.html
This is my initial code:
I=im2double(imread('sil10001.pbm'));
K = I(:);
mu=mean(K);
sigma=std(K);
P=normpdf(K, mu, sigma);
Z = norminv(P,mu,sigma);
X = mvnrnd(mu,sigma,1110);
X=reshape(X,111,10);
scatter(X(:,1),X(:,2),10,'ko');
options = statset('Display','final');
gm = fitgmdist(X,2,'Options',options);
idx = cluster(gm,X);
cluster1 = (idx == 1);
cluster2 = (idx == 2);
scatter(X(cluster1,1),X(cluster1,2),10,'r+');
hold on
scatter(X(cluster2,1),X(cluster2,2),10,'bo');
hold off
legend('Cluster 1','Cluster 2','Location','NW')
P = posterior(gm,X);
scatter(X(cluster1,1),X(cluster1,2),10,P(cluster1,1),'+')
hold on
scatter(X(cluster2,1),X(cluster2,2),10,P(cluster2,1),'o')
hold off
legend('Cluster 1','Cluster 2','Location','NW')
clrmap = jet(80);
colormap(clrmap(9:72,:))
ylabel(colorbar,'Component 1 Posterior Probability')
But the problem is that I am unable to plot the cluster centroids received from GMM on the original binary image. How do I do this?
Now suppose I have 10 such images in a sequence, and I want to store the information of their mean positions in two cell arrays; how do I do that? This is my code for my new question:
images=load('gait2go.mat'); %load the matrix file
for i=1:10
    I{i}=images.result{i};
    I{i}=im2double(I{i});
    %determine 'white' pixels, size of image can be [M N], [M N 3] or [M N 4]
    Idims=size(I{i});
    whites=true(Idims(1),Idims(2));
    df=I{i};
    %we add up the various color channels
    for colori=1:size(df,3)
        whites=whites & df(:,:,colori)>0.5;
    end
    %choose indices of 'white' pixels as coordinates of data
    [datax datay]=find(whites);
    %cluster data into 10 clumps
    K = 10; % number of mixtures/clusters
    cInd = kmeans([datax datay], K, 'EmptyAction','singleton',...
        'maxiter',1000,'start','cluster');
    %get clusterwise means
    meanx=zeros(K,1);
    meany=zeros(K,1);
    for i=1:K
        meanx(i)=mean(datax(cInd==i));
        meany(i)=mean(datay(cInd==i));
    end
    xc{i}=meanx(i); %cell array containing the position of the mean for the 10 images
    xb{i}=meany(i);
    figure;
    gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
    axis equal;
    hold on;
    scatter(meany,-meanx,20,'+'); %same funky coordinates
end
I am able to get 10 images segmented, but not the values of the mean stored in the cell arrays xc and xb. They are only storing [] in place of the values of the means.
I decided to post an answer to your question (where your question was determined by a maximum-likelihood guess :P), but I wrote an extensive introduction. Please read carefully, as I think you have difficulties understanding the methods you want to use, and you have difficulties understanding why others can't help you with your usual approach of asking questions. There are several problems with your question, both code-related and conceptual. Let's start with the latter.
The problem with the problem
You say that you want to cluster your image with Gaussian mixture modelling. While I'm generally not familiar with clustering, after a look through your reference and the wonderful SO answer you cited elsewhere (and a quick 101 from @rayryeng) I think you are on the wrong track altogether.
Gaussian mixture modelling, as its name suggests, models your data set with a mixture of Gaussian (i.e. normal) distributions. The reason for the popularity of this method is that when you do measurements of all sorts of quantities, in many cases you will find that your data is mostly distributed like a normal distribution (which is actually the reason why it's called normal). The reason behind this is the central limit theorem, which implies that the sum of reasonably independent random variables tends to be normal in many cases.
Now, clustering, on the other hand, simply means separating your data set into disjoint smaller bunches based on some criteria. The main criterion is usually (some kind of) distance, so you want to find "close lumps of data" in your larger data set. You usually need to cluster your data before performing a GMM, because it's already hard enough to find the Gaussians underlying your data without having to guess the clusters too. I'm not familiar enough with the procedures involved to tell how well GMM algorithms can work if you just let them work on your raw data (but I expect that many implementations start with a clustering step anyway).
To get closer to your question: I guess you want to do some kind of image recognition. Looking at the picture, you want to get more strongly correlated lumps. This is clustering. If you look at a picture of a zoo, you'll see, say, an elephant and a snake. Both have their distinct shapes, and they are well separated from one another. If you cluster your image (and the snake is not riding the elephant, neither did it eat it), you'll find two lumps: one lump elephant-shaped, and one lump snake-shaped. Now, it wouldn't make sense to use GMM on these data sets: elephants, and especially snakes, are not shaped like multivariate Gaussian distributions. But you don't need this in the first place, if you just want to know where the distinct animals are located in your picture.
Still staying with the example, you should make sure that you cluster your data into an appropriate number of subsets. If you try to cluster your zoo picture into 3 clusters, you might get a second, spurious snake: the nose of the elephant. With an increasing number of clusters your partitioning might make less and less sense.
Your approach
Your code doesn't give you anything reasonable, and there's a very good reason for that: it doesn't make sense from the start. Look at the beginning:
I=im2double(imread('sil10001.pbm'));
K = I(:);
mu=mean(K);
sigma=std(K);
X = mvnrnd(mu,sigma,1110);
X=reshape(X,111,10);
You read your binary image, convert it to double, then stretch it out into a vector and compute the mean and deviation of that vector.
You basically smear your entire image into 2 values: an average intensity and a deviation. And THEN you generate 111*10 standard normal points with these parameters, and try to do GMM on the first two sets of 111. Which are both independently normal with the same parameters. So you probably get two overlapping Gaussians around the same mean with the same deviation.
I think the examples you found online confused you. When you do GMM, you already have your data, so no pseudo-normal numbers should be involved. But when people post examples, they also try to provide reproducible inputs (well, some of them do, nudge nudge wink wink). A simple method for this is to generate a union of simple Gaussians, which can then be fed into GMM.
So, my point is that you don't have to generate random numbers, but have to use the image data itself as input to your procedure. And you probably just want to cluster your image, instead of actually using GMM to draw potatoes over your clusters, since you want to cluster body parts in an image of a human. Most body parts are not shaped like multivariate Gaussians (with a few distinct exceptions for men and women).
What I think you should do
If you really want to cluster your image, like in the figure you added to your question, then you should use a method like k-means. But then again, you already have a program that does that, don't you? So I don't really think I can answer the question saying "How can I cluster my image with GMM?". Instead, here's an answer to "How can I cluster my image?" with k-means, but at least there will be a piece of code here.
%set infile to what your image file will be
infile='sil10001.pbm';
%read file
I=im2double(imread(infile));
%determine 'white' pixels, size of image can be [M N], [M N 3] or [M N 4]
Idims=size(I);
whites=true(Idims(1),Idims(2));
%we add up the various color channels
for colori=1:Idims(3)
    whites=whites & I(:,:,colori)>0.5;
end
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(whites);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
cInd = kmeans([datax datay], K, 'EmptyAction','singleton',...
    'maxiter',1000,'start','cluster');
%get clusterwise means
meanx=zeros(K,1);
meany=zeros(K,1);
for i=1:K
    meanx(i)=mean(datax(cInd==i));
    meany(i)=mean(datay(cInd==i));
end
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
axis equal;
hold on;
scatter(meany,-meanx,20,'ko'); %same funky coordinates
Here's what this does. It first reads your image as double, like yours did. Then it tries to determine "white" pixels by checking that each color channel (of which there can be either 1, 3 or 4) is brighter than 0.5. Then your input data points to the clustering will be the x and y "coordinates" (i.e. indices) of your white pixels. Next it does the clustering via kmeans. This part of the code is loosely based on the already cited answer of Amro. I had to set a large maximal number of iterations, as the problem is ill-posed in the sense that there aren't 10 clear clusters in the picture. Then we compute the mean for each cluster, and plot the clusters with gscatter, and the means with scatter. Note that in order to have the picture facing in the right direction in a scatter plot, you have to shift around the input coordinates. Alternatively you could define datax and datay correspondingly at the beginning.
And here's my output, run with the already processed figure you provided in your question:
I do believe you must have made a naive mistake in the plot, and that's why you see just a straight line: you are plotting only the x values. In my opinion, the second argument in the scatter command should be X(cluster1,2) or X(cluster2,2), depending on which scatter command is being used in the code.
The code can be made simpler:
%read file
I=im2double(imread('sil10340.pbm'));
%choose indices of 'white' pixels as coordinates of data
[datax datay]=find(I);
%cluster data into 10 clumps
K = 10; % number of mixtures/clusters
[cInd, c] = kmeans([datax datay], K, 'EmptyAction','singleton',...
    'maxiter',1000,'start','cluster');
figure;
gscatter(datay,-datax,cInd); %funky coordinates for plotting according to image
axis equal;
hold on;
scatter(c(:,2),-c(:,1),20,'ko'); %same funky coordinates
I don't think there is any need for the looping, as c itself returns a 10x2 double array which contains the positions of the means.
Measuring the entropy of a transition probability matrix in matlab
I'm working on a project which requires analyzing certain graph properties of transition probability matrices, which are constructed as weighted directed graphs. One of the properties of interest is the entropy of these graphs, which I have yet to find a proper way to measure. The general idea is that I need some sort of measure which allows me to quantify the extent to which a certain graph is "ordered", in order to ascertain the predictive value of the nodes within the graph (i.e. if all the nodes have the exact same connection patterns, then effectively their predictive value is zero, though this is a very simplistic explanation, as there are many other contributing factors to a node's predictive power).
I've experimented with certain built-in MATLAB commands:
entropy - generally used to determine the entropy of an image
wentropy - to be honest, I do not fully understand the proper use of this function, but I've tried using it with the 'shannon' and 'log energy' types, and have produced some inconsistent results
This is a very basic script I whipped up to do some testing, which produces two matrices:
a 20*20 matrix constructed with values drawn entirely from a uniform distribution, intended to produce a matrix with a relatively low degree of order - unordgraph
a 20*20 matrix constructed with 4 5*5 "patches" in which the values are integers drawn from a uniform distribution with a given range that is significantly larger than one, while the rest of the values are drawn from a uniform distribution on the range 0-1 (as in the previous matrix); this form of graph is more "ordered" than the previous one - ordgraph
When I run the code:
clear all;
n = 50;
gsize = 20;
orderedrange = [100 200];
enttype = 'shannon';
for i = 1:n;
    unordgraph = rand(gsize);
    % entvec(1,i) = entropy(unordgraph);
    entvec(1,i) = wentropy(unordgraph,enttype);
    % ordgraph = reshape(1:gsize^2,gsize,gsize);
    ordgraph = rand(gsize);
    ordgraph(1:5,1:5) = randi(orderedrange,5);
    ordgraph(6:10,6:10) = randi(orderedrange,5);
    ordgraph(11:15,11:15) = randi(orderedrange,5);
    ordgraph(16:20,16:20) = randi(orderedrange,5);
    % entvec(2,i) = entropy(ordgraph);
    entvec(2,i) = wentropy(ordgraph,enttype);
end
fprintf('the mean entropy of the unordered graph is: %.4f\n',mean(entvec(1,:)));
fprintf('the mean entropy of the ordered graph is: %.4f\n',mean(entvec(2,:)));
I get outputs such as:
the mean entropy of the unordered graph is: 88.8871
the mean entropy of the ordered graph is: -23936552.0113
I'm not really sure about the meaning of such negative values, as running the same script on a matrix comprised entirely of zeros or ones (and hence maximally ordered) produces a mean entropy of 0.
I have a pretty rudimentary background in graph theory, making this task that much more difficult, and I would be really grateful for any help, whether theoretical or algorithmic.
Thanks in advance, Ron
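For what it's worth, a hedged sketch (not from the question) of one common notion of disorder for a transition probability matrix: the Shannon entropy of each row's outgoing transition distribution, normalized so that 0 means fully deterministic and 1 means fully uniform:
P = rand(20);                           % stand-in transition matrix (placeholder data)
P = bsxfun(@rdivide, P, sum(P, 2));     % normalize rows to sum to 1
rowEnt = -sum(P .* log2(P + eps), 2);   % Shannon entropy of each row, in bits
normEnt = rowEnt / log2(size(P, 2));    % scale to [0, 1] by the maximum possible entropy
meanEnt = mean(normEnt)                 % overall "disorder" of the graph (left unsuppressed to display)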
How to generate this shape in Matlab?
In MATLAB, how can I generate two clusters of random points like the following graph? Can you show me the script/code?
If you want to generate such data points, you will need to have their probability distribution to be able to generate the points. For your points, I do not have the real distributions, so I can only give an approximation. From your figure I see that both lie approximately on a circle, with a random radius and a limited span for the angle. I assume those angles and radii are uniformly distributed over certain ranges, which seems like a pretty good starting point.
Therefore it also makes sense to generate the random data in polar coordinates (i.e. angle and radius) instead of the cartesian ones (i.e. horizontal and vertical), and transform them to allow plotting.
C1 = [0 0]; % center of the circle
C2 = [-5 7.5];
R1 = [8 10]; % range of radii
R2 = [8 10];
A1 = [1 3]*pi/2; % [rad] range of allowed angles
A2 = [-1 1]*pi/2;
nPoints = 500;
urand = @(nPoints,limits)(limits(1) + rand(nPoints,1)*diff(limits));
randomCircle = @(n,r,a)(pol2cart(urand(n,a),urand(n,r)));
[P1x,P1y] = randomCircle(nPoints,R1,A1);
P1x = P1x + C1(1);
P1y = P1y + C1(2);
[P2x,P2y] = randomCircle(nPoints,R2,A2);
P2x = P2x + C2(1);
P2y = P2y + C2(2);
figure
plot(P1x,P1y,'or'); hold on;
plot(P2x,P2y,'sb'); hold on;
axis square
This yields:
This method works relatively well when you deal with distributions that you can transform easily and when you can easily describe the possible locations of the points. If you cannot, there are other methods, such as the inverse transform sampling method, which offer algorithms to generate the data instead of the manual variable transformations I did here.
K-means is not going to give you what you want. For K-means, vectors are classified based on their nearest cluster center. I can only think of two ways you could get the non-convex assignment shown in the picture:
Your input data is actually higher-dimensional, and your sample image is just a 2-D projection.
You're using a distance metric with different scaling across the dimensions.
To achieve your aim, either:
Use a non-linear clustering algorithm.
Apply a non-linear transform to your input data (probably not feasible).
You can find a list of non-linear clustering algorithms here. Specifically, look at this reference on the MST clustering page. Your exact shape appears on the fourth page of the PDF, together with a comparison of what happens with K-means. For existing MATLAB code, you could try this Kernel K-Means implementation. Also, check out the Clustering Toolbox.
Assuming that you really want to do the clustering operation on existing data, as opposed to generating the data itself: since you have a plot of some data, it seems logical that you already know how to do that! If I am wrong in this assumption, then you should word your questions more carefully in the future.
The human brain is quite good at seeing patterns in things like this; writing code to do it on a computer will often take some serious effort. As has been said already, traditional clustering tools such as k-means will fail. Luckily, the Image Processing Toolbox has good tools for these purposes already written. I might suggest converting the plot into an image, using filled-in dots to plot the points. Make sure the dots are large enough that they touch each other within a cluster, with some overlap. Then use dilation/erosion tools if necessary to make sure that any small cracks are filled in, but don't go so far as to cause the clusters to merge. Finally, use region segmentation tools to pick out the clusters. Once done, transform back from pixel units in the image into your spatial units, and you have accomplished your task.
For the image processing approach to work, you will need sufficient separation between the clusters compared to the coarseness within a cluster. But that seems obvious for any method to succeed.
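A rough sketch of that pipeline, assuming the Image Processing Toolbox and a 2-D point set in column vectors x and y (the grid resolution and dilation radius are placeholders to tune):
nGrid = 200;                                                      % rasterization resolution (placeholder)
xi = round((x - min(x)) / (max(x) - min(x)) * (nGrid - 1)) + 1;   % map point coordinates to pixel indices
yi = round((y - min(y)) / (max(y) - min(y)) * (nGrid - 1)) + 1;
BW = false(nGrid);
BW(sub2ind(size(BW), yi, xi)) = true;         % binary image with one pixel per point
BW = imdilate(BW, strel('disk', 3));          % dilate so points within a cluster touch (radius is a placeholder)
lbl = bwlabel(BW);                            % connected components = clusters
clusterID = lbl(sub2ind(size(lbl), yi, xi));  % cluster label for each original point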