How can I color-label the cluster data after GMM is fitted?

I am trying to do some labelling on cluster data following GMMs but haven't found a way to do it.
Let me explain:
I have some (x,y) data pairs stored in a 30000x2 array X. The array contains data from several known sources, and each source contributes the same number of points (so source 1 has 500 (x,y) pairs, source 2 has 500 (x,y) pairs, and so on, all appended into the array X above).
I have fitted a GMM to X. The cluster results are fine and as expected, but now that the data are clustered I want to color-code them based on their original source.
So let's say I want to show in black the data points of source 1 that are in cluster 2.
Is that possible?
Example:
In the original array we have three sources for the data. Source 1 is rows 1-10000, source 2 is rows 10001-20000 and source 3 is rows 20001-30000.
After GMM fitting and clustering, my data are clustered as in figure 1, and I got two clusters. The red colour in all of them is irrelevant.
I want to modify the color of the data points in cluster 2 based on their index in the original array X.
E.g., if a data point belongs to cluster 2 (clusteridx=2), I want to check which source it belongs to and then color and label it accordingly, so that you can tell which source each data point in cluster 2 comes from, as shown in the second figure.
[Figure 1: Original clusters]
[Figure 2: Desired labelling]

You could add a "source_id" column and then loop over its values when plotting. For example:
% setup fake data
source1 = rand(10,2);
source2 = rand(15,2);
source3 = rand(8,2);
% end setup
% append column with source_id (you could do this in a loop if you have many sources)
% size(...,1) counts rows; length() would return the larger dimension instead
source1 = [source1, repmat(1, size(source1,1), 1)];
source2 = [source2, repmat(2, size(source2,1), 1)];
source3 = [source3, repmat(3, size(source3,1), 1)];
mytable = array2table([source1; source2; source3]);
mytable.Properties.VariableNames = {'X' 'Y' 'source_id'};
figure
hold on;
for ii = 1:max(mytable.source_id)
    rows = mytable.source_id==ii;
    x = mytable.X(rows);
    y = mytable.Y(rows);
    label = char(strcat('Source ID =', {' '}, num2str(ii)));
    mycolor = rand(1,3);
    scatter(x,y, 'MarkerEdgeColor', mycolor, 'MarkerFaceColor', mycolor, 'DisplayName', label);
end
set(legend, 'Location', 'best')
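
To restrict the colouring to a single cluster, as the question asks, the source mask can be combined with a cluster mask. The following is only a sketch under stated assumptions: the data, nPerSource, targetCluster and the colour choices are made up for illustration, and clusterIdx is a stand-in for the labels you would get from your fitted model (for a gmdistribution object, something like cluster(gmModel, X)).

```matlab
% Stand-in data (hypothetical): three sources of 100 points each, appended in order
nPerSource = 100;
nSources   = 3;
X = [randn(nPerSource,2); randn(nPerSource,2) + 3; randn(nPerSource,2) + 1.5];

% Source of each row, recovered from the row index alone
sourceId = ceil((1:size(X,1))' / nPerSource);

% Stand-in cluster labels (assumption). With a fitted GMM this would be
% something like: clusterIdx = cluster(gmModel, X);
clusterIdx = 1 + (X(:,1) > 1.5);

colors = [0 0 0; 0 0.45 0.74; 0.85 0.33 0.10];   % black for source 1
targetCluster = 2;

figure; hold on;
% Points outside the target cluster in light grey, for context
other = clusterIdx ~= targetCluster;
scatter(X(other,1), X(other,2), 10, [0.8 0.8 0.8], 'filled', ...
    'DisplayName', 'Other clusters');
for s = 1:nSources
    rows = (clusterIdx == targetCluster) & (sourceId == s);
    if any(rows)   % skip sources with no points in the target cluster
        scatter(X(rows,1), X(rows,2), 12, colors(s,:), 'filled', ...
            'DisplayName', sprintf('Cluster %d, source %d', targetCluster, s));
    end
end
hold off;
legend('show', 'Location', 'best');
```

Because the sources are appended in equal-sized blocks, the source of a row follows from its index, so no extra bookkeeping column is strictly needed.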

Related

Creating plots based on single column variable

I'm fairly new to the Matlab community and need help with a particular plotting task! Any assistance would be greatly appreciated.
I've been tasked with creating an automated process that produces numerous 2d line graphs, using X (Elevation) and Y (Chainage) data based on survey data we have gathered in the field. This XY data needs to be split into differing figures in accordance to a variable within a 'Profile_ID' column. An example of the data is shown below:
Easting      Northing     Elevation  Chainage  FC  Name  Profile_ID
219578.603   101400.293   6.675      133.393   CE  N/A   7b01346
219577.925   101400.621   6.088      134.146   X   N/A   7b01346
219577.833   101400.709   6.037      134.267   X   N/A   7b01346
219577.378   101400.789   5.904      134.714   X   N/A   7b01346
219577.319   101400.987   5.887      134.850   X   N/A   7b01346
The Profile_ID changes throughout the .txt file. The file is ordered by Profile_ID and then by chainage.
However, I also need to overlay previous survey data on the graph for the corresponding 'Profile_ID'. So, essentially, I have 2 data sets with an identical column layout, just with differing X and Y data. One is from a previous survey and one is from the newest survey. I was hoping to find a way to run a for loop that creates a figure for every 'Profile_ID' and then overlays the previous survey's data for the same 'Profile_ID'.
I hope that this all makes sense. I've linked an example here: Example of desired graph produced by the script, for one iteration of 'Profile_ID'
Cheers!
clc
clear
close all
%Import inputs
point_file_old = readtable('7b7B3-2_20170627tp.csv'); %Input older file name here
matrix_profile_old = table2array(point_file_old(:,3:4)); %Extracting elevation & chainage column
id_old = point_file_old(:,7);
L_old = length(matrix_profile_old(:,1));
point_file_new = readtable('20220430_7b7B3-2tp.csv'); %Input newer file name here
matrix_profile_new = table2array(point_file_new(:,3:4)); %Extracting elevation & chainage column
id_new = point_file_new(:,7);
L_new = length(matrix_profile_new(:,1));
%Settings
chainage_old = matrix_profile_old(:,2); %Identifying old chainage column
elevation_old = matrix_profile_old(:,1); %Identifying old elevation column
chain_old_num = length(chainage_old); %Amount of rows in chainage
elev_old_num = length(elevation_old); %Amount of rows in elevation
chainage_new = matrix_profile_new(:,2); %Identifying new chainage column
elevation_new = matrix_profile_new(:,1); %Identifying new elevation column
chain_new_num = length(chainage_new); %Amount of rows in chainage
elev_new_num = length(elevation_new); %Amount of rows in elevation
t_old = table(chainage_old(:,1), elevation_old(:,1), table2array(id_old(:,1)));
t_new = table(chainage_new(:,1), elevation_new(:,1), table2array(id_new(:,1)));
G = findgroups(t_new(:,3));
temp = splitapply(@(varargin) {sortrows(table(varargin{:}),1)}, t_new, G); %order separated groups in terms of chainage (column 1 of t_new)
So I currently have the data sorted into groups and now need to plot each individual group and then finally overlay the previous data to corresponding group data.

One approach is to combine all of the data (old and new) into a single table, then find the unique IDs and loop over them. For each ID you can identify the relevant rows and plot them.
Note that the table2array usage in your current code makes your life harder. Tables are nice because you can index columns directly by name, so this:
table2array(point_file_new(:,3));
becomes this:
point_file_new.Elevation;
Commented code below:
data_old = readtable('7b7B3-2_20170627tp.csv'); %Input older file name here
data_old.Source(:) = {'Old'}; % Add this column so we can track the source later
data_new = readtable('20220430_7b7B3-2tp.csv'); %Input newer file name here
data_new.Source(:) = {'New'}; % Add this column so we can track the source later
data_all = [data_old; data_new]; % combine data into single table
IDs = unique( data_all.Profile_ID ); % Get unique Profile_ID values
NID = numel(IDs); % Number of unique IDs
for ii = 1:NID
    ID = IDs{ii}; % Current ID
    idxID = ismember( data_all.Profile_ID, ID ); % Rows with this ID
    idxOld = strcmp(data_all.Source, 'Old'); % Rows from old data
    idxOld = idxOld & idxID; % Rows from old data and this ID
    idxNew = strcmp(data_all.Source, 'New'); % Rows from new data
    idxNew = idxNew & idxID; % Rows from new data and this ID
    figure(); % Make a new figure for this ID
    hold on; % hold so we can plot multiple lines
    % Chainage goes on the x axis and elevation on the y axis, matching the labels below
    plot( data_all.Chainage(idxOld), data_all.Elevation(idxOld), 'displayname', 'Old data' ); % plot old
    plot( data_all.Chainage(idxNew), data_all.Elevation(idxNew), 'displayname', 'New data' ); % plot new
    % Add labels/title
    xlabel( 'Chainage (m)' );
    ylabel( 'Elevation (m)' );
    title( ID );
    grid on;
    hold off; % done plotting
    legend('show','location','best');
end

How can I merge tables with different numbers of rows?

I'm currently trying to create a signal process diagram in MATLAB. To do this, I have 3 tables containing different signals that I would like to merge so they can be plotted on the same graph (but kept apart so that the signals can be seen separately).
So far I have tried:
% The variables below are examples of the tables that contain
% the variables I would like to plot.
s1 = table(data1.Time, data1.Var1); % This is a 8067x51 table
s2 = table(data2.Time, data2.Var2); % This is a 2016x51 table
s3 = table(data3.Time, data3.Var3); % This is a 8065x51 table
% This gives an error of 'must contain same amount of rows.'
S = [s1, s2, s3];
% This puts the three tables into a cell array
S = {s1, s2, s3};
Any suggestions welcome.

You were close. You just need to concatenate your tables vertically instead of horizontally:
S = [s1; s2; s3];
% Or in functional form
S = vertcat(s1, s2, s3);
Note that this only works if all the tables have the same number of variables (i.e. columns) and the same variable names.
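
As a rough sketch of how the stacked table can still be plotted signal by signal, one option is to tag each table with an identifier column before concatenating. The table contents, the variable names Time and Value, and the Signal column are all hypothetical stand-ins, not taken from the question's data.

```matlab
% Hypothetical stand-in tables with different numbers of rows
s1 = table(linspace(0,1,10)', rand(10,1), 'VariableNames', {'Time','Value'});
s2 = table(linspace(0,1,4)',  rand(4,1),  'VariableNames', {'Time','Value'});
s3 = table(linspace(0,1,5)',  rand(5,1),  'VariableNames', {'Time','Value'});

% Tag each table so its rows can be identified after stacking
s1.Signal = repmat(1, height(s1), 1);
s2.Signal = repmat(2, height(s2), 1);
s3.Signal = repmat(3, height(s3), 1);

% Vertical concatenation works because the variable names match
S = [s1; s2; s3];

figure; hold on;
for k = 1:3
    rows = S.Signal == k;
    plot(S.Time(rows), S.Value(rows), 'DisplayName', sprintf('Signal %d', k));
end
hold off;
legend('show', 'Location', 'best');
```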

MATLAB: vectors of different length

I want to create a MATLAB function to import data from files in another directory and fit them to a given model. The data need to be filtered first, because there is "trash" data in different places in the files (e.g. measurements of nothing before the analyzed motion starts).
So the vectors that contain the data used for the fit end up having different lengths, and I can't return them in a matrix (e.g. x in my function below). How can I solve this?
I have a lot of data files, so I don't want to use a "manual" method. My function is below. Any and all suggestions are welcome.
datafit.m
function [p, x, y_c, y_func] = datafit(pattern, xcol, ycol, xfilter, calib, p_calib, func, p_0, nhl)
    datafiles = dir(pattern);
    path = fileparts(pattern);
    p = NaN(length(datafiles));
    y_func = [];
    for i = 1:length(datafiles)
        exist(strcat(path, '/', datafiles(i).name));
        filename = datafiles(i).name;
        data = importdata(strcat(path, '/', datafiles(i).name), '\t', nhl);
        filedata = data.data/1e3;
        xdata = filedata(:,xcol);
        ydata = filedata(:,ycol);
        filter = filedata(:,xcol) > xfilter(i);
        x(i,:) = xdata(filter);
        y(i,:) = ydata(filter);
        y_c(i,:) = calib(y(i,:), p_calib);
        error = @(par) sum(power(y_c(i,:) - func(x(i,:), par),2));
        p(i,:) = fminsearch(error, p_0);
        y_func = [y_func; func(x(i,:), p(i,:))];
    end
end
sample data: http://hastebin.com/mokocixeda.md

There are two strategies I can think of:
1. Return the data in a cell array instead, where each cell stores a vector of a different length. You access the contents much as you would an array, but with curly braces: say c{1} = [1 2 3], c{2} = [1 2 10 8 5], c{3} = [].
2. Filter the trash data as you read each line, if that makes your vectors the same length.
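
A minimal sketch of the cell array approach, assuming made-up data in place of the imported files (raw and the threshold values are hypothetical; only the name xfilter is borrowed from the question):

```matlab
% Hypothetical stand-in for the per-file data and per-file thresholds
raw     = {[0.1 0.5 1.2 2.0 3.1], [0.2 0.9 1.5], [0.05 0.4 0.8 1.1 1.9 2.5 3.3]};
xfilter = [0.3 0.5 1.0];

% One cell per file; each cell can hold a vector of a different length
x = cell(numel(raw), 1);
for i = 1:numel(raw)
    xdata = raw{i};
    x{i}  = xdata(xdata > xfilter(i));   % keep only samples above the threshold
end

% Curly braces return the stored vector itself
lengths = cellfun(@numel, x);
firstFiltered = x{1};
```

Inside a function like datafit, x{i} would replace x(i,:), and the same substitution would apply to y, y_c and y_func.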

If memory is not a major issue, try padding the vectors with a distinct value such as NaN or Inf, i.e. anything that cannot occur in your measurements given their physical context. You may need to identify the longest data set before you allocate memory for your matrices (*). This way you can use equally sized matrices and easily ignore the "empty data" later on.
(*) Idea: allocate memory based on the size of the largest file first, and fill it with e.g. NaNs:
matrix = NaN(length(datafiles), longest_file_line_number);
Then run your function. Determine the length of the longest consecutive set of data.
% old_max must be initialised (e.g. to 0) before the loop
new_max = length(xdata(filter));
if new_max > old_max
    old_max = new_max;
end
matrix(i, 1:new_max) = xdata(filter); % fill the first new_max columns of row i
Crop your matrix accordingly, before the function returns it ...
matrix = matrix(:, 1:old_max);
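
Put together, the padding idea might look like the sketch below. It assumes the filtered vectors are already available (here as a hypothetical cell array raw), so the longest length is known up front and no cropping step is needed.

```matlab
% Hypothetical filtered data sets of unequal length
raw  = {[1 2 3], [4 5 6 7 8], [9 10]};
lens = cellfun(@numel, raw);

% Pre-allocate with NaN, sized to the longest data set
matrix = NaN(numel(raw), max(lens));
for i = 1:numel(raw)
    matrix(i, 1:lens(i)) = raw{i};   % the remaining columns stay NaN
end

% The padding can be ignored later, e.g. with the 'omitnan' flag
rowMeans = mean(matrix, 2, 'omitnan');
```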

about labeling the x axis

I have a lot of data that needs to be plotted as a waterfall in MATLAB. I have more than 10 data sets, and I put them all in a big matrix such that the first data set is in the first row of the matrix, the second data set is in the second row, etc. After storing the data in the matrix, I use waterfall to plot it. Each data set contains about 10,000 data points, which correspond to an x variable ranging from -5 to 5. But the waterfall plot shows 0 to 10,000 on the x axis instead of -5 to 5. How do I force MATLAB to show the correct range? Thanks.
mydata = zeros(13, 10000);
mydata(1, :) = ... ; % first data set
mydata(2, :) = ... ; % second data set
...
mydata(13, :) = ... ; % last data set
waterfall(mydata)

If you look at the documentation for waterfall (you can do this easily by placing the cursor on the command in your editor and pressing F1), you will see that you can invoke waterfall with several different syntaxes:
% Syntax
waterfall(Z)
waterfall(X,Y,Z)
waterfall(...,C)
waterfall(axes_handles,...)
h = waterfall(...)
Rather than calling waterfall with just the data Z, also supply the X and Y range data. For example:
mydata = rand(13, 10000);
Y = 1:size(mydata,1);
X = linspace(-5, 5,size(mydata,2));
waterfall(X, Y , mydata)

Kmean plotting in matlab

I am working on a thumb recognition system in MATLAB. I implemented the k-means algorithm and got results. Now I want to plot the results the way they did here. I have tried but have not been able to. I am using the following code.
load training.mat; % loaded just to get trainingData variable
labelData = zeros(200,1);
labelData(1:100,:) = 0;
labelData(101:200,:) = 1;
k=2;
[trainCtr, traina] = kmeans(trainingData,k);
trainingResult1=[];
for i=1:k
    trainingResult1 = [trainingResult1 sum(trainCtr(1:100)==i)];
end
trainingResult2=[];
for i=1:k
    trainingResult2 = [trainingResult2 sum(trainCtr(101:200)==i)];
end
load testing.mat; % loaded just to get testingData variable
c1 = zeros(k,1054);
c1 = traina;
cluster = zeros(200,1);
for j=1:200
    testTemp = repmat(testingData(j,1:1054),k,1);
    difference = sum((c1 - testTemp).^2, 2);
    [value, index] = min(difference);
    cluster(j,1) = index;
end
testingResult1 = [];
for i=1:k
    testingResult1 = [testingResult1 sum(cluster(1:100)==i)];
end
testingResult2 = [];
for i=1:k
    testingResult2 = [testingResult2 sum(cluster(101:200)==i)];
end
In the above code, trainingData is a 200 x 1054 matrix: the 200 rows are images of thumbs and the 1054 columns are features. Each image is 25 x 42, which I reshaped into a row vector (1 x 1050), plus 4 other feature columns, giving 1054 columns per image. I built testingData in the same manner as trainingData, so it is also 200 x 1054. Now my problem is just to plot the results as they did here.

After selecting 2 features, you can just follow the example. Start a figure, use hold on, and use plot or scatter to plot the centroids and the data points. E.g.
selectedFeatures = [42,43];
plot(trainingData(trainCtr==1,selectedFeatures(1)), ...
    trainingData(trainCtr==1,selectedFeatures(2)), ...
    'r.','MarkerSize',12)
This plots the selected feature values of the data points in cluster 1.
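
Extending this to all clusters plus the centroids could look roughly like the following. It is a sketch on made-up data: the 200x50 matrix, the labels and the centroids are stand-ins so that the snippet runs without the question's .mat files. In the question's code, trainCtr and traina would instead come from [trainCtr, traina] = kmeans(trainingData, k).

```matlab
rng(0);                                    % reproducible stand-in data
k = 2;
trainingData = [randn(100,50); randn(100,50) + 2];   % hypothetical 200x50 features
trainCtr = [ones(100,1); 2*ones(100,1)];   % stand-in cluster labels

% Stand-in centroids: the mean of each cluster's rows
traina = zeros(k, size(trainingData,2));
for c = 1:k
    traina(c,:) = mean(trainingData(trainCtr == c, :), 1);
end

selectedFeatures = [42, 43];
styles = {'r.', 'b.'};

figure; hold on;
for c = 1:k
    plot(trainingData(trainCtr == c, selectedFeatures(1)), ...
         trainingData(trainCtr == c, selectedFeatures(2)), ...
         styles{c}, 'MarkerSize', 12);
end
% Centroids, restricted to the same two features
plot(traina(:, selectedFeatures(1)), traina(:, selectedFeatures(2)), ...
     'kx', 'MarkerSize', 15, 'LineWidth', 3);
hold off;
legend('Cluster 1', 'Cluster 2', 'Centroids', 'Location', 'best');
```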