I am unclear about why k-means clustering can have overlap in clusters. From Chen (2018) I saw the following definition:
"..let the observations be a sample set to be partitioned into K disjoint clusters"
However I see an overlap in my plots, and am not sure why this is the case.
For reference, I am trying to cluster a multi-dimensional dataset with three variables (Recency, Frequency, Revenue). To visualize the clustering, I project the 3D data into 2D using PCA and run k-means on that. Below is the code and the plot I get:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
df1 = tx_user[["Recency", "Frequency", "Revenue"]]
# Standardize the three variables
names = df1.columns
# Create the Scaler object
scaler = preprocessing.StandardScaler()
# Fit the data on the Scaler object
scaled_df1 = scaler.fit_transform(df1)
df1 = pd.DataFrame(scaled_df1, columns=names)
df1.head()
del scaled_df1
# Project the 3D data to 2D with PCA
sklearn_pca = PCA(n_components=2)
X1 = sklearn_pca.fit_transform(df1)
X1 = X1[:, ::-1]  # flip axes for better plotting
# Run k-means on the 2D projection
kmeans = KMeans(3, random_state=0)
labels = kmeans.fit(X1).predict(X1)
plt.scatter(X1[:, 0], X1[:, 1], c=labels, s=40, cmap='viridis');
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)

    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)

    # plot the representation of the KMeans model: one circle per cluster,
    # with radius equal to the distance from the center to its farthest member
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))

kmeans = KMeans(n_clusters=4, random_state=0)
plot_kmeans(kmeans, X1)
My question is:
1. Why is there an overlap? Is my clustering wrong if there is?
2. How does k-means decide cluster assignment in case there is an overlap?
Thank you
Reference:
Chen, L., Xu, Z., Wang, H., & Liu, S. (2018). An ordered clustering algorithm based on K-means and the PROMETHEE method. International Journal of Machine Learning and Cybernetics, 9(6), 917-926.
K-means computes k clusters by averaging: each cluster is defined by its computed center and is therefore unique by definition.
Each sample is assigned to the cluster whose center is closest, which is also unique by definition. So in this sense there is NO OVERLAP.
However, for a given distance d > 0, a sample may lie within distance d of more than one cluster center. This is what you are seeing when you say overlap. The sample is still assigned only to the closest cluster, not to all of them. So no overlap.
NOTE: In the case where a sample is exactly equidistant from more than one cluster center, it can be assigned arbitrarily among the closest clusters; this changes nothing important in the algorithm or the results, since the centers are re-computed after assignment.
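To see this concretely, here is a minimal sketch (synthetic data; the variable names are illustrative): every point receives exactly one label, namely the index of its nearest centroid.
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
X = rng.rand(200, 2)

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
dists = cdist(X, km.cluster_centers_)               # distance of every point to every center
assert (dists.argmin(axis=1) == km.labels_).all()   # each label is the nearest center: no overlap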
The k-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
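As a quick check of that objective, here is a small sketch (synthetic data) that recomputes the within-cluster sum of squares by hand and compares it to scikit-learn's inertia_:
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(200, 2)

km = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
# sum of squared distances of each point to its own cluster centroid
wcss = sum(((X[km.labels_ == i] - c) ** 2).sum()
           for i, c in enumerate(km.cluster_centers_))
print(np.isclose(wcss, km.inertia_))  # True: this is the quantity k-means minimizes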
Perhaps you did something wrong... I don't have your data, so I can't test it. You can draw the cluster boundaries (the Voronoi cells of the centroids) and check those. See the sample code below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi

def voronoi_finite_polygons_2d(vor, radius=None):
    """
    Reconstruct infinite Voronoi regions in a 2D diagram to finite
    regions.

    Parameters
    ----------
    vor : Voronoi
        Input diagram
    radius : float, optional
        Distance to 'points at infinity'.

    Returns
    -------
    regions : list of tuples
        Indices of vertices in each revised Voronoi region.
    vertices : list of tuples
        Coordinates for revised Voronoi vertices. Same as coordinates
        of input vertices, with 'points at infinity' appended to the
        end.
    """
    if vor.points.shape[1] != 2:
        raise ValueError("Requires 2D input")

    new_regions = []
    new_vertices = vor.vertices.tolist()

    center = vor.points.mean(axis=0)
    if radius is None:
        radius = np.ptp(vor.points, axis=0).max() * 2

    # Construct a map containing all ridges for a given point
    all_ridges = {}
    for (p1, p2), (v1, v2) in zip(vor.ridge_points, vor.ridge_vertices):
        all_ridges.setdefault(p1, []).append((p2, v1, v2))
        all_ridges.setdefault(p2, []).append((p1, v1, v2))

    # Reconstruct infinite regions
    for p1, region in enumerate(vor.point_region):
        vertices = vor.regions[region]

        if all(v >= 0 for v in vertices):
            # finite region
            new_regions.append(vertices)
            continue

        # reconstruct a non-finite region
        ridges = all_ridges[p1]
        new_region = [v for v in vertices if v >= 0]

        for p2, v1, v2 in ridges:
            if v2 < 0:
                v1, v2 = v2, v1
            if v1 >= 0:
                # finite ridge: already in the region
                continue

            # Compute the missing endpoint of an infinite ridge
            t = vor.points[p2] - vor.points[p1]  # tangent
            t /= np.linalg.norm(t)
            n = np.array([-t[1], t[0]])  # normal

            midpoint = vor.points[[p1, p2]].mean(axis=0)
            direction = np.sign(np.dot(midpoint - center, n)) * n
            far_point = vor.vertices[v2] + direction * radius

            new_region.append(len(new_vertices))
            new_vertices.append(far_point.tolist())

        # sort region counterclockwise
        vs = np.asarray([new_vertices[v] for v in new_region])
        c = vs.mean(axis=0)
        angles = np.arctan2(vs[:, 1] - c[1], vs[:, 0] - c[0])
        new_region = np.array(new_region)[np.argsort(angles)]

        # finish
        new_regions.append(new_region.tolist())

    return new_regions, np.asarray(new_vertices)
# make up data points
np.random.seed(1234)
points = np.random.rand(15, 2)
# compute Voronoi tessellation
vor = Voronoi(points)
# plot
regions, vertices = voronoi_finite_polygons_2d(vor)
print("--")
print(regions)
print("--")
print(vertices)
# colorize
for region in regions:
    polygon = vertices[region]
    plt.fill(*zip(*polygon), alpha=0.4)
plt.plot(points[:,0], points[:,1], 'ko')
plt.axis('equal')
plt.xlim(vor.min_bound[0] - 0.1, vor.max_bound[0] + 0.1)
plt.ylim(vor.min_bound[1] - 0.1, vor.max_bound[1] + 0.1)
plt.show()
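Note that if you build the Voronoi diagram from kmeans.cluster_centers_ rather than from random points, the colored cells are exactly the k-means decision regions: every location inside a cell is closer to that cell's centroid than to any other, so the assignment regions themselves can never overlap, even when the bounding circles drawn earlier do.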
Great resource here.
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html
Related question:
Suppose that I have generated some data in matlab as follows:
n = 100;
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
plot(x,y,'rx')
axis([0 100 0 1])
Now I want to write an algorithm to classify these data into some (arbitrary number of) clusters, such that a point is a member of a cluster only if the distance between this point and at least one member of the cluster is less than 10. How could I write the code?
The clustering method you are describing is DBSCAN. Note that this algorithm will find only one cluster in the provided data, since it is very unlikely that there is a point whose distance to all other points is more than 10.
If this is really what you want, you can use the built-in dbscan function, or the implementation posted on the File Exchange if you are using a release older than R2019a.
% Generating random points, similar to the data provided by the OP
data = bsxfun(@times, rand(100, 2), [100 1]);
% Adding more random points drawn from a few Gaussian blobs
for i = 1:5
    mu = rand(1, 2)*100 - 50;
    A = rand(2)*5;
    sigma = A*A' + eye(2)*(1 + rand*2);
    data = [data; mvnrnd(mu, sigma, 20)];
end
% Clustering using DBSCAN, with epsilon = 10 and min-points = 1
idx = DBSCAN(data, 10, 1);
% Plotting the clusters
numCluster = max(idx);
colors = lines(numCluster);
scatter(data(:, 1), data(:, 2), 30, colors(idx, :), 'filled')
title(['No. of Clusters: ' num2str(numCluster)])
axis equal
The numbers in the resulting figure show the distances between the closest pairs of points in any two different clusters.
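For comparison, a minimal Python sketch of the same clustering with scikit-learn (synthetic data mimicking the OP's layout; eps and min_samples mirror the epsilon = 10 and min-points = 1 above):
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
data = rng.rand(100, 2) * [100, 1]          # skewed layout, like the OP's data

labels = DBSCAN(eps=10, min_samples=1).fit_predict(data)
print("clusters found:", labels.max() + 1)  # min_samples=1 leaves no noise points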
The MATLAB built-in function clusterdata() works well for what you're asking.
Here is how to apply it to your example:
% number of points
n = 100;
% create the data
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
% the number of clusters you want to create
num_clusters = 5;
T1 = clusterdata(data,'Criterion','distance',...
'Distance','euclidean',...
'MaxClust', num_clusters)
scatter(x, y, 100, T1,'filled')
In this case, I used 5 clusters and the Euclidean distance as the metric to group the data points, but you can always change that (see the documentation of clusterdata()).
See the result below for 5 clusters with some random data.
Note that the data is skewed (x-values are from 0 to 100, and y-values are from 0 to 1), so the results are also skewed, but you could always normalize your data.
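For reference, roughly the same grouping can be sketched in Python with SciPy's hierarchical clustering (an approximation of clusterdata, not a line-for-line port; the linkage method is an assumption):
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
data = np.column_stack([rng.randint(1, 101, 100), rng.rand(100)])

Z = linkage(data, method='ward', metric='euclidean')  # build the cluster tree
T = fcluster(Z, t=5, criterion='maxclust')            # cut it into 5 clusters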
Here is a way using the connected components of a graph:
D = pdist2(data, data) < 10;
D(1:size(D,1)+1:end) = 0;
G = graph(D);
C = conncomp(G);
The vector C returned by conncomp contains the cluster number of each point.
Use pdist2 to compute the pairwise distance matrix of the data points.
Use the distance matrix to create a logical adjacency matrix in which two points are neighbors if the distance between them is less than 10.
Set the diagonal elements of the adjacency matrix to 0 to eliminate self-loops.
Create a graph from the adjacency matrix.
Compute the connected components of the graph.
Note that pdist2 may not be practical for large datasets, and you may need another method to form a sparse adjacency matrix.
I noticed after posting my answer that the answer provided by @saastn suggests the DBSCAN algorithm, which follows nearly the same approach.
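A rough Python sketch of the same graph-based idea, for reference (SciPy-based; the variable names are illustrative):
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

rng = np.random.RandomState(0)
pts = rng.rand(100, 2) * [100, 1]

A = cdist(pts, pts) < 10                  # adjacency: edge where distance < 10
np.fill_diagonal(A, False)                # eliminate self-loops
n_clusters, labels = connected_components(csr_matrix(A), directed=False)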
I am trying to use skimage.measure.marching_cubes_lewiner to resolve some isosurface f(x,y,z)=0. In my case f is strongly nonlinear and is best mapped when the coordinates are given with logarithmic spacing. Because marching cubes wants a regular grid to build the voxels, I work on a meshgrid of coordinates X,Y,Z that correspond to the log10 of my original coordinates, so that my isosurface is equivalently given by f(10**X,10**Y,10**Z)=0. Everything would be fine, except that when I work with X,Y,Z in [-1.5,2]^3 (equivalent to x,y,z in [0.03,100.]^3), the vertex coordinates of the solution returned by skimage.measure.marching_cubes_lewiner are not in this cube.
Following the answer to another related question on SO, I thought it could be because the algorithm works in terms of a unitary volume, so that I need to set the right spacing input argument in my call to skimage.measure.marching_cubes_lewiner. On that basis, since I am mapping my function f on a grid of N points per coordinate, the exponent step per coordinate is numpy.diff([-1.5,2])/N, and I accordingly call:
import numpy as np
from skimage import measure as msr

def f(x, y, z):
    val = ...  # some lengthy code to define my implicit function
    return val

# Define ranges of my coordinates
xRange = [0.03, 100.]
yRange = [0.03, 100.]
zRange = [0.03, 100.]
XRange = np.log10(xRange)
YRange = np.log10(yRange)
ZRange = np.log10(zRange)
# Create regular grid (in log space)
N = 50  # number of points per coordinate
X, Y, Z = np.mgrid[XRange[0]:XRange[1]:N*1j,
                   YRange[0]:YRange[1]:N*1j,
                   ZRange[0]:ZRange[1]:N*1j]
F = f(10**X, 10**Y, 10**Z)
sol, _, _, _ = msr.marching_cubes_lewiner(F, 0.0,
                                          spacing=(np.diff(XRange)[0]/N,
                                                   np.diff(YRange)[0]/N,
                                                   np.diff(ZRange)[0]/N))
Yet, unexpectedly, the coordinates of the solution points generally lie in [0,Vx]*[0,Vy]*[0,Vz], with Vx > XRange[-1], Vy > YRange[-1] and Vz > ZRange[-1]. I have no clue why this happens or how to properly rescale the coordinates of my isosurface solution to the real units of my problem.
I am trying to generate a set of points, where groups of m points are evenly distributed over a large area. I have solved the problem (solution below), but I am looking for a more elegant or at least faster solution.
Say we have 9 points we want to place in groups of 3 in an area specified by x=[0,5] and y=[0,5]. Then I first generate a mesh in this area
meshx = 0:0.01:5;
meshy = 0:0.01:5;
[X,Y] = meshgrid(meshx,meshy);
X = X(:); Y = Y(:);
Then, to place the 9/3 = 3 groups evenly, I apply k-means clustering:
idx = kmeans([X,Y],3);
Then for each cluster, I can now draw a random sample of 3 points, which I save to a list:
pos = zeros(9,2);
for i = 1:max(idx)
    spaceX = X(idx==i);
    spaceY = Y(idx==i);
    %on = convhulln([spaceX,spaceY]);
    %plot(spaceX(on),spaceY(on),'black')
    %hold on
    sample = datasample([spaceX,spaceY],3,1);
    %plot(sample(:,1),sample(:,2),'black*')
    %hold on
    pos((i-1)*3+1:i*3,:) = sample;
end
If you uncomment the commented lines, the code will also plot the clusters and the locations of the points within them. My problem, as mentioned, is primarily to avoid having to cluster a rather fine uniform grid, so as to make the code more efficient.
Instead of kmeans you can use atan2:
x = -10:10;
idx = ceil((bsxfun(@atan2, x, x.') + pi) * (3/(2*pi)));
imshow(idx, [])
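For context: bsxfun(@atan2, x, x.') evaluates an angle at every grid point, and ceil((angle + pi)*(3/(2*pi))) maps that angle, which lies in (-pi, pi], to an integer from 1 to 3, i.e. one of three equal angular sectors around the origin. The group index is therefore computed directly, with no clustering of a fine grid required.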
I have a set of 3D data points (x, y, z) lying in a given plane, and I would like to fit an ellipse to those points.
I found a lot of answers about how to fit an ellipse to 2D points. So, more precisely, my question is: how do I transform the 3D (x, y, z) points into 2D (x, y) points?
Here is my Python code for this problem. This link helped me through my implementation: https://meshlogic.github.io/posts/jupyter/curve-fitting/fitting-a-circle-to-cluster-of-3d-points/
import numpy as np
from skimage.measure import EllipseModel

#-------------------------------------------------------------------------------
# RODRIGUES ROTATION
# - Rotate given points based on a starting and ending vector
# - Axis k and angle of rotation theta given by vectors n0, n1
#   P_rot = P*cos(theta) + (k x P)*sin(theta) + k*<k,P>*(1-cos(theta))
#-------------------------------------------------------------------------------
def rodrigues_rot(P, n0, n1):
    # If P is only a 1d array (coords of a single point), promote it to a matrix
    if P.ndim == 1:
        P = P[np.newaxis, :]

    # Get the vector of rotation k and the angle theta
    n0 = n0/np.linalg.norm(n0)
    n1 = n1/np.linalg.norm(n1)
    k = np.cross(n0, n1)
    k = k/np.linalg.norm(k)
    theta = np.arccos(np.dot(n0, n1))

    # Compute the rotated points
    P_rot = np.zeros((len(P), 3))
    for i in range(len(P)):
        P_rot[i] = P[i]*np.cos(theta) + np.cross(k, P[i])*np.sin(theta) \
                   + k*np.dot(k, P[i])*(1 - np.cos(theta))
    return P_rot
def fit_an_ellipse(P):
    P_mean = P.mean(axis=0)
    P_centered = P - P_mean

    # Fit a plane by SVD of the mean-centered data.
    # The normal vector of the fitting plane is given by the 3rd row of V^T
    # (np.linalg.svd returns V^T, so we select the 3rd row)
    U, s, V = np.linalg.svd(P_centered, full_matrices=False)
    normal = V[2, :]

    # Rotate the points so the plane coincides with the X-Y plane
    P_xy = rodrigues_rot(P_centered, normal, [0, 0, 1])

    # Use skimage's EllipseModel to fit an ellipse to the set of 2D points
    ell = EllipseModel()
    ell.estimate(P_xy[:, :2])

    # Generate n 2D points on the fitted ellipse
    n = 100
    xy = ell.predict_xy(np.linspace(0, 2*np.pi, n))

    # Convert the generated 2D points back to 3D space
    points = np.column_stack([xy, np.zeros(len(xy))])
    ellipse_points_3d = rodrigues_rot(points, [0, 0, 1], normal) + P_mean
    return ellipse_points_3d
To test the code, you can run the following and check the output:
import numpy as np
import plotly
import plotly.graph_objs as go
P = np.array([[52.21818786, 7.86337722, 57.83456389],
[30.55316226, 32.36591494, 14.35753359],
[59.77387002, 14.29531811, 53.6462596 ],
[42.85677086, 32.67223954, -5.95323959],
[44.46449002, 1.43144171, 54.0253186 ],
[27.6464027 , 19.80836045, -1.5754063 ],
[63.48591069, 6.88329618, 57.55556516],
[44.19484831, 28.32302575, 6.01730042],
[46.09443886, 2.71782362, 57.98617489],
[22.55050927, 30.28315605, 42.5642505 ],
[20.16244533, 18.55944689, 34.06871328],
[69.4591254 , 33.62256919, 40.91996533],
[24.42183439, 5.95578526, 35.80224431],
[70.09161495, 24.03152634, 45.77915268],
[28.68122335, -6.64788396, 37.53577535],
[59.84340586, 23.47833222, 60.01530894],
[23.98376474, 14.23114661, 32.43676647],
[73.28044481, 29.29426891, 39.28801852],
[28.48679585, -5.33220296, 36.04206575],
[54.66351746, 15.7561502 , 51.20981383],
[38.33444206, -0.08003422, 41.2639318 ],
[57.27722964, 39.91662965, 20.63778872],
[43.24856256, 7.79042068, 50.95451935],
[64.68788661, 31.78841088, 27.19632274],
[41.67377653, -0.18313508, 49.56081237],
[60.577958 , 35.8138609 , 28.9005053 ]])
ellipse_points = fit_an_ellipse(P)
lines = []
lines.append(go.Scatter3d(x=P[:, 0], y=P[:, 1], z=P[:, 2],
                          name='P', opacity=1))
lines.append(go.Scatter3d(x=ellipse_points[:, 0], y=ellipse_points[:, 1],
                          z=ellipse_points[:, 2], name='ellipse_points',
                          opacity=1))
plotly.offline.iplot(lines)
The output result: [figure omitted: the 3D fitting result, showing P and ellipse_points]
You can try the code yourself in Colab: https://colab.research.google.com/drive/1snI2-_S1CY8iUtszRP1bzsFULEQYdJym?usp=sharing
In standard projections, an ellipse (and a circle is an ellipse with a = b = r) projects to an ellipse or to a line. I will use this behaviour, so the 3D ellipse you want will be defined via a different ellipse that you can calculate.
I will not show the code, but the approach can be (a rough sketch follows the list):
Arrange the data in an M x 3 matrix, where M is the number of points, with rows of the form [x, y, z].
Define the plane in the form z = f(x, y).
Search the data matrix for rows equal or similar to [x, y, f(x,y)] vectors.
Suppose the resulting points are spaced in an ellipse shape. Then use the answers on how to fit an ellipse to the [x, y] pairs in the matrix resulting from the search (ignoring the z part can be understood as a projection onto the x-y plane).
Transform the fitted data into an N x 2 matrix of the form [x_fit, y_fit].
Expand the last matrix into the form [x_fit, y_fit, f(x_fit, y_fit)].
Voila - here we have the fitted ellipse.
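For illustration, a minimal numpy sketch of steps 4-6, under the assumption that the plane is known as z = a*x + b*y + c (the coefficients a, b, c and the function name are hypothetical, and EllipseModel is used as in the answer above):
import numpy as np
from skimage.measure import EllipseModel

def fit_ellipse_on_plane(P, a, b, c):
    xy = P[:, :2]                                  # project onto the x-y plane
    ell = EllipseModel()
    ell.estimate(xy)                               # fit an ellipse to the [x, y] pairs
    xy_fit = ell.predict_xy(np.linspace(0, 2*np.pi, 100))
    z_fit = a*xy_fit[:, 0] + b*xy_fit[:, 1] + c    # lift the fit back onto the plane
    return np.column_stack([xy_fit, z_fit])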
Suppose I have 3+ coplanar but not collinear points in R^4. To find the 2D plane (not hyperplane) in which they all lie, I used the following plane fit algorithm from MatlabCentral:
function [n,V,p] = affine_fit(X)
    % Computes the plane that best fits (least squares of the normal distance
    % to the plane) a set of sample points.
    % INPUTS:
    % X: an N by 3 matrix where each row is a sample point
    % OUTPUTS:
    % n: a unit (column) vector normal to the plane
    % V: a 3 by 2 matrix; the columns of V form an orthonormal basis of the plane
    % p: a point belonging to the plane
    % NB: this code actually works in any dimension (2, 3, 4, ...)
    % Author: Adrien Leygue
    % Date: August 30 2013

    % The mean of the samples belongs to the plane
    p = mean(X,1);
    % The samples are reduced:
    R = bsxfun(@minus, X, p);
    % Computation of the principal directions of the sample cloud
    [V,D] = eig(R'*R);
    % Extract the output from the eigenvectors
    n = V(:,1);
    V = V(:,2:end);
end
I employed the algorithm in a higher dimension than specified, so X is a 4x4 matrix that holds 4 points in 4 coordinate dimensions. The generated output is something like this:
[n,V,p] = affine_fit(X);

n = -0.0252
    -0.0112
     0.9151
    -0.4024

V =  0.9129   -0.3475    0.2126
     0.3216    0.2954   -0.8995
     0.1249    0.3532    0.1493
     0.2180    0.8168    0.3512

p = -0.9125    1.0526    0.2325   -0.0621
What I want to do now is find out whether other points of my choosing are part of the plane too. I'm sure it's fairly easy given the information above, yet at this point I only know that I need two linear equations to describe a 2D plane in 4D, or parametric equations of two variables. I can set them up in theory, but writing the code has been problematic. Perhaps there is a more straightforward way to test this in MATLAB?
You can use the MATLAB function pca (see for example here). For example, you can determine the basis of your plane, the normal vectors to your plane, and a point m on the plane as follows:
coeff = pca(X);
basis = coeff(:,1:2);
normals = coeff(:,3:4);
m = mean(X);
To check whether a point p lies in this plane, it suffices to verify that p - m is orthogonal (dot product equal to zero) to both normal vectors of the plane, using dot.
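A short numpy sketch of that check (the function name and tolerance are assumptions, not part of the answer above; normals is the 4 x 2 matrix from the pca call):
import numpy as np

def in_plane(p, m, normals, tol=1e-10):
    # p lies in the plane iff (p - m) has no component along either normal
    return np.allclose(normals.T @ (p - m), 0, atol=tol)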