How to find top terms in dbscan or hdbscan clusters? - cluster-analysis

I'm using DBSCAN from sklearn and HDBSCAN to cluster some documents.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

vectorizer = TfidfVectorizer(stop_words=mystopwords)
X = vectorizer.fit_transform(y)
dbscan = DBSCAN(eps=0.75, min_samples=9)
clusters = dbscan.fit_predict(X)
Now how can I get the top terms in each cluster? When using k-means we do something like this:

order_centroids = kmeans_model.cluster_centers_.argsort()[:, ::-1]
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :true_k]:
        print(' %s' % terms[ind])
But DBSCAN and HDBSCAN have no centroids. How can we find the top terms in DBSCAN or HDBSCAN clusters?
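One possible workaround (a sketch only, reusing the variables X, clusters and vectorizer from the snippet above): average the tf-idf rows of each cluster to form a pseudo-centroid and sort its entries. Note that get_feature_names_out() assumes a recent scikit-learn (older versions use get_feature_names()), and that label -1 is the DBSCAN/HDBSCAN noise label, so it is skipped.

import numpy as np

# Pseudo-centroid per cluster: the mean tf-idf vector of its documents.
terms = vectorizer.get_feature_names_out()
for label in sorted(set(clusters)):
    if label == -1:
        continue  # -1 marks noise points, not a cluster
    mask = clusters == label
    centroid = np.asarray(X[mask].mean(axis=0)).ravel()  # dense mean tf-idf row
    top = centroid.argsort()[::-1][:10]                  # indices of the 10 heaviest terms
    print("Cluster %d: %s" % (label, ", ".join(terms[i] for i in top)))

Summing the tf-idf weights per cluster instead of averaging them is a common variant; both are heuristics, since density-based clusters have no model centroid.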

Related

Can there be overlap in k-means clusters?

I am unclear about why k-means clustering can have overlap in clusters. From Chen (2018) I saw the following definition:
"..let the observations be a sample set to be partitioned into K disjoint clusters"
However I see an overlap in my plots, and am not sure why this is the case.
For reference, I am trying to cluster a multi-dimensional dataset with three variables (Recency, Frequency, Revenue). To visualize clustering, I can project 3D data into 2D using PCA and run k-means on that. Below is the code and plot I get:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df1 = tx_user[["Recency", "Frequency", "Revenue"]]

# standardize
names = df1.columns
scaler = preprocessing.StandardScaler()
scaled_df1 = scaler.fit_transform(df1)
df1 = pd.DataFrame(scaled_df1, columns=names)
df1.head()
del scaled_df1

sklearn_pca = PCA(n_components=2)
X1 = sklearn_pca.fit_transform(df1)
X1 = X1[:, ::-1]  # flip axes for better plotting

kmeans = KMeans(3, random_state=0)
labels = kmeans.fit(X1).predict(X1)
plt.scatter(X1[:, 0], X1[:, 1], c=labels, s=40, cmap='viridis');
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)

    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    #ax.set_ylim(-5000,7000)
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)

    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))

kmeans = KMeans(n_clusters=4, random_state=0)
plot_kmeans(kmeans, X1)
My question is:
1. Why is there an overlap? Is my clustering wrong if there is?
2. How does k-means decide cluster assignment in case there is an overlap?
Thank you
Reference:
Chen, L., Xu, Z., Wang, H., & Liu, S. (2018). An ordered clustering algorithm based on K-means and the PROMETHEE method. International Journal of Machine Learning and Cybernetics, 9(6), 917-926.
K-means computes k clusters by average approximation. Each cluster is defined by its computed center and is therefore unique by definition.
Each sample is assigned to the cluster whose center is closest, which is also unique by definition. So in this sense there is NO OVERLAP.
However, for a given distance d > 0 a sample may lie within distance d of more than one cluster center. This is what you are seeing when you say there is overlap. The sample is still assigned only to the closest cluster, not to all of them, so there is no overlap.
NOTE: If a sample is exactly equidistant from more than one cluster center, it can be assigned to any of the closest clusters at random; this changes nothing important in the algorithm or its results, since the centers are re-computed after assignment.
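To illustrate the assignment rule with a tiny, made-up example (not the questioner's data): a point can lie within some distance d of two centers at once, yet it still receives exactly one label, that of the nearest center.

import numpy as np

# Two cluster centers and one point that is within d = 1.0 of both of them.
centers = np.array([[0.0, 0.0], [1.5, 0.0]])
point = np.array([0.7, 0.5])

dists = np.linalg.norm(centers - point, axis=1)  # distance to each center
print(dists)           # about [0.86, 0.94]: inside d of both centers
print(dists.argmin())  # 0 -> the point is assigned only to the closest center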
The k-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra-cluster data points as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
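That objective, the within-cluster sum of squared distances, can be spelled out in a few lines. This is only a sketch on synthetic data; the value it computes should match scikit-learn's inertia_ attribute up to floating-point error.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D data, just to make the objective concrete.
rng = np.random.RandomState(0)
data = rng.rand(200, 2)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Sum of squared distances from each point to its assigned centroid.
wcss = sum(np.sum((data[km.labels_ == i] - c) ** 2)
           for i, c in enumerate(km.cluster_centers_))
print(wcss, km.inertia_)  # the two numbers agree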
Perhaps you did something wrong... I don't have your data, so I can't test it. You can add boundaries, and check those. See the sample code below.
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi

def voronoi_finite_polygons_2d(vor, radius=None):
    """
    Reconstruct infinite Voronoi regions in a 2D diagram to finite
    regions.

    Parameters
    ----------
    vor : Voronoi
        Input diagram
    radius : float, optional
        Distance to 'points at infinity'.

    Returns
    -------
    regions : list of tuples
        Indices of vertices in each revised Voronoi region.
    vertices : list of tuples
        Coordinates for revised Voronoi vertices. Same as coordinates
        of input vertices, with 'points at infinity' appended to the
        end.
    """
    if vor.points.shape[1] != 2:
        raise ValueError("Requires 2D input")

    new_regions = []
    new_vertices = vor.vertices.tolist()

    center = vor.points.mean(axis=0)
    if radius is None:
        radius = vor.points.ptp().max() * 2

    # Construct a map containing all ridges for a given point
    all_ridges = {}
    for (p1, p2), (v1, v2) in zip(vor.ridge_points, vor.ridge_vertices):
        all_ridges.setdefault(p1, []).append((p2, v1, v2))
        all_ridges.setdefault(p2, []).append((p1, v1, v2))

    # Reconstruct infinite regions
    for p1, region in enumerate(vor.point_region):
        vertices = vor.regions[region]

        if all([v >= 0 for v in vertices]):
            # finite region
            new_regions.append(vertices)
            continue

        # reconstruct a non-finite region
        ridges = all_ridges[p1]
        new_region = [v for v in vertices if v >= 0]

        for p2, v1, v2 in ridges:
            if v2 < 0:
                v1, v2 = v2, v1
            if v1 >= 0:
                # finite ridge: already in the region
                continue

            # Compute the missing endpoint of an infinite ridge
            t = vor.points[p2] - vor.points[p1]  # tangent
            t /= np.linalg.norm(t)
            n = np.array([-t[1], t[0]])  # normal

            midpoint = vor.points[[p1, p2]].mean(axis=0)
            direction = np.sign(np.dot(midpoint - center, n)) * n
            far_point = vor.vertices[v2] + direction * radius

            new_region.append(len(new_vertices))
            new_vertices.append(far_point.tolist())

        # sort region counterclockwise
        vs = np.asarray([new_vertices[v] for v in new_region])
        c = vs.mean(axis=0)
        angles = np.arctan2(vs[:, 1] - c[1], vs[:, 0] - c[0])
        new_region = np.array(new_region)[np.argsort(angles)]

        # finish
        new_regions.append(new_region.tolist())

    return new_regions, np.asarray(new_vertices)

# make up data points
np.random.seed(1234)
points = np.random.rand(15, 2)

# compute Voronoi tesselation
vor = Voronoi(points)

# plot
regions, vertices = voronoi_finite_polygons_2d(vor)
print("--")
print(regions)
print("--")
print(vertices)

# colorize
for region in regions:
    polygon = vertices[region]
    plt.fill(*zip(*polygon), alpha=0.4)

plt.plot(points[:, 0], points[:, 1], 'ko')
plt.axis('equal')
plt.xlim(vor.min_bound[0] - 0.1, vor.max_bound[0] + 0.1)
plt.ylim(vor.min_bound[1] - 0.1, vor.max_bound[1] + 0.1)
Great resource here.
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html

Divide area into N convex fields efficiently

I am trying to generate a set of points, where groups of m points are evenly distributed over a large area. I have solved the problem (solution below), but I am looking for a more elegant or at least faster solution.
Say we have 9 points we want to place in groups of 3 in an area specified by x=[0,5] and y=[0,5]. Then I first generate a mesh in this area
meshx = 0:0.01:5;
meshy = 0:0.01:5;
[X,Y] = meshgrid(meshx,meshy);
X = X(:); Y = Y(:);
Then to place the 9/3=3 groups evenly I apply kmeans clustering
idx = kmeans([X,Y],3);
Then for each cluster, I can now draw a random sample of 3 points, which I save to a list:
pos = zeros(9,2);
for i = 1:max(idx)
    spaceX = X(idx==i);
    spaceY = Y(idx==i);
    %on = convhulln([spaceX,spaceY]);
    %plot(spaceX(on),spaceY(on),'black')
    %hold on
    sample = datasample([spaceX,spaceY],3,1);
    %plot(sample(:,1),sample(:,2),'black*')
    %hold on
    pos((i-1)*3+1:i*3,:) = sample;
end
If you uncomment the commented lines, the code also plots the clusters and the locations of the points within them. As mentioned, my main goal is to avoid clustering a rather fine uniform grid, to make the code more efficient.
Instead of kmeans you can use atan2:

x = -10:10;
idx = ceil((bsxfun(@atan2, x, x.') + pi) * (3/(2*pi)));
imshow(idx,[])

Matlab code for finding cluster centres in hierarchical clustering

I am trying to find the cluster centers in hierarchical clustering. Below is the code I use, but it returns only the cluster number for each of the observations.
c = clusterdata(input,'linkage','ward','savememory','off','maxclust',10);
I am dealing with multi-dimensional data (32 dimensions). Any ideas or code would be very helpful.
It really depends on how you define "center", but since you're going with hierarchical clustering, I'm assuming you don't have a parametric model for the distributions of your clusters. The code below simply computes the barycenter (mean) of all points in each cluster.
[n,p] = size(input);
labels = clusterdata(input,'linkage','ward','savememory','off','maxclust',10);
centers = zeros(10,p);
for i = 1:10
    centers(i,:) = mean( input( labels == i, : ) );
end

What data of images are given to kmeans clustering in matlab?

I have 100 images in my database. I am using those 100 images as both the training set and the test images. I have to make 5 clusters. I am using eigenfaces (PCA) for feature extraction. What data should be given to the kmeans command in Matlab?
Syntax for the kmeans command:

[IDX,C] = kmeans(X,k)

1. What is the X value?
2. Do we have to give the Euclidean distance as input?
3. Do we have to give a weight vector of the input images?

Please explain in detail.
Source code I tried:
X = []
srcFiles = dir('C:\Users\rahul\Desktop\tomorow\*.jpg'); % the folder in which your images exist
for i = 1 : length(srcFiles)
    filename = strcat('C:\Users\rahul\Desktop\tomorow\',srcFiles(b).name);
    Imgdata = imread(filename);
    X(:, i) = princomp(Imgdata);
end
[idx, c] = kmeans(X, 5)
Error I am getting:
Index exceeds matrix dimensions.
Error in pca (line 4)
filename =strcat('C:\Users\rahul\Desktop\tomorow\',srcFiles(b).name);
The PCA function you are using (I don't know what it is exactly) produces a vector of n numbers. This vector describes the picture, and it is what needs to be given to the k-means algorithm.
First of all, run the PCA for all 100 images, producing an n-by-100 matrix.
X = []
for i = 1 : 100
    X(:, i) = PCA(picture...)
end
If PCA returns a row vector instead of a column vector, you need
X(:, i) = PCA(picture)'
The k-means function takes this matrix as a parameter, as well as the number k of clusters. So
[idx, c] = kmeans(X, 5);
The distance used for clustering is Euclidean by default. If you want a different distance metric, you can supply it as a parameter. See the table here for the available distance metrics.
Finally, the standard k-means algorithm is not weighted, so you can't supply weights to the vectors.

kmeans on fisherIris data

I have the following script for kmeans in Matlab:
load fisheriris
k = 3;
clusterIndex = kmeans(meas,3);
scatter(meas(:,1),meas(:,2),[],clusterIndex, 'filled')
How to plot the centroids of each group?
Please help!
Straight from the docs:
[IDX,C] = kmeans(X,k) returns the k cluster centroid locations in the k-by-p matrix C.
So in your case simply do this:
[clusterIndex, centroids] = kmeans(meas,3);
By the way, you might like gscatter; it will colour your clusters nicely for you.