Detecting patterns in a dataset using a clustering algorithm - cluster-analysis

I have a dataset that contains a cross pattern. How can I filter out this cross-like pattern?
I tried DBSCAN, but it didn't work effectively. I also can't use any clustering algorithm that requires specifying the number of clusters, since the data cleaning needs to be automated.

Sorry, forget SVM, that's for classification. Honestly, I didn't read your question carefully the first time; I just re-read what you posted originally. Try Mean Shift, which detects the optimal number of clusters automatically. Here's an example. Hopefully you can adapt it to your specific use.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1], [1, -1], [1, -1]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.2)
# #############################################################################
# Compute clustering with MeanShift
# The following bandwidth can be automatically detected using estimate_bandwidth
bandwidth = estimate_bandwidth(X, quantile=0.6, n_samples=5000)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
labels_unique = np.unique(labels)
n_clusters_ = len(labels_unique)
print("number of estimated clusters : %d" % n_clusters_)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Or, try this.
import numpy as np
import pandas as pd
from sklearn.cluster import MeanShift
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import make_blobs
# We will be using the make_blobs method
# in order to generate our own data.
clusters = [[2, 2, 2], [7, 7, 7], [5, 13, 13]]
X, _ = make_blobs(n_samples=150, centers=clusters,
                  cluster_std=0.60)
# After training the model, We store the
# coordinates for the cluster centers
ms = MeanShift()
ms.fit(X)
cluster_centers = ms.cluster_centers_
# Finally We plot the data points
# and centroids in a 3D graph.
fig = plt.figure()
ax = fig.add_subplot(111, projection ='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], marker ='o')
ax.scatter(cluster_centers[:, 0], cluster_centers[:, 1],
           cluster_centers[:, 2], marker='x', color='red',
           s=300, linewidth=5, zorder=10)
plt.show()
There are a few clustering methodologies that help you choose the optimal number of clusters automatically. Check out the link below for some ideas on how to move forward with your project.
https://scikit-learn.org/stable/modules/clustering.html
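For instance, OPTICS (listed on that page) also does not need the number of clusters up front. Here is a minimal sketch on toy blobs, not tuned for your cross-shaped data; the min_samples and xi values are illustrative assumptions only.
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs
# Toy data standing in for the real dataset.
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=1000, centers=centers, cluster_std=0.2)
# OPTICS infers the cluster structure from point density; no n_clusters needed.
optics = OPTICS(min_samples=20, xi=0.05)
labels = optics.fit_predict(X)
# Label -1 marks points considered noise, which could be filtered out automatically.
print("estimated clusters:", len(set(labels)) - (1 if -1 in labels else 0))
print("points flagged as noise:", np.sum(labels == -1))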

Related

Is the inplace operation with scipy gaussian_filter1d safe?

Here is the sample code I wrote to examine this issue.
It can be seen that in this case we get the same result, but I want to know if it is safe to compute in place with other options (scipy version, arguments, ...).
import numpy as np
from scipy.ndimage import gaussian_filter1d
X = np.random.normal(0, 1, size=[64, 1024, 2048])
OPX = X.copy()
for axis, sigma in zip([-2, -1], [3, 7]):
    gaussian_filter1d(OPX, sigma, axis, output=OPX)
OPY, OPZ = X.copy(), X.copy()
for axis, sigma in zip([-2, -1], [3, 7]):
    gaussian_filter1d(OPY, sigma, axis, output=OPZ)
    OPY, OPZ = OPZ, OPY
(OPX == OPY).all() # True
python 3.7.15
scipy 1.7.3
numpy 1.21.6
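Not part of the original question, but a small sketch of how the same comparison could be repeated across a few more argument combinations (the sigma and mode values here are arbitrary illustrations, and a bit-for-bit match on one machine is still not a general guarantee):
import numpy as np
from scipy.ndimage import gaussian_filter1d
X = np.random.normal(0, 1, size=[8, 128, 256])  # smaller array for a quick check
for mode in ["reflect", "constant", "nearest", "mirror", "wrap"]:
    for sigma in [1, 3, 7]:
        inplace = X.copy()
        gaussian_filter1d(inplace, sigma, axis=-1, mode=mode, output=inplace)
        reference = gaussian_filter1d(X, sigma, axis=-1, mode=mode)
        # compare the in-place result against a freshly allocated output
        print(mode, sigma, np.array_equal(inplace, reference))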

Spatial Points Outlier Clustering Method

I would like to implement unsupervised clustering to detect grids (vertical/horizontal lines) of spatial points.
I have tried DBSCAN and it gives subpar results. It is able to pick out the grids, as seen in red below:
However, it is not able to completely pick out all the points that form the vertical/horizontal lines, and if I relax the epsilon parameter, it incorrectly classifies more points as noise (e.g. the bottom left of the picture).
I was wondering if there is a modified version of DBSCAN that uses ellipses instead of circles? Or any other clustering method recommended for this that does not need the number of clusters to be prespecified?
Or is there a better method to identify the points that make up the grid? Any help is appreciated.
You can use an anisotropic DBSCAN by rescaling your data this way: an anisotropy value < 1 will find vertical clusters and a value > 1 will find horizontal clusters, as in the two runs below.
from sklearn.cluster import DBSCAN
def anisotropical_DBSCAN(X, anisotropy, eps, min_samples):
    """ANIsotropic DBSCAN clustering: some documentation would be nice here :)
    returns the fitted DBSCAN estimator"""
    # Note: this rescales the second column of X in place.
    X[:, 1] = X[:, 1]*anisotropy
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    return db
Here is a full example with data:
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(
    n_samples=750, centers=centers, cluster_std=0.4, random_state=0
)
print(X.shape)
def anisotropical_DBSCAN(X, anisotropy, eps, min_samples):
    """ANIsotropic DBSCAN clustering: some documentation would be nice here :)
    returns the fitted DBSCAN estimator"""
    X[:, 1] = X[:, 1]*anisotropy
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
    return db
db = anisotropical_DBSCAN(X, anisotropy = 0.1, eps = 0.1, min_samples = 10)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = labels == k
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(
        xy[:, 0],
        xy[:, 1],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="k",
        markersize=14,
    )
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(
        xy[:, 0],
        xy[:, 1],
        "o",
        markerfacecolor=tuple(col),
        markeredgecolor="k",
        markersize=6,
    )
plt.title("Estimated number of clusters: %d" % n_clusters_)
You get vertical clusters:
Now change the parameters to db = anisotropical_DBSCAN(X, anisotropy = 10, eps = 1, min_samples = 10). I had to change the eps value because the horizontal and vertical scales aren't the same here, but in your case you should be able to keep the same (eps, min_samples) for detecting both kinds of lines.
And you get horizontal clusters:
There are also existing implementations of anisotropic DBSCAN that are probably a lot cleaner: https://github.com/gissong/ADCN
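A rough sketch of how the two passes could be combined into a single grid mask (my own addition, not part of the original answer). It reuses the anisotropical_DBSCAN function from above and assumes X still holds the original, unscaled coordinates; any point kept by either pass is treated as part of the grid:
import numpy as np
def grid_points_mask(X, eps=0.1, min_samples=10):
    # Pass copies, because anisotropical_DBSCAN rescales its input in place.
    vertical = anisotropical_DBSCAN(X.copy(), anisotropy=0.1, eps=eps, min_samples=min_samples)
    horizontal = anisotropical_DBSCAN(X.copy(), anisotropy=10, eps=eps, min_samples=min_samples)
    # DBSCAN labels noise as -1; keep points that belong to a cluster in either pass.
    return (vertical.labels_ != -1) | (horizontal.labels_ != -1)
grid_mask = grid_points_mask(X)
print("points assigned to grid lines:", grid_mask.sum(), "of", len(X))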

Why is k-means better at clustering than topic modelling algorithms like LDA?

I want to know about the advantages of k-means for clustering essays to discover their topics. There are a lot of algorithms to do this, such as k-medoids, x-means, LDA, LSA, etc. Please give me a full description of the reasons to select the k-means algorithm.
I don't think you can draw parallels between all these things. I would highly recommend doing some well-defined Googling on your side and coming back here with a more refined question, or questions. In the meantime, I'll share with you what little I know about these topics. First, let's look at PCA & LDA...
import numpy as np
import pandas as pd
# Importing the Dataset
#url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
#names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
#dataset = pd.read_csv(url, names=names)
dataset = pd.read_csv('C:\\your_path_here\\iris.csv')
# PRINCIPAL COMPONENT ANALYSIS
X = dataset.drop('species', axis=1)
y = dataset['species']
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# As mentioned earlier, PCA performs best with a normalized feature set. We will perform standard scalar normalization to normalize our feature set. To do this, execute the following code:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.decomposition import PCA
pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Performance Evaluation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[11 0 0]
[ 0 12 1]
[ 0 1 5]]
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))
Accuracy 0.9333333333333333
# Results with 2 & 3 principal components
#from sklearn.decomposition import PCA
#pca = PCA(n_components=5)
#X_train = pca.fit_transform(X_train)
#X_test = pca.transform(X_test)
# https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
# LINEAR DISCRIMINANT ANALYSIS
# Data Preprocessing
# Once dataset is loaded into a pandas data frame object, the first step is to divide dataset into features and corresponding labels and then divide the resultant dataset into training and test sets. The following code divides data into labels and feature set:
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature Scaling
# As was the case with PCA, we need to perform feature scaling for LDA too. Execute the following script to do so:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1)
X_train = lda.fit_transform(X_train, y_train)
X_test = lda.transform(X_test)
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(max_depth=2, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
[[11 0 0]
[ 0 13 0]
[ 0 0 6]]
print('Accuracy ' + str(accuracy_score(y_test, y_pred)))
Result:
Accuracy 1.0
# https://stackabuse.com/implementing-lda-in-python-with-scikit-learn/
Does that make sense? Hopefully it does. Now, let's move on to KMeans and PCA...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
dataset = pd.read_csv('C:\\your_path_here\\iris.csv')
# PRINCIPAL COMPONENT ANALYSIS
X = dataset.drop('species', axis=1)
y = dataset['species']
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
ax = sns.scatterplot(x="sepal_length", y="sepal_width", hue="sepal_length", data=dataset)
ax = sns.scatterplot(x="petal_length", y="petal_width", hue="petal_length", data=dataset)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
# ordinarily, when you don't have the actual labels, you might use
# silhouette analysis to determine a good number of clusters k to use.
# i.e. you would just run that same code for different values of k and print the value for
# the silhouette score.
# let's see what that value is for the case we just did, k=3.
from sklearn import metrics
score = metrics.silhouette_score(X_scaled, y_cluster_kmeans)
score
# Result:
# 0.45994823920518646
# note that this is the mean over all the samples - there might be some clusters
# that are well separated and others that are closer together.
# so let's look at the distribution of silhouette scores...
scores = metrics.silhouette_samples(X_scaled, y_cluster_kmeans)
sns.distplot(scores);
# so you can see that the blue species have higher silhouette scores
# (the legend doesn't show the colors though... so the pandas plot is more useful).
# note that if we used the best mean silhouette score to try to find the best
# number of clusters k, we'd end up with 2 clusters, because the mean silhouette
# score in that case would be largest, since the clusters would be better separated.
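# (sketch added for illustration, not from the original write-up: the loop described
# above, printing the mean silhouette score for a few candidate values of k)
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, random_state=seed).fit_predict(X_scaled)
    print(k, metrics.silhouette_score(X_scaled, labels_k))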
# but, that's using k-means - gmm might give better results...
# so that was clustering on the original 4d data.
# if you have a lot of features it can be helpful to do some feature reduction
# to avoid the curse of dimensionality (i.e. needing exponentially more data
# to do accurate predictions as the number of features grows).
# you can do this with Principal Component Analysis (PCA), which remaps the data
# to a new (smaller) coordinate system which tries to account for the
# most information possible.
# you can *also* use PCA to visualize the data by reducing the
# features to 2 dimensions and making a scatterplot.
# it kind of mashes the data down into 2d, so can lose
# information - but in this case it's just going from 4d to 2d,
# so not losing too much info.
# so let's just use it to visualize the data...
# mash the data down into 2 dimensions
from sklearn.decomposition import PCA
ndimensions = 2
pca = PCA(n_components=ndimensions, random_state=seed)
pca.fit(X_scaled)
X_pca_array = pca.transform(X_scaled)
X_pca = pd.DataFrame(X_pca_array, columns=['PC1','PC2']) # PC=principal component
X_pca.sample(5)
# Result:
PC1 PC2
90 0.279078 -1.120029
26 -2.051151 0.242164
83 1.061095 -0.633843
135 2.798770 0.856803
54 1.075475 -0.208421
# so that gives us new 2d coordinates for each data point.
# at this point, if you don't have labelled data,
# you can add the k-means cluster ids to this table and make a
# colored scatterplot.
# we do actually have labels for the data points, but let's imagine
# we don't, and use the predicted labels to see what the predictions look like.
df_plot = X_pca.copy()
df_plot['ClusterKmeans'] = y_cluster_kmeans
df_plot['SpeciesId'] = pd.factorize(y)[0] # also add numeric ids for the actual labels so we can use them in later plots
df_plot.sample(5)
# Result:
PC1 PC2 ClusterKmeans SpeciesId
132 1.862703 -0.178549 0 2
85 0.429139 0.845582 0 1
139 1.852045 0.676128 0 2
33 -2.446177 2.150728 1 0
147 1.521170 0.269069 0 2
# so now we can make a 2d scatterplot of the clusters
# first define a plot fn
def plotData(df, groupby):
    "make a scatterplot of the first two principal components of the data, colored by the groupby field"
    # make a figure with just one subplot.
    # you can specify multiple subplots in a figure,
    # in which case ax would be an array of axes,
    # but in this case it'll just be a single axis object.
    fig, ax = plt.subplots(figsize=(7, 7))
    # color map
    cmap = mpl.cm.get_cmap('prism')
    # we can use pandas to plot each cluster on the same graph.
    # see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
    for i, cluster in df.groupby(groupby):
        cluster.plot(ax=ax,  # need to pass this so all scatterplots are on same graph
                     kind='scatter',
                     x='PC1', y='PC2',
                     color=cmap(i/(nclusters-1)),  # cmap maps a number to a color
                     label="%s %i" % (groupby, i),
                     s=30)  # dot size
    ax.grid()
    ax.axhline(0, color='black')
    ax.axvline(0, color='black')
    ax.set_title("Principal Components Analysis (PCA) of Iris Dataset")
# plot the clusters each datapoint was assigned to
plotData(df_plot, 'ClusterKmeans')
# so those are the *predicted* labels - what about the *actual* labels?
plotData(df_plot, 'SpeciesId')
# so the k-means clustering *did not* find the correct clusterings!
# q. so what do these dimensions mean?
# they're the principal components, which pick out the directions
# of maximal variation in the original data.
# PC1 finds the most variation, PC2 the second-most.
# the rest of the data is basically thrown away when the data is reduced down to 2d.
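To tie this back to the actual question about essays, here is a minimal sketch (my own, not from the write-up above) contrasting the two usual routes on toy documents: k-means on TF-IDF vectors versus Latent Dirichlet Allocation on raw counts. The tiny corpus and parameter values are made up purely for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market fell today",
    "investors sold shares and stocks",
]
# k-means: each essay gets exactly one cluster id, and k must be chosen up front.
tfidf = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, random_state=0).fit(tfidf)
print("k-means cluster per document:", km.labels_)
# LDA: each essay gets a distribution over topics rather than a single hard label.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print("LDA topic mixture per document:")
print(np.round(lda.transform(counts), 2))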

How to compute the accuracy of AgglomerativeClustering

Hi, I use the Python sample for AgglomerativeClustering. I try to estimate its performance, but it switches the original labels.
I try to compare the predicted labels y_hc with the original labels y returned by make_blobs.
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt
data,y = make_blobs(n_samples=300, n_features=2, centers=4, cluster_std=2, random_state=50)
plt.figure(2)
# create dendrogram
dendrogram = sch.dendrogram(sch.linkage(data, method='ward'))
plt.title('dendrogram')
# create clusters linkage="average", affinity=metric , linkage = 'ward' affinity = 'euclidean'
hc = AgglomerativeClustering(n_clusters=4, linkage="average", affinity='euclidean')
# save clusters for chart
y_hc = hc.fit_predict(data,y)
plt.figure(3)
# create scatter plot
plt.scatter(data[y==0,0], data[y==0,1], c='red', s=50)
plt.scatter(data[y==1, 0], data[y==1, 1], c='black', s=50)
plt.scatter(data[y==2, 0], data[y==2, 1], c='blue', s=50)
plt.scatter(data[y==3, 0], data[y==3, 1], c='cyan', s=50)
plt.xlim(-15,15)
plt.ylim(-15,15)
plt.scatter(data[y_hc ==0,0], data[y_hc == 0,1], s=10, c='red')
plt.scatter(data[y_hc==1,0], data[y_hc == 1,1], s=10, c='black')
plt.scatter(data[y_hc ==2,0], data[y_hc == 2,1], s=10, c='blue')
plt.scatter(data[y_hc ==3,0], data[y_hc == 3,1], s=10, c='cyan')
for ii in range(4):
    print(ii)
    i0 = y_hc == ii
    counts = np.bincount(y[i0])
    valCountAtorgLbl = np.argmax(counts)
    accuracy0Tp = 100*np.max(counts)/y[y == valCountAtorgLbl].shape[0]
    accuracy0Fp = 100*np.min(counts)/y[y == valCountAtorgLbl].shape[0]
    print([accuracy0Tp, accuracy0Fp])
plt.show()
The clustering does not and cannot reproduce the original labels, only the original partitions.
You seem to assume that cluster 1 corresponds to label 1 (in fact one label could be 'iris setosa', and there obviously is no way an unsupervised algorithm would come up with that cluster name...). It usually won't - there probably isn't the same number of clusters and classes either, and there could be unlabeled noise points. You can use the Hungarian algorithm to compute the optimum mapping (or just a greedy matching) to produce a more intuitive color mapping.
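A minimal sketch of that Hungarian-algorithm mapping (my own addition), using scipy.optimize.linear_sum_assignment on the contingency table between the true labels y and the predicted labels y_hc from the code above; it assumes the numbers of clusters and classes match, as they do in this example:
import numpy as np
from scipy.optimize import linear_sum_assignment
def match_labels(y_true, y_pred):
    # Contingency table: rows = predicted clusters, columns = true classes.
    n = max(y_true.max(), y_pred.max()) + 1
    table = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        table[p, t] += 1
    # Hungarian algorithm: maximise the total overlap between clusters and classes.
    rows, cols = linear_sum_assignment(-table)
    mapping = dict(zip(rows, cols))
    return np.array([mapping[p] for p in y_pred])
y_mapped = match_labels(y, y_hc)
print("accuracy after mapping:", np.mean(y_mapped == y))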

Grid search for multi-class classification problems using neural networks

I'm trying to do a grid search for a multi-class problem with neural networks.
I am not able to get the optimum parameters; the kernel keeps on compiling.
Is there any problem with my code? Please do help.
import keras
from keras.models import Sequential
from keras.layers import Dense
# defining the baseline model:
def neural(output_dim=10, init_mode='glorot_uniform'):
    model = Sequential()
    model.add(Dense(output_dim=output_dim,
                    input_dim=2,
                    activation='relu',
                    kernel_initializer=init_mode))
    model.add(Dense(output_dim=output_dim,
                    activation='relu',
                    kernel_initializer=init_mode))
    model.add(Dense(output_dim=3, activation='softmax'))
    # Compile model
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
estimator = KerasClassifier(build_fn=neural,
                            epochs=5,
                            batch_size=5,
                            verbose=0)
# define the grid search parameters
batch_size = [10, 20, 40, 60, 80, 100]
epochs = [10, 50, 100]
init_mode = ['uniform', 'lecun_uniform', 'normal', 'zero',
             'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform']
output_dim = [10, 15, 20, 25, 30, 40]
param_grid = dict(batch_size=batch_size,
                  epochs=epochs,
                  output_dim=output_dim,
                  init_mode=init_mode)
grid = GridSearchCV(estimator=estimator,
                    scoring='accuracy',
                    param_grid=param_grid,
                    n_jobs=-1, cv=5)
grid_result = grid.fit(X_train, Y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_,
grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
There's no error in your code.
Your current param grid has 864 different possible combinations of parameters:
(6 values in 'batch_size' × 3 values in 'epochs' × 8 in 'init_mode' × 6 in 'output_dim') = 864
GridSearchCV will iterate over all those possibilities, and your estimator will be cloned that many times. That is then repeated 5 times because you have set cv=5.
So your model will be cloned (compiled, with its params set according to each possibility) a total of 864 x 5 = 4320 times.
That is why you keep seeing in the output that the model is being compiled that many times.
To check if GridSearchCV is working or not, use its verbose param.
grid = GridSearchCV(estimator=estimator,
                    scoring='accuracy',
                    param_grid=param_grid,
                    n_jobs=1, cv=5, verbose=3)
This will print the parameter combination currently being tried, the cv iteration, the time taken to fit it, the current accuracy, etc.
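As a quick sanity check (my own sketch, not part of the original answer), you can count the combinations yourself with sklearn's ParameterGrid before launching the search:
from sklearn.model_selection import ParameterGrid
param_grid = dict(batch_size=[10, 20, 40, 60, 80, 100],
                  epochs=[10, 50, 100],
                  init_mode=['uniform', 'lecun_uniform', 'normal', 'zero',
                             'glorot_normal', 'glorot_uniform', 'he_normal', 'he_uniform'],
                  output_dim=[10, 15, 20, 25, 30, 40])
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)      # 864 parameter combinations
print(n_combinations * 5)  # 4320 fits with cv=5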