AffinityPropagation with Sentence Transformar not converging - cluster-analysis

I'm trying to cluster a text dataset using Sentence Transformer and Affinity Propagation but I keep getting "ConvergenceWarning: Affinity propagation did not converge, this model will not have any cluster centers." and, therefore, no clustering. I'm trying to figure it out for some time but couldn't make it work and there is few documentation online for text clustering using this algorithm.
from sklearn.cluster import AffinityPropagation
import numpy as np
model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
sentence_embeddings = model.encode(texts)
affprop = AffinityPropagation()
af = affprop.fit(sentence_embeddings)
cluster_centers_indices = af.cluster_centers_indices_
len(cluster_centers_indices) # this line returns zero
Did someone get this same problem and have some workaround to suggest?

Related

I want to make cluster of sentences but now i don't know that how many cluster will be made

I have calculated the embedding with the help of doc2vec and I have also calculated the distance between sentences in vector form. now I have a vector of sentences that tells the distance between them(sentences). how can I cluster them without giving the number of clusters? I have used k-means and agglomerative algo but they are not giving me good results. can anybody tell me the best method to determine the optimal number of clusters?
Try this. If it doesn't do what you want, I have a few other code samples to share. This may be the best option. The best option to use, can change, based on the dataset that you feed into the algo.
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance
words = "kitten belly squooshy merley best eating google feedback face extension impressed map feedback google eating face extension climbing key".split(" ") #Replace this line
words = np.asarray(words) #So that indexing with a list will work
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
cluster_str = ", ".join(cluster)
print(" - *%s:* %s" % (exemplar, cluster_str))
Result:

Applying k-folds (stratified 10-fold) to my text classification model

I need help in the following, I have a data frame with Columns: Class (0,1) and text.
After cleansing (lemmatizing, removing stopwords, etc), I split the data like the following:
#splitting datset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset['text_lemm_nostop'], dataset['class'],test_size=0.3, random_state=50)
Then I used n-gram:
#using n-gram
from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(min_df=5,ngram_range=(2,2), max_features=5000).fit(X_train)
print('No. of features:')
len(vect.get_feature_names_out()) # how many features
Then I did the vectorizing:
X_train_vectorized=vect.transform(X_train)
X_test_vectorized=vect.transform(X_test)
Then I started applying ML algorithms like (logistic regression, Naive Bayes, RF,..etc)
and I will share only the logistic regression
#Logistic regression
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(X_train_vectorized, y_train)
from sklearn.metrics import classification_report, accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn import metrics
predictions=model.predict(vect.transform(X_test))
print("AUC Score is: ", roc_auc_score(y_test, predictions),'\n')
print("Accuracy:",metrics.accuracy_score(y_test, predictions),'\n')
print('Classification Report:\n',classification_report(y_test,predictions))
My Questions:
1. Is what I am doing is fine in case I am going with normal splitting (30% test)?
(I feel having issues with n-gram code!)
2. If I want to engage the K-fold cross-validation (ie. Stratified 10-fold), how could I do that in my code?
Appreciate any help and support!!
A few comments:
Generally I don't see any obvious mistake, seems fine to me.
Do you really mean to use only bigrams (n=2)? This implies that you don't use unigrams (single words), so you might miss some information this way. It's more common to use ngram_range=(1,2) for instance. It's probably worth trying unigrams only, at least to compare the performance.
To perform 10-fold cross-validation, I'd suggest you use KFold because it's the simplest method. In this case it's almost the same code, except that you insert it inside a loop which iterates the 10 folds. Every time you fit the model on a specific training set (90%) and evaluate it on the remaining 10% test set. Both training and test set are obtained by selecting the corresponding indexes indexes returned by KFold.

Feature Selection in Multivariate Linear Regression

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
a = make_regression(n_samples=300,n_features=5,noise=5)
df1 = pd.DataFrame(a[0])
df1 = pd.concat([df1,pd.DataFrame(a[1].T)],axis=1,ignore_index=True)
df1.rename(columns={0:"X1",1:"X2",2:"X3",3:"X4",4:"X5",5:"Target"},inplace=True)
sns.heatmap(df1.corr(),annot=True);
Correlation Matrix
Now I can ask my question. How can I choose features that will be included in the model?
I am not that well-versed in python as I use R most of the time.
But it should be something like this:
# Create a model
model = LinearRegression()
# Call the .fit method and pass in your data
model.fit(Variables,Target)
# Or simply do
model = LinearRegression().fit(Variables,Target)
# So based on the dataset head provided, it should be
X<-df1[['X1','X2','X3','X4','X5']]
Y<-df1['Target']
model = LinearRegression().fit(X,Y)
In order to do feature selections. You need to run the model first. Then check for the p-value. Typically, a p-value of 5% (.05) or less is a good cut-off point. If the p-value crosses the upper threshold of .05, the variable is insignificant and you can remove it from your model. You will have to do this manually. You can also tell by looking from the correlation matrix to see which value has less correlation to the target. AFAIK, there are no libs with built-in functionality to do feature selection automatically. In the end, statistics are just numbers. It is up to humans to interpret the results.

How to use FaceNet and DBSCAN on multiple embeddings identities?

I have the following setting:
A surveillance system take photos of people's faces (there are a varying number of photos for each person).
I run FaceNet for each photo and get a list of embedding vectors for each person (each person is represented by a list of embeddings, not by a single one).
The problem:
I want to cluster observed people using DBSCAN, but I need to guarantee that face embeddings from the same people go to the same cluster (remember we can have multiple photos of the same people, and we already know they must belong to the same cluster).
One solution could be to get a "mean" or average embedding for each person, but I believe this data loss is going to produce bad results.
Another solution could be to concatenate N embeddings (with N constant) in a single vector and pass that 512xN vector to DBSCAN, but the problem with this is that the order in which the embeddings are appended to this vector is going to produce different results.
Anyone has faced this same problem?
deepface wraps facenet face recognition model. The regular face recognition process is shown below.
#!pip install deepface
from deepface import DeepFace
my_set = [
["img1.jpg", "img2.jpg"],
["img1.jpg", "img3.jpg"],
]
obj = DeepFace.verify(my_set, model_name = 'Facenet')
for i in obj:
print(i["distance"])
If you need the embeddings generated by facenet, you can adopt deepface as well.
from deepface.commons import functions
from deepface.basemodels import Facenet
model = Facenet.loadModel()
#this detects and aligns faces. Facenet expects 160x160x3 shaped inputs.
img1 = functions.preprocess_face("img1.jpg", target_size = (160, 160))
img2 = functions.preprocess_face("img2.jpg", target_size = (160, 160))
#this finds embeddings for images
img1_embedding = model.predict(img1)
img2_embedding = model.predict(img2)
Embeddings will be 128 dimensional vectors for Facenet. You can run any clustering algorithm to embeddings. I have applied k-means for this kind of a study. I haven't any experience about dbscan but you can apply it if you have the embeddings.
Besides, you can adopt different face recognition models within deepface such as vgg-face, openface, facebook deepface, deepid and dlib.

Do I have to preprocess test data using neural networks?

I am using Keras (version 2.0.0) and I'd like to make use of pretrained models like e.g. VGG16.
In order to get started, I ran the example of the [Keras documentation site ][https://keras.io/applications/] for extracting features with VGG16:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
model = VGG16(weights='imagenet', include_top=False)
img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
The used preprocess_input() function bothers me
(the function does Zero-centering by mean pixel what can be seen by looking at the source code).
Do I really have to preprocess input data (validation/test data) before using a trained model?
a)
If yes, one can conclude that you always have to be aware of what preprocessing steps have been performed during training phase?!
b)
If no: Does preprocessing of validation/test data cause a bias?
I appreciate your help.
Yes you should use the preprocessing step. You can retrain the model without it but the first layers will learn to center your datas so this is a waste of parameters.
If you do not recenter your performances will suffer.
Great thread on reddit : https://www.reddit.com/r/MachineLearning/comments/3q7pjc/why_is_removing_the_mean_pixel_value_from_each/