I want to cluster sentences, but I don't know in advance how many clusters there will be - cluster-analysis

I have calculated sentence embeddings with doc2vec and I have also calculated the pairwise distances between the sentence vectors, so I now have a matrix of distances between the sentences. How can I cluster them without specifying the number of clusters? I have used k-means and agglomerative clustering, but they are not giving me good results. Can anybody tell me the best method to determine the optimal number of clusters?

Try this. If it doesn't do what you want, I have a few other code samples to share, but this may be the best option. Which option works best can change based on the dataset that you feed into the algorithm.
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance  # the "distance" package from PyPI, used here for Levenshtein distance

words = "kitten belly squooshy merley best eating google feedback face extension impressed map feedback google eating face extension climbing key".split(" ")  # Replace this line
words = np.asarray(words)  # So that indexing with a list will work
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Result: one line per cluster, printed as " - *exemplar:* member1, member2, ..." (the exact grouping depends on the input words).
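Since the question is about doc2vec sentence vectors rather than raw words, here is a minimal adaptation sketch (my own, assuming you already have an array of sentence embeddings) that feeds a precomputed similarity matrix to the same algorithm:
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

sentence_vectors = np.random.rand(20, 50)  # placeholder for your doc2vec sentence vectors

similarity = cosine_similarity(sentence_vectors)  # precomputed affinity matrix
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(similarity)
print("number of clusters found:", len(affprop.cluster_centers_indices_))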

Related

AffinityPropagation with Sentence Transformer not converging

I'm trying to cluster a text dataset using Sentence Transformers and Affinity Propagation, but I keep getting "ConvergenceWarning: Affinity propagation did not converge, this model will not have any cluster centers." and, therefore, no clustering. I've been trying to figure it out for some time but couldn't make it work, and there is little documentation online for text clustering using this algorithm.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AffinityPropagation
import numpy as np

model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
sentence_embeddings = model.encode(texts)  # texts is my list of documents
affprop = AffinityPropagation()
af = affprop.fit(sentence_embeddings)
cluster_centers_indices = af.cluster_centers_indices_
len(cluster_centers_indices)  # this line returns zero
Has anyone run into this same problem and found a workaround to suggest?
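Not a guaranteed fix, but as a hedged sketch: the parameters that usually influence convergence in scikit-learn's AffinityPropagation are damping, max_iter and convergence_iter (and optionally preference). The values below are only a starting point, not a recommendation from the original thread:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AffinityPropagation

texts = ["first sentence", "second sentence", "third sentence"]  # placeholder data
model = SentenceTransformer('sentence-transformers/paraphrase-xlm-r-multilingual-v1')
sentence_embeddings = model.encode(texts)

# Higher damping and more iterations give the message-passing updates more room to settle
affprop = AffinityPropagation(damping=0.9, max_iter=1000, convergence_iter=50)
af = affprop.fit(sentence_embeddings)
print(len(af.cluster_centers_indices_))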

How to use FaceNet and DBSCAN with multiple embeddings per identity?

I have the following setting:
A surveillance system takes photos of people's faces (there is a varying number of photos for each person).
I run FaceNet for each photo and get a list of embedding vectors for each person (each person is represented by a list of embeddings, not by a single one).
The problem:
I want to cluster observed people using DBSCAN, but I need to guarantee that face embeddings from the same people go to the same cluster (remember we can have multiple photos of the same people, and we already know they must belong to the same cluster).
One solution could be to get a "mean" or average embedding for each person, but I believe this data loss is going to produce bad results.
Another solution could be to concatenate N embeddings (with N constant) into a single vector and pass that 512xN vector to DBSCAN, but the problem with this is that the order in which the embeddings are appended to this vector will produce different results.
Has anyone faced this same problem?
deepface wraps the Facenet face recognition model. The regular face recognition process is shown below.
#!pip install deepface
from deepface import DeepFace

my_set = [
    ["img1.jpg", "img2.jpg"],
    ["img1.jpg", "img3.jpg"],
]

obj = DeepFace.verify(my_set, model_name = 'Facenet')
for i in obj:
    print(i["distance"])
If you need the embeddings generated by Facenet, you can use deepface for that as well.
from deepface.commons import functions
from deepface.basemodels import Facenet
model = Facenet.loadModel()
#this detects and aligns faces. Facenet expects 160x160x3 shaped inputs.
img1 = functions.preprocess_face("img1.jpg", target_size = (160, 160))
img2 = functions.preprocess_face("img2.jpg", target_size = (160, 160))
#this finds embeddings for images
img1_embedding = model.predict(img1)
img2_embedding = model.predict(img2)
Embeddings will be 128-dimensional vectors for Facenet. You can run any clustering algorithm on the embeddings. I have applied k-means for this kind of study. I don't have any experience with DBSCAN, but you can apply it once you have the embeddings.
Besides, you can use different face recognition models within deepface, such as VGG-Face, OpenFace, Facebook DeepFace, DeepID and Dlib.
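As a minimal sketch of the clustering step (not part of the original answer), assuming you stack the Facenet embeddings produced above into a single array:
import numpy as np
from sklearn.cluster import DBSCAN

# Stack the 128-dimensional embeddings from above; in practice there would be
# one row per detected face, across all photos.
embeddings = np.vstack([img1_embedding, img2_embedding])

# eps is data-dependent and usually needs tuning; cosine distance is also a common choice
clustering = DBSCAN(eps=10.0, min_samples=2, metric="euclidean").fit(embeddings)
print(clustering.labels_)  # -1 marks noise, other integers are cluster ids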

PyTorch - how to undersample using WeightedRandomSampler

I have an unbalanced dataset and would like to undersample the class that is overrepresented. How do I go about it? I would like to use WeightedRandomSampler, but I am also open to other suggestions.
So far I am assuming that my code will have to be structured kind of like the following, but I don't know exactly how to do it.
trainset = datasets.ImageFolder(path_train, transform=transform)
...
sampler = data.WeightedRandomSampler(weights=..., num_samples=..., replacement=...)
...
trainloader = data.DataLoader(trainset, batch_size=batch_size, sampler=sampler)
I hope someone can help. Thanks a lot
From my understanding, PyTorch's WeightedRandomSampler 'weights' argument is somewhat similar to numpy.random.choice's 'p' argument, which is the probability that a sample will get randomly selected. PyTorch uses the weights to randomly sample training examples, and the docs state that the weights don't have to sum to 1, which is why it's not exactly like numpy's random choice. The stronger the weight, the more likely that sample will get sampled.
When you have replacement=True, it means that training examples can be drawn more than once, so you can have copies of training examples in your train set that get used to train your model; oversampling. Conversely, if the weights of some samples are low compared to the other training sample weights, those samples have a lower chance of being selected for random sampling; undersampling.
I have no clue how the num_samples argument works when using it with the train loader, but I can warn you NOT to put your batch size there. Today, I tried putting the batch size and it gave horrible results. My co-worker put the number of classes * 100 and his results were much better. All I know is that you should not put the batch size there. I also tried putting the size of all my training data for num_samples and it had better results, but took forever to train. Either way, play around with it and see what works best for you. I would guess that the safe bet is to use the number of training examples for the num_samples argument.
Here's the example I saw somebody else use, and I use it as well for binary classification. It seems to work just fine. You take the inverse of the number of training examples for each class and assign each training example the weight of its class.
A quick example using your trainset object:
import numpy as np
from torch.utils import data
from torch.utils.data import WeightedRandomSampler

labels = np.array(trainset.samples)[:, 1]  # turn to array and take all of column index 1, which are the labels
labels = labels.astype(int)  # change to int

majority_weight = 1 / num_of_majority_class_training_examples
minority_weight = 1 / num_of_minority_class_training_examples

sample_weights = np.array([majority_weight, minority_weight])  # This assumes that your minority class is the integer 1 in the labels object. If not, switch places so it's minority_weight, majority_weight.
weights = sample_weights[labels]  # this goes through each training example and uses the labels 0 and 1 as the index into sample_weights, which holds the weight you want for that class

sampler = WeightedRandomSampler(weights=weights, num_samples=len(weights), replacement=True)  # number of training examples, as suggested above
trainloader = data.DataLoader(trainset, batch_size=batch_size, sampler=sampler)
Since the PyTorch doc says that the weights don't have to sum to 1, I think you can also just use the ratio between the imbalanced classes. For example, if you had 100 training examples of the majority class and 50 training examples of the minority class, it would be a 2:1 ratio. To counterbalance this, I think you can just use a weight of 1.0 for each majority class training example and a weight of 2.0 for all minority class training examples, because technically you want the minority class to be 2 times more likely to be selected, which would balance your classes during random selection.
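A small sketch of that ratio idea, with made-up class counts (100 majority, 50 minority) purely for illustration:
import numpy as np
from torch.utils.data import WeightedRandomSampler

labels = np.array([0] * 100 + [1] * 50)  # hypothetical: 100 majority, 50 minority examples
class_weights = np.array([1.0, 2.0])     # minority examples are twice as likely to be drawn
weights = class_weights[labels]          # one weight per training example

sampler = WeightedRandomSampler(weights=weights, num_samples=len(weights), replacement=True)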
I hope this helped a little bit. Sorry for the sloppy writing; I was in a huge rush and saw that nobody had answered. I struggled through this myself without being able to find any help for it either. If it doesn't make sense, just say so and I'll re-edit it and make it clearer when I get free time.
Based on torchdata (disclaimer: I'm the author) one can create a custom undersampler.
First, the _Equalizer base class, which:
creates multiple RandomSubsetSamplers (one for each class)
based on the passed function ("min" or "max" over the class sizes) behaves as an undersampler or an oversampler
Code:
import builtins
import torch
from torch.utils.data import Sampler
# RandomSubsetSampler is provided by torchdata; the exact import path may vary between versions
from torchdata.samplers import RandomSubsetSampler


class _Equalizer(Sampler):
    def __init__(self, labels: torch.tensor, function):
        if len(labels.shape) > 1:
            raise ValueError(
                "labels can only have a single dimension (N, ), got shape: {}".format(
                    labels.shape
                )
            )
        tensors = [
            torch.nonzero(labels == i, as_tuple=False).flatten()
            for i in torch.unique(labels)
        ]
        self.samples_per_label = getattr(builtins, function)(map(len, tensors))
        self.samplers = [
            iter(
                RandomSubsetSampler(
                    tensor,
                    replacement=len(tensor) < self.samples_per_label,
                    num_samples=self.samples_per_label
                    if len(tensor) < self.samples_per_label
                    else None,
                )
            )
            for tensor in tensors
        ]

    @property
    def num_samples(self):
        return self.samples_per_label * len(self.samplers)

    def __iter__(self):
        for _ in range(self.samples_per_label):
            for index in torch.randperm(len(self.samplers)).tolist():
                yield next(self.samplers[index])

    def __len__(self):
        return self.num_samples
Now we can create the undersampler (I've added the oversampler as well, since it is really short):
class RandomUnderSampler(_Equalizer):
    def __init__(self, labels: torch.tensor):
        super().__init__(labels, "min")


class RandomOverSampler(_Equalizer):
    def __init__(self, labels):
        super().__init__(labels, "max")
Just pass your labels to __init__ (they have to be 1D, but can be binary or multi-class) and you can under/oversample your data.
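A minimal usage sketch (my own illustration, assuming the classes above and a small labelled tensor dataset):
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical imbalanced dataset: 100 examples of class 0, 50 of class 1
features = torch.randn(150, 10)
labels = torch.cat([torch.zeros(100, dtype=torch.long), torch.ones(50, dtype=torch.long)])
dataset = TensorDataset(features, labels)

sampler = RandomUnderSampler(labels)  # with "min", this yields 50 indices per class
loader = DataLoader(dataset, batch_size=16, sampler=sampler)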

Do I have to preprocess test data when using neural networks?

I am using Keras (version 2.0.0) and I'd like to make use of pretrained models like e.g. VGG16.
In order to get started, I ran the example from the Keras documentation site (https://keras.io/applications/) for extracting features with VGG16:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
model = VGG16(weights='imagenet', include_top=False)
img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
The preprocess_input() function used here bothers me (the function does zero-centering by the mean pixel, as can be seen by looking at the source code).
Do I really have to preprocess input data (validation/test data) before using a trained model?
a)
If yes, one can conclude that you always have to be aware of what preprocessing steps have been performed during the training phase?!
b)
If no: Does preprocessing of validation/test data cause a bias?
I appreciate your help.
Yes, you should use the preprocessing step. You could retrain the model without it, but then the first layers would just learn to center your data, which is a waste of parameters.
If you do not recenter, your performance will suffer.
Great thread on reddit : https://www.reddit.com/r/MachineLearning/comments/3q7pjc/why_is_removing_the_mean_pixel_value_from_each/
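To make this concrete, here is a rough sketch (my own, based on the Keras source for the default 'caffe' mode) of what preprocess_input does for VGG16, and why the identical step must be applied to any validation/test image before model.predict():
import numpy as np

# ImageNet per-channel means used by Keras for VGG16 (BGR order)
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68])

def manual_vgg16_preprocess(x):
    x = x[..., ::-1]              # convert RGB to BGR
    return x - IMAGENET_MEAN_BGR  # zero-center each channel by the ImageNet mean

test_img = np.random.rand(1, 224, 224, 3) * 255.0  # stands in for a loaded test image
test_img = manual_vgg16_preprocess(test_img)       # the same step as for training data
# features = model.predict(test_img)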

Scikit-Learn's DPGMM fitting: number of components?

I'm trying to fit a mixed normal model to some data using scikit-learn's DPGMM algorithm. One of the advantages advertised in [0] is that I don't need to specify the number of components, which is good, because I do not know the number of components in my data. The documentation states that I only need to specify an upper bound. However, it looks very much like that is not true:
>>> import numpy
>>> data = numpy.random.normal(loc=0.0, scale=1.0, size=1000)
>>> from sklearn.mixture import DPGMM
>>> d = DPGMM(n_components=5)
>>> d.fit(data.reshape(-1, 1))
DPGMM(alpha=1.0, covariance_type='diag', init_params='wmc', min_covar=None,
   n_components=5, n_iter=10, params='wmc', random_state=None, thresh=None,
   tol=0.001, verbose=0)
>>> d.n_components
5
>>> d.means_
array([[-0.02283383],
       [ 0.06259168],
       [ 0.00390097],
       [ 0.02934676],
       [-0.05533165]])
As you can see, the fitting reports five components (the upper bound) even for data clearly sampled from just one normal distribution.
Am I doing something wrong? Did I misunderstand something?
Thanks a lot in advance,
Lukas
[0] http://scikit-learn.org/stable/modules/mixture.html#dpgmm
I recently had similar doubts about the results of this DPGMM implementation. If you check the provided example, you will notice that DPGMM always returns a model with n_components components; the trick is to remove the redundant components afterwards. This can be done with the predict function.
Unfortunately, this important piece is hidden in a comment in the code example.
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant components
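As a small sketch of that trick (my own, reusing the d and data objects from the question): any component that predict() never assigns a point to can be treated as redundant.
import numpy as np

assignments = d.predict(data.reshape(-1, 1))  # hard assignment for every sample
used_components = np.unique(assignments)      # components that actually win points
print("components actually used:", used_components)
print("their means:", d.means_[used_components].ravel())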
Perhaps look at using an improved sklearn solution for this kind of problem, namely a Bayesian Gaussian Mixture. With this model, the suggested prior number of components must be given, but once trained, the model assigns weightings to each component, which essentially indicate their relevance. Here is a pretty cool visual demo of BGMM in action.
Once you have experimented with training a few BGMMs on your data, you can get a feel for a sensible estimate to the number of components for your given problem.
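A quick sketch of that idea (my own, mirroring the data from the question): n_components is only an upper bound, and components with near-zero weights_ can be treated as irrelevant.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

data = np.random.normal(loc=0.0, scale=1.0, size=1000)
bgmm = BayesianGaussianMixture(
    n_components=5,                                      # upper bound, not the final count
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
)
bgmm.fit(data.reshape(-1, 1))
print(np.round(bgmm.weights_, 3))  # most of the weight should land on a single component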