I use scikit-learn to cluster my data, and wish to evaluate the results.
I wonder if there is a built-in function that calculates TP, TN, FP, and FN over pairs of documents, as explained in Introduction to Information Retrieval, Ch. 16, p. 359 (http://nlp.stanford.edu/IR-book/pdf/16flat.pdf)?
Thanks,
Alon
Have a look at the sklearn.metrics.cluster package, and sklearn.metrics.adjusted_rand_score.
I don't know offhand whether they expose the 2×2 pair matrix directly, but there is functionality to compute some of the most popular evaluation metrics.
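If your scikit-learn is recent enough (0.24+), sklearn.metrics.cluster.pair_confusion_matrix exposes exactly that 2×2 matrix; on older versions the pair counts are also easy to compute directly. A minimal sketch (the function name is my own):

```python
from itertools import combinations

def pair_counts(labels_true, labels_pred):
    """Count document pairs as in IIR Ch. 16: a pair is a 'positive'
    decision when the clustering puts both documents together."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

truth = [0, 0, 0, 1, 1, 1]   # gold classes
pred  = [0, 0, 1, 1, 1, 1]   # cluster assignments
tp, fp, fn, tn = pair_counts(truth, pred)   # -> (4, 3, 2, 6)
precision = tp / (tp + fp)   # pairwise precision
recall    = tp / (tp + fn)   # pairwise recall
```

Note that, if I recall correctly, pair_confusion_matrix counts ordered pairs, so its entries come out as twice these unordered counts.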
I am working on an information retrieval model called DPR, which is basically a neural network (two BERTs) that ranks documents given a query. Currently, this model is trained in a binary manner (documents are either relevant or not relevant) and uses negative log-likelihood (NLL) loss. I want to change this binary behavior and create a model that can handle graded relevance (say, three grades: relevant, somewhat relevant, not relevant). I have to change the loss function because currently I can only assign one positive target per query (DPR uses PyTorch's NLLLoss), and that is not what I need.
I was wondering if I could use an evaluation metric like NDCG (Normalized Discounted Cumulative Gain) to calculate the loss. After all, the whole point of a loss function is to tell how far off our prediction is, and NDCG does the same.
So, can I use such a metric in place of a loss function, with some modifications? In the case of NDCG, I think something like subtracting the result from 1 (1 - NDCG_score) might be a good loss function. Is that true?
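To make that concrete, here is a minimal sketch of the metric I mean (plain Python, my own toy implementation, using the common 2^rel - 1 gain):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of graded labels in ranked order."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances))

def ndcg(relevances):
    """NDCG: DCG of this ranking divided by the best achievable DCG."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded labels (2 = relevant, 1 = somewhat, 0 = not), listed in the
# order the model ranked the documents:
loss = 1 - ndcg([2, 0, 1, 1, 0])   # small when the ranking is good
```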
With best regards, Ali.
Yes, this is possible. You would want to apply a listwise learning to rank approach instead of the more standard pairwise loss function.
In pairwise loss, the network is provided with example pairs (rel, non-rel), and the ground-truth label is binary (say, 1 if the first of the pair is relevant, and 0 otherwise).
In the listwise learning approach, however, during training you provide a list instead of a pair, and the ground-truth value (still binary) indicates whether this permutation is the optimal one, e.g. the one that maximizes nDCG. In a listwise approach, the ranking objective is thus transformed into a classification of permutations.
For more details, refer to this paper.
Obviously, instead of taking features as input, the network may take BERT vectors of the query and of the documents within a list, similar to ColBERT. Unlike ColBERT, where you feed in vectors from two documents (pairwise training), for listwise training you need to feed in vectors from, say, five documents.
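One concrete listwise objective that handles graded labels is ListNet's top-one cross entropy: turn both the model scores and the graded labels into distributions over the list, then compare them. A plain-Python sketch (in practice this would be a few lines of PyTorch on top of DPR's query-document scores; the names and numbers are illustrative):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def listnet_loss(scores, relevances):
    """ListNet top-one cross entropy: compare the softmax of model
    scores against a target distribution built from graded labels.
    Unlike NLLLoss with a single positive index, this accepts any
    number of (graded) relevant documents per query."""
    target = softmax(relevances)   # graded labels -> distribution
    pred = softmax(scores)         # model scores  -> distribution
    return -sum(t * math.log(p) for t, p in zip(target, pred))

# One query, five candidate documents, grades 2 / 1 / 0:
scores = [4.1, 0.2, 2.3, 1.7, -0.5]   # e.g. BERT dot products
grades = [2.0, 0.0, 1.0, 1.0, 0.0]
loss = listnet_loss(scores, grades)
```

The loss is lowest when the score ordering matches the grade ordering, which is exactly the behavior you want from a graded-relevance objective.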
I'm looking for a clustering method that can find an interpretable optimal subspace. For example, if I have a dataset consisting of features [feature_1, feature_2, ..., feature_n], then after clustering I would get both a clustering result and a subspace [feature_3, feature_6, ..., feature_9], where the subspace explains why the members of each cluster belong together.
I've tried SubKMeans. It is similar to PCA: it can find an optimal subspace, but it transforms the original dataset. Because of that transformation I can't map the result back to the original features (which is the subspace I need). So my question is: is there a clustering method that can find this subspace?
Yes. If you google a little bit you will find plenty of subspace clustering methods that select features relevant for the cluster.
See Wikipedia.
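If you just need a quick interpretability check rather than a dedicated subspace clustering algorithm, one simple heuristic is to rank the original features by how much tighter a cluster is on each of them than the dataset as a whole. A sketch (all names and the toy data are my own, purely illustrative):

```python
def feature_relevance(data, labels, cluster_id):
    """Rank features for one cluster: a feature is 'relevant' when the
    cluster's members are much more concentrated on it than the full
    dataset (small within-cluster / overall variance ratio)."""
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    members = [row for row, lab in zip(data, labels) if lab == cluster_id]
    ratios = []
    for f in range(len(data[0])):
        total = variance([row[f] for row in data])
        within = variance([row[f] for row in members])
        ratios.append(within / total if total > 0 else 1.0)
    # Features listed first best explain why the cluster holds together
    return sorted(range(len(data[0])), key=lambda f: ratios[f])

# Cluster 0 is tight on feature 0 but spread out on feature 1:
data = [[0.0, 5.0], [0.1, 1.0], [0.05, 9.0],
        [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
labels = [0, 0, 0, 1, 1, 1]
ranked = feature_relevance(data, labels, 0)   # -> [0, 1]
```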
There are a lot of validity indices for clustering, but they are just for numeric data. What about cluster validity for mixed data (numeric and categorical)?
The same way, mostly.
You obviously can't use inertia, but anything that is distance-based (and doesn't use the cluster mean) will work with the distance you used for clustering, e.g. Silhouette.
Unfortunately, the distance functions for such data are not very trustworthy in my opinion. So good luck, and triple check all results before using them, as you may have non-meaningful results that only look good when condensed to this single score number.
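To make that concrete, here is a small illustrative sketch: a Gower-style mixed distance (range-normalized numerics, 0/1 categorical mismatch) plugged into a plain-Python Silhouette. The names and toy data are mine, and the O(n²) loops are for clarity, not speed:

```python
def gower_distance(a, b, numeric_idx, ranges):
    """Gower-style distance for mixed records: numeric features use a
    range-normalised absolute difference, categorical features a 0/1
    mismatch. A common convention, with the caveats noted above."""
    total = 0.0
    for f in range(len(a)):
        if f in numeric_idx:
            total += abs(a[f] - b[f]) / ranges[f] if ranges[f] else 0.0
        else:
            total += 0.0 if a[f] == b[f] else 1.0
    return total / len(a)

def silhouette(data, labels, dist):
    """Mean Silhouette over all points, for any distance function."""
    scores = []
    for i, (x, lab) in enumerate(zip(data, labels)):
        same = [dist(x, y) for j, (y, l) in enumerate(zip(data, labels))
                if l == lab and j != i]
        if not same:
            continue
        a = sum(same) / len(same)                      # cohesion
        b = min(                                       # separation
            sum(dist(x, y) for y, l in zip(data, labels) if l == other) /
            sum(1 for l in labels if l == other)
            for other in set(labels) if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy mixed data: (numeric measurement, categorical colour)
data = [(1.0, 'red'), (1.2, 'red'), (9.0, 'blue'), (9.5, 'blue')]
labels = [0, 0, 1, 1]
score = silhouette(data, labels,
                   lambda a, b: gower_distance(a, b, {0}, {0: 8.5}))
```

The clusters here are well separated on both the numeric and the categorical feature, so the score lands near 1.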
Well, I have been studying the different algorithms used for clustering, like k-means, k-medoids, etc., and I was trying to run the algorithms and analyze their performance on the Leaf dataset right here:
http://archive.ics.uci.edu/ml/datasets/Leaf
I was able to cluster the dataset via k-means by first reading the CSV file, filtering out unneeded attributes, and applying k-means to it. The problem I am facing is that I wish to calculate measures such as entropy, precision, recall, and F-measure for the model developed via k-means. Is there an operator available that allows me to do this, so that I can quantitatively compare the different clustering algorithms available in RapidMiner?
P.S. I know about performance operators like Performance (Classification) that allow me to calculate precision and recall for a model, but I don't know of any that allow me to calculate entropy.
Help would be much appreciated.
The short answer is to use R. Here's a link to a book chapter about this very subject. There is a revised version coming soon that works for the most recent version of RapidMiner.
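RapidMiner aside, the entropy measure itself is simple to compute once you export the cluster assignments and true classes. A small sketch (in Python here; the function name is mine, and the same logic ports directly to an R script) of the standard cluster entropy, i.e. the size-weighted entropy of the class distribution inside each cluster:

```python
import math
from collections import Counter

def cluster_entropy(labels_true, labels_pred):
    """Size-weighted average entropy of the class distribution inside
    each cluster; 0 means every cluster is pure."""
    n = len(labels_true)
    total = 0.0
    for cluster in set(labels_pred):
        counts = Counter(t for t, p in zip(labels_true, labels_pred)
                         if p == cluster)
        size = sum(counts.values())
        h = -sum((c / size) * math.log2(c / size)
                 for c in counts.values())
        total += (size / n) * h
    return total

# Perfect clustering -> entropy 0; fully mixed clusters -> entropy 1
perfect = cluster_entropy(['a', 'a', 'b', 'b'], [0, 0, 1, 1])   # 0.0
mixed   = cluster_entropy(['a', 'b', 'a', 'b'], [0, 0, 1, 1])   # 1.0
```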
I have a large dataset of multidimensional data (240 dimensions).
I am a beginner at data mining and I want to apply Linear Discriminant Analysis using MATLAB. I have seen a lot of functions explained on the web, but I do not understand how they should be applied.
Basically, I want to apply LDA.
After this step I want to be able to do a reconstruction for my data.
I can do this manually, but I was wondering if there are any predefined functions which can do this because they should already be optimized.
My initial data is something like: size(x) = [2000 240]. So basically I have 240 features (dimensions) and 2000 data points. And I want to perform LDA on this data set.
The function classify from Statistics Toolbox does Linear (and, if you set some options, Quadratic) Discriminant Analysis. There are a couple of worked examples in the documentation that explain how it should be used: type doc classify or showdemo classdemo to see them.
240 features is quite a lot given that you only have 2000 observations, even if you have only two classes. You might want to apply a dimension reduction method before LDA, such as PCA (see doc princomp) or use a feature selection method (see doc sequentialfs for one such method).
You can use fitcdiscr for classification using LDA in MATLAB R2014a and later.