Reverse TF-IDF vector (vec2text) - hash

Given a Doc2Vec vector generated for some document, is it possible to reverse the vector back to the original document?
If so, does there exist any hash algorithm that would make the vector irreversible but still comparable to other vectors of the same type (using cosine/Euclidean distance)?

It's unclear why you've mentioned "TF-IDF vector" in your question title, but then asked about a Doc2Vec vector – which is very different from a TF-IDF approach. I'll assume your main interest is Doc2Vec vectors.
In general, a Doc2Vec vector has far too little information to actually reconstruct the document for which the vector was calculated. It's essentially a compressed summary, based on evolving (in training or inference) a vector that's good (within the limits of the model) at predicting the document's words.
For example, one commonly-used dimensionality for Doc2Vec vectors is 300. Those 300 dimensions are each represented by a 4-byte floating-point value. So the vector is 1200 bytes in total - but could be the summary vector for a document of many hundreds or thousands of words, far far larger than 1200 bytes.
It's theoretically plausible that with a Doc2Vec vector, and the associated model from which it was trained or inferred, you could generate a ranked list of the words most likely to be in the document. There's a pending feature request to offer this in Gensim (#2459), but no code implementing it yet. But such a list of words wouldn't be grammatical, and the top 10 words in such a list might not be in the document at all. (It might be entirely made up of other similar words.)
With a large set of calculated vectors, as you get when training of a model has finished, you could take a vector (from that set, or from inferring a new text), and look through the set-of-vectors for whichever one has a vector closest to your query vector. That would point you at one of your known documents - but that's more of a lookup (when you already know many example documents) than reversing a vector into a document directly.
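The lookup described above can be sketched as a cosine-similarity search over the known vectors. This is a minimal illustration, not Gensim's API: the function name `nearest_document` and the toy 3-dimensional vectors are assumptions made up for the example.

```python
import numpy as np

def nearest_document(query, doc_vectors):
    """Return (index, cosine similarity) of the known vector closest to query."""
    doc_vectors = np.asarray(doc_vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    # Cosine similarity of the query against every row of doc_vectors.
    sims = doc_vectors @ query / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query)
    )
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Toy stand-ins for vectors of three known documents (hypothetical values).
known = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
index, similarity = nearest_document([0.9, 0.1, 0.0], known)
```

The result is only an index into documents you already hold, which is why this counts as a lookup rather than a reconstruction.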
You'd have to say more about your need for an 'irreversible' vector that is still good for document-to-document comparisons for me to make further suggestions to meet that need.
To an extent, Doc2Vec vectors already meet that need, as they can't regenerate an exact document. But given that they could generate a list of likely words (per above), if your needs are more rigorous, you might need extra steps. For example, if you used a model to calculate all needed vectors, but then threw away the model, even that theoretical capability to list most-likely words would go away.
But to the extent you still have the vectors, and potentially their mappings to full documents, a vector still implies one, or a few, closest-documents from the known set. And even if you somehow had a novel vector, without its text, simply looking among your known documents that are closest would be highly suggestive (but not dispositive) about what words are in the source document.
(If your needs are very demanding, there might be something in the genre of 'Fully Homomorphic Encryption' and/or 'Private Information Retrieval' that would help. Those use advanced cryptography to allow queries on encrypted data that only reveal final results, hiding the details of what you're doing even from the system answering your query. But those techniques are much newer and more complicated, with few if any sources of ready-to-use code, and adapting them specifically for vector-similarity style calculations might require significant custom advanced-cryptography work.)

Related

Is it possible to use evaluation metrics (like NDCG) as a loss function?

I am working on an Information Retrieval model called DPR, which is basically a neural network (2 BERTs) that ranks documents given a query. Currently, this model is trained in a binary manner (documents are either relevant or not relevant) and uses Negative Log Likelihood (NLL) loss. I want to change this binary behavior and create a model that can handle graded relevance (e.g. 3 grades: relevant, somewhat relevant, not relevant). I have to change the loss function because currently I can only assign 1 positive target for each query (DPR uses PyTorch's NLLLoss), and this is not what I need.
I was wondering if I could use an evaluation metric like NDCG (Normalized Discounted Cumulative Gain) to calculate the loss. I mean, the whole point of a loss function is to tell how far off our prediction is, and NDCG does the same.
So, can I use such metrics in place of a loss function with some modifications? In the case of NDCG, I think something like subtracting the result from 1 (1 - NDCG_score) might be a good loss function. Is that true?
With best regards, Ali.
Yes, this is possible. You would want to apply a listwise learning to rank approach instead of the more standard pairwise loss function.
In pairwise loss, the network is provided with example pairs (rel, non-rel) and the ground-truth label is a binary one (say 1 if the first among the pair is relevant, and 0 otherwise).
In the listwise learning approach, however, during training you would provide a list instead of a pair and the ground-truth value (still a binary) would indicate if this permutation is indeed the optimal one, e.g. the one which maximizes nDCG. In a listwise approach, the ranking objective is thus transformed into a classification of the permutations.
For more details, refer to this paper.
Obviously, the network, instead of taking features as input, may take BERT vectors of the queries and the documents within a list, similar to ColBERT. Unlike ColBERT, where you feed in vectors from 2 docs (pairwise training), for listwise training you need to feed in vectors from, say, 5 documents.
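To make the quantity under discussion concrete, here is a small sketch of nDCG for one list of 5 documents with graded labels, together with the 1 - nDCG value proposed in the question. It assumes the exponential-gain form of DCG, (2^rel - 1) / log2(position + 1); the function names are made up for the example. Because the ranking comes from a sort of the scores, this is the evaluation quantity for a given permutation, not by itself a gradient-ready loss.

```python
import numpy as np

def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    rel = np.asarray(relevances, dtype=float)
    positions = np.arange(1, rel.size + 1)
    return float(np.sum((2.0 ** rel - 1.0) / np.log2(positions + 1)))

def ndcg(scores, relevances):
    """nDCG of the ranking induced by sorting documents by descending score."""
    scores = np.asarray(scores, dtype=float)
    rel = np.asarray(relevances, dtype=float)
    order = np.argsort(-scores)          # ranking produced by the model scores
    ideal_dcg = dcg(np.sort(rel)[::-1])  # best achievable ordering
    if ideal_dcg == 0.0:
        return 0.0
    return dcg(rel[order]) / ideal_dcg

# Graded labels for a list of 5 documents: 2 = relevant, 1 = somewhat, 0 = not.
grades = [2, 1, 0, 0, 1]
good_scores = [0.9, 0.5, 0.1, 0.05, 0.6]   # ranks documents in the ideal order
bad_scores = [0.1, 0.2, 0.9, 0.8, 0.3]     # ranks irrelevant documents first

good_value = 1.0 - ndcg(good_scores, grades)
bad_value = 1.0 - ndcg(bad_scores, grades)
```

The permutation that maximizes nDCG yields a value of 0 for 1 - nDCG, and worse permutations yield larger values.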

MANOVA - huge matrices

First, sorry for the "ANOVA" tag; this is about MANOVA (which has yet to become a tag...)
All the examples in the tutorials I found use small matrices; following them would not be feasible for large ones, as is the case in many studies.
I have 2 matrices for my 14 sampling points: one for the organism IDs (4493 IDs) and the other for the chemical profile (190 variables).
The 2 matrices were correlated using Spearman and, based on the correlation, split into 4 clusters (k-means on the squared Euclidean clustering values), the IDs on the row and chemical profile on line.
The differences among them are somewhat clear, but to support this more robustly I want to perform MANOVA to show the differences between and within the clusters - that is a key factor for the conclusion, of course.
The problem is that, after 8 hours of trying, I could not even input the data in a format acceptable to the analysis.
The tutorials I found are designed for very few variables, and even when I think I have overcome that, the program says that my matrices can't be compared because they differ in length.
Each cluster has its own set of IDs sharing all same set of variables.
What should I do?
Thanks in advance.
Diogo Ogawa
If you have missing values in your data (which practically all data sets seem to contain) you can either remove those observations or you can create a model using those observations. Use the first approach if something about your methodology gives you conviction that there is something different about those observations. Most of the time, it is better to run the model using the missing values. In this case, use the general linear model instead of a balanced ANOVA model. The balanced model will struggle with those missing data.

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are nominal, i.e. unordered (e.g. bank name, account type). Numerical types are, e.g., salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. This has the benefit of allowing you to continue your probably matrix based implementation with this kind of data, but a much simpler way - and appropriate for most distance based methods - is to just use a modified distance function.
There is an infinite number of such combinations. You need to experiment which works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; but it may make sense to also move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
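As one illustration among the many possible combinations, the sketch below adds a range-normalized Euclidean distance on the numeric fields to a scaled count of mismatching categorical fields. The function name, the field names, the ranges, and the `cat_weight` scale are all assumptions made up for the example and would need tuning for real data.

```python
import math

def mixed_distance(a, b, numeric_keys, categorical_keys, ranges, cat_weight=1.0):
    """Normalized Euclidean part on numeric fields plus weighted mismatch count."""
    squared = 0.0
    for key in numeric_keys:
        # Normalization moved into the distance: divide by the field's range.
        diff = (a[key] - b[key]) / ranges[key]
        squared += diff * diff
    # Categorical fields contribute 1 per mismatch, with no implied ordering.
    mismatches = sum(1 for key in categorical_keys if a[key] != b[key])
    return math.sqrt(squared) + cat_weight * mismatches

# Hypothetical records and value ranges.
customer_a = {"age": 30, "salary": 50000, "bank": "A", "account": "savings"}
customer_b = {"age": 40, "salary": 50000, "bank": "B", "account": "savings"}
ranges = {"age": 50, "salary": 100000}

d = mixed_distance(customer_a, customer_b,
                   ["age", "salary"], ["bank", "account"], ranges)
```

Here `d` is 0.2 from the age difference plus 1 for the differing bank; changing `cat_weight` shifts how much a categorical mismatch counts relative to the numeric part.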
In most real application domains of distance based algorithms, this is the most difficult part, optimizing your domain specific distance function. You can see this as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set theoretic measures which may be much more appropriate in your case. For example, Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
The most straightforward way to convert categorical data into numeric is by using indicator vectors. See the reference I posted in my previous comment.
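A minimal sketch of indicator (one-hot) vectors in plain Python follows; the helper name `one_hot` and the bank labels are made up for the example. Unlike mapping bank 1 to 1 and bank 2 to 2, every pair of distinct categories ends up at the same distance from each other.

```python
def one_hot(values):
    """Map each categorical value to an indicator vector over the sorted categories."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one position is set per value
        vectors.append(row)
    return categories, vectors

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

categories, vectors = one_hot(["bank 1", "bank 3", "bank 2", "bank 1"])
```

With this encoding the squared Euclidean distance is 0 between equal categories and 2 between any two different ones.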
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.

Clustering: a training dataset of variable data dimensions

I have a dataset of n data points, where each data point is represented by a set of extracted features. Generally, clustering algorithms need all input data to have the same dimensions (the same number of features), that is, the input data X is an n*d matrix of n data points, each of which has d features.
In my case, I've previously extracted some features from my data, but the number of extracted features is most likely to differ from one data point to another (I mean, I have a dataset X where the data points do not have the same number of features).
Is there any way to adapt them in order to cluster them using common clustering algorithms that require data to be of the same dimensions?
Thanks
Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
Reduce the dimensionality of the input data set using multi-dimensional scaling techniques. For example Sparse SVD (e.g. Lanczos algorithm) or sparse PCA. Then apply traditional clustering on the dense lower dimensional outputs.
Directly apply a sparse clustering algorithm, such as sparse k-means. Note you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note, the exact number of component classifications you choose doesn't really matter as long as it's high enough as the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms such as DBSCAN that in fact do not care about your data. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, which is an extension of DBSCAN, that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering, and is highly domain dependent. If your features are e.g. word occurrences, Cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features with each other efficiently, as there is no order to the features (so one cannot simply compare the first keypoint with the first keypoint, etc.). A possible approach here is to derive another - uniform - set of features. Typically, bag of words features are used for such a situation. For images, this is also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
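The bag-of-words idea can be sketched as follows, assuming each object is an array with a varying number of fixed-length sub-features. The tiny k-means, the vocabulary size of 3, and the random toy data are all assumptions chosen for illustration, not a recommended configuration.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd iterations; returns the k centroids (the 'vocabulary')."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iterations):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def bag_of_words(features, vocabulary):
    """Fixed-length histogram of 'word' counts for one object."""
    labels = assign(features, vocabulary)
    return np.bincount(labels, minlength=len(vocabulary)).astype(float)

def cosine_distance(u, v):
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Three toy objects with 3, 7 and 5 sub-features of dimension 4 each.
rng = np.random.default_rng(1)
objects = [rng.normal(size=(n, 4)) for n in (3, 7, 5)]
vocabulary = kmeans(np.vstack(objects), k=3)
histograms = [bag_of_words(obj, vocabulary) for obj in objects]
```

Every object is now a vector of the same length (the vocabulary size), so the histograms can be compared with `cosine_distance` or passed to any clustering algorithm that expects uniform dimensions.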
I see two options here:
Restrict yourself to those features for which all your data-points have a value.
See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.

Data clustering algorithm

What is the most popular text clustering algorithm which deals with large dimensions and huge dataset and is fast?
I am getting confused after reading so many papers and so many approaches. Now I just want to know which one is used most, to have a good starting point for writing a clustering application for documents.
To deal with the curse of dimensionality you can try to determine the blind sources (i.e. topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.
Latent Semantic Indexing applies a truncated SVD to the term-document matrix, which is closely related to PCA, since PCA can be computed as an SVD of the mean-centered data : )
Remember that you can lose interpretability when you obtain the principal components of your dataset or its factors, so you may want to go the Non-Negative Matrix Factorization route. (And here is the punch! K-Means is a particular NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.
There is no one size fits all approach. Hierarchical clustering is an option always. If you want to have distinct groups formed out of the data, you can go with K-means clustering (it is also supposedly computationally less intensive).
The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster, as it is linear in the number of documents, as opposed to hierarchical clustering, which is quadratic but is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of the dimension corresponding to each word equal to its term frequency-inverse document frequency score. The tf-idf score reduces the importance of high-frequency words in similarity calculation. The cosine similarity is often used as a similarity measure.
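A minimal sketch of that representation in plain Python: it uses raw term counts times log(N / document frequency) as the tf-idf weight, which is one common variant among several, and whitespace tokenization; the function names and the three toy documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, one tf-idf vector per document)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log(n_docs / doc_freq[word]) for word in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] * idf[word] for word in vocabulary])
    return vocabulary, vectors

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stocks fell on the market",
]
vocabulary, vectors = tfidf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
sim_12 = cosine_similarity(vectors[1], vectors[2])
```

The word "the" occurs in every document, so its weight is zero and it contributes nothing to any similarity; documents 1 and 2 share no other word, so `sim_12` is 0.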
A paper comparing experimental results between hierarchical and bisecting k-means, a cousin algorithm to k-means, can be found here.
The simplest approaches to dimensionality reduction in document clustering are: a) throw out all rare and highly frequent words (say occurring in less than 1% or more than 60% of documents: this is somewhat arbitrary, you need to try different ranges for each dataset to see the impact on results), b) stopping: throw out all words in a stop list of common English words: lists can be found online, and c) stemming, or removing suffixes to leave only word roots. The most common stemmer is a stemmer designed by Martin Porter. Implementations in many languages can be found here. Usually, this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA could be used.
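Steps a) and b) can be sketched in a few lines of plain Python; stemming is left out because it needs a stemmer implementation. The function name `prune_vocabulary`, the toy documents, and the thresholds used in the usage lines are assumptions for illustration (the defaults mirror the 1% and 60% figures above).

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=0.01, max_df=0.60, stop_words=frozenset()):
    """Keep words whose document-frequency fraction lies in [min_df, max_df]
    and that are not in the stop list."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for tokens in tokenized_docs for word in set(tokens))
    return {
        word
        for word, count in doc_freq.items()
        if min_df <= count / n_docs <= max_df and word not in stop_words
    }

docs = [
    "the cat and the dog".split(),
    "the cat sat".split(),
    "the dog and the bird".split(),
    "the market fell".split(),
]
# With only 4 toy documents, a higher min_df is used so that words
# appearing in a single document (25%) count as rare.
kept = prune_vocabulary(docs, min_df=0.30, max_df=0.60, stop_words={"and"})
```

Here "the" is dropped as too frequent, "and" by the stop list, and the single-document words as too rare, leaving only "cat" and "dog".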
I would stick with k-medoids, since you can compute the distance from every point to every other point once, at the beginning of the algorithm, and that saves time, especially if there are many dimensions. This algorithm works by choosing as the center of a cluster the member point that is nearest to the other points in it, not a centroid calculated by averaging the points belonging to that cluster. Therefore all possible distance calculations are already done for you in this algorithm.
In the case where you aren't looking for semantic text clustering (I can't tell if this is a requirement or not from your original question), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and subsequently validate your clustering through use of silhouette coefficients. Unfortunately, Levenshtein can be quite slow, but there are ways to speed it up through the use of thresholds and other methods.
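A rough sketch of that combination: a Levenshtein distance matrix computed once up front, then a simple alternating k-medoids loop (assign points to the nearest medoid, then re-pick each medoid as the member with the smallest total distance to its cluster). This is a simplified variant rather than the full PAM algorithm, and the function names, the toy words, and the initial medoids are assumptions for the example.

```python
def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def assign(dist, medoids):
    """Cluster index of the nearest medoid for every point."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, initial_medoids, iterations=10):
    medoids = list(initial_medoids)
    for _ in range(iterations):
        labels = assign(dist, medoids)
        new_medoids = []
        for c, old in enumerate(medoids):
            members = [i for i, label in enumerate(labels) if label == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(old)
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

words = ["kitten", "sitten", "mitten", "flaw", "claw", "law"]
# All pairwise distances are computed once, before clustering starts.
dist = [[levenshtein(a, b) for b in words] for a in words]
medoids, labels = k_medoids(dist, initial_medoids=[0, 3])
```

Only the precomputed matrix `dist` is consulted during clustering, so the expensive Levenshtein calls happen exactly once per pair.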
Another way to deal with the curse of dimensionality would be to find 'contrasting sets': conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions either in lieu of the original attributes or with a restricted number of attributes.