Latent Dirichlet Allocation and Analyzing Two Data Sets using MALLET - mallet

I am currently analyzing two datasets. Dataset A has about 600000+ documents whereas Dataset B has about 7000+ documents. Does this mean that the topic outputs will be more about Dataset A because it has a larger N? The output of mallet in Rapidminer still accounts for which documents fall under each topic. I wonder if there is a way to make the two datasets be interpreted with equal weights?

I am assuming you're mixing the two documents in the training corpus altogether and peform the training. Under this assumption, then it is very likely that the topic outputs will be more about dataset "coming" from A rather than B, as the Gibbs sampling would construct topics according to the co-occurence of tokens which most likely falls from A as well. However inter-topics or similarity of topic across two datasets overlaps is also possible.
You can sample document A instead so that it has same number of documents as B, assuming their topics structure is not that different. Or, you can check the log output from --output-state parameter to see exactly the assigned topic (z) for each token.

Related

Is it possible to use evaluation metrics (like NDCG) as a loss function?

I am working on a Information Retrieval model called DPR which is a basically a neural network (2 BERTs) that ranks document, given a query. Currently, This model is trained in binary manners (documents are whether related or not related) and uses Negative Log Likelihood (NLL) loss. I want to change this binary behavior and create a model that can handle graded relevance (like 3 grades: relevant, somehow relevant, not relevant). I have to change the loss function because currently, I can only assign 1 positive target for each query (DPR uses pytorch NLLLoss) and this is not what I need.
I was wondering if I could use a evaluation metric like NDCG (Normalized Discounted Cumulative Gain) to calculate the loss. I mean, the whole point of a loss function is to tell how off our prediction is and NDCG is doing the same.
So, can I use such metrics in place of loss function with some modifications? In case of NDCG, I think something like subtracting the result from 1 (1 - NDCG_score) might be a good loss function. Is that true?
With best regards, Ali.
Yes, this is possible. You would want to apply a listwise learning to rank approach instead of the more standard pairwise loss function.
In pairwise loss, the network is provided with example pairs (rel, non-rel) and the ground-truth label is a binary one (say 1 if the first among the pair is relevant, and 0 otherwise).
In the listwise learning approach, however, during training you would provide a list instead of a pair and the ground-truth value (still a binary) would indicate if this permutation is indeed the optimal one, e.g. the one which maximizes nDCG. In a listwise approach, the ranking objective is thus transformed into a classification of the permutations.
For more details, refer to this paper.
Obviously, the network instead of taking features as input may take BERT vectors of queries and the documents within a list, similar to ColBERT. Unlike ColBERT, where you feed in vectors from 2 docs (pairwise training), for listwise training u need to feed in vectors from say 5 documents.

Determining Data Homogeneity/Heterogeneity Using Clustering

I have a sequential data (i.e., that comes one instance per time). I want to determine for an amount of instances accumulated (after a while), if they are stochastic (i.e., sparse), or homogeneous (i.e., there is some correlation).
To do this I am using a sequential K-means. First, two cluster's centers are given, and the data is sequentially clustered into two classes. After a while, if I observed that the data is sparse between the two clusters, then, I say that is stochastic. However, if I observed that the data is mostly accumulated in one cluster (e.g., 70% of the data), then I say that the data is homogeneous.
Is my thinking correct?

Clustering cancer gene expression data at the same time?

I'm working at gene expression data clustering techniques and I have downloaded 35 datasets from web,
We have 35 datasets that each of them represents a type of cancer. Each dataset has its own features. Some of these datasets are shared in several features, and some of them do not share anything from the viewpoint of features.
My question is, how do we ultimately cluster these data, while many of them do not have the same characteristics?
I think that we do the clustering operation on all 35 datasets at the same time.
Is my idea correct?
any help is appreciated.
I assume that when you say heterogenous it'll be things like different gene expression platforms where different genes are present.
You can use any clustering technique, but you'll need to write your own distance metric that takes into account heterogeneity within your dataset. For instance, you could use the correlation of all the genes that are in common between pairwise samples, create a distance matrix from this, then use something like hierarchical clustering on this distance matrix
I think there is no need to write your own distance metric. There already exists plenty of distance metrics that can work for mixed data types. For instance, the gower distance works well for mixed data type. See this post on the same. But if your data contains only continuous values then you can use k-means. You'll also be better off, if the data is preprocessed first.

Word2vec model query

I trained a word2vec model on my dataset using the word2vec gensim package. My dataset has about 131,681 unique words but the model outputs a vector matrix of shape (47629,100). So only 47,629 words have vectors associated with them. What about the rest? Why am I not able to get a 100 dimensional vector for every unique word?
The gensim Word2Vec class uses a default min_count of 5, meaning any words appearing fewer than 5 times in your corpus will be ignored. If you enable INFO level logging, you should see logged messages about this and other steps taken by the training.
Note that it's hard to learn meaningful vectors with few (on non-varied) usage examples. So while you could lower the min_count to 1, you shouldn't expect those vectors to be very good – and even trying to train them may worsen your other vectors. (Low-occurrence words can be essentially noise, interfering with the training of other word-vectors, where those other more-frequent words do have sufficiently numerous/varied examples to be better.)

What is the relation between topic modeling and document clustering?

Topic modeling identifies distribution of topics in a document collection, which effectively identifies the clusters in the collection. So is it right to say that topic modeling is a technique to do document clustering?
A topic is quite different from a cluster of docs, after all, a topic is not composed of docs.
However, these two techniques are indeed related. I believe Topic Modeling is a viable way of deciding how similar documents are, hence a viable way for document clustering.
In representing each document as a topic distribution (actually a vector), topic modeling techniques reduce the feature dimensionality from number of distinct words appeared (in a corpus) to the number of topics. Similarity between docs' Topic distributions can be calculated using Cosine metrics and many other metrics, which reflect the similarity of the docs themselves in terms of the topics/themes they cover. Based on this quantified similarity measure, many clustering algorithms can be applied to group the documents.
And in this sense, I think it is right to say that topic modeling is a technique to do document clustering.
The relation between clustering and classification is very similar to the relation between topic modeling and multi-label classification.
In single-label multi-class classification we assign just one label per each document. And in clustering we put each document in just one group. The fact is that we can't define the clusters in advance as we define labels. If we ignore this fact, grouping and labeling are essentially the same thing.
However, in real world problems flat classification is not sufficient. Often documents are related to multiple categories/classes. Thus we leverage the multi-label classification. Now, we can see the topic modeling as the unsupervised version of multi-label classification as we can put each document under multiple groups/topics. Here again, I'm ignoring the fact that we can't decide what topics to use as labels in advance.