What is the relation between topic modeling and document clustering?

Topic modeling identifies the distribution of topics in a document collection, which effectively identifies the clusters in the collection. So is it right to say that topic modeling is a technique for document clustering?

A topic is quite different from a cluster of docs; after all, a topic is not composed of docs.
However, the two techniques are indeed related. I believe topic modeling is a viable way of deciding how similar documents are, and hence a viable way to do document clustering.
By representing each document as a topic distribution (in practice, a vector), topic modeling techniques reduce the feature dimensionality from the number of distinct words appearing in the corpus to the number of topics. Similarity between documents' topic distributions can be computed with cosine similarity and many other metrics, and it reflects the similarity of the documents themselves in terms of the topics/themes they cover. Based on this quantified similarity measure, many clustering algorithms can be applied to group the documents.
In this sense, I think it is right to say that topic modeling is a technique to do document clustering.
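As a concrete illustration, here is a minimal Python sketch of that pipeline using scikit-learn; the toy four-document corpus and the choice of two topics and two clusters are arbitrary assumptions for the example:

    # Represent documents as topic distributions, then cluster them.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today",
        "investors worry about the market",
    ]

    # Bag-of-words features: dimensionality = number of distinct words.
    counts = CountVectorizer().fit_transform(docs)

    # Reduce each document to a distribution over (here) 2 topics.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    topic_dist = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

    # Document similarity in topic space.
    print(cosine_similarity(topic_dist).round(2))

    # Any clustering algorithm can now run on the reduced features.
    print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topic_dist))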

The relation between clustering and classification is very similar to the relation between topic modeling and multi-label classification.
In single-label multi-class classification we assign exactly one label to each document, and in clustering we put each document in exactly one group. The difference is that we can't define the clusters in advance the way we define labels. If we set this aside, grouping and labeling are essentially the same thing.
However, in real-world problems flat classification is often not sufficient: documents frequently relate to multiple categories/classes, so we turn to multi-label classification. In the same way, we can see topic modeling as the unsupervised version of multi-label classification, since each document can be placed under multiple groups/topics. Here again, I'm setting aside the fact that we can't decide in advance which topics to use as labels.
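To make the analogy concrete, here is a tiny hedged sketch (NumPy, with a made-up topic distribution and an arbitrary 0.2 cutoff) showing how thresholding a document's topic distribution yields a multi-label assignment:

    import numpy as np

    # One document's distribution over 4 topics (made-up numbers).
    topic_dist = np.array([0.55, 0.30, 0.10, 0.05])

    # Every topic above the (arbitrary) cutoff becomes a "label".
    multi_labels = np.where(topic_dist >= 0.2)[0]
    print(multi_labels)  # -> [0 1]: the document carries topics 0 and 1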

Related

Latent Dirichlet Allocation and Analyzing Two Data Sets using MALLET

I am currently analyzing two datasets. Dataset A has about 600,000+ documents, whereas Dataset B has about 7,000+ documents. Does this mean that the topic outputs will be more about Dataset A because it has a larger N? The output of MALLET in RapidMiner still accounts for which documents fall under each topic. I wonder if there is a way to give the two datasets equal weight?
I am assuming you're mixing the two datasets together in one training corpus and performing the training on it. Under this assumption, it is very likely that the topic outputs will be more about dataset A than about B, as Gibbs sampling constructs topics according to the co-occurrence of tokens, most of which come from A. That said, overlap between topics, or similar topics across the two datasets, is also possible.
You can downsample dataset A so that it has the same number of documents as B, assuming their topic structures are not too different. Alternatively, you can check the log output from the --output-state parameter to see exactly which topic (z) is assigned to each token.
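For the downsampling route, here is a minimal Python sketch under the assumption that each dataset is simply a list of document strings (docs_a and docs_b are hypothetical stand-ins):

    import random

    random.seed(0)
    docs_a = [f"dataset A doc {i}" for i in range(600_000)]  # toy stand-in for A
    docs_b = [f"dataset B doc {i}" for i in range(7_000)]    # toy stand-in for B

    # Sample A down to the size of B so both corpora carry roughly equal
    # weight in the combined corpus handed to MALLET for training.
    docs_a_sampled = random.sample(docs_a, len(docs_b))
    training_corpus = docs_a_sampled + docs_b
    print(len(training_corpus))  # 14000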

Find number of clusters in DBLP dataset

I am trying to find the number of clusters in the DBLP V11 dataset using field of study.
I've tried pretrained doc2vec and averaged pretrained word2vec embeddings, clustered the results using DBSCAN and hierarchical clustering, and estimated the number of clusters with the elbow method, silhouette method, and gap statistics.
I get one or two clusters from this because all the articles are computer-science related, but I need to find the number of subfields within computer science.
There is not "the" number of clusters in such data.
Instead, many answers are correct. Or none.
Is machine learning part of artificial intelligence? Is deep learning a separate topic? And data science? How is data science different from statistics? Doesn't statistics have lots of subtopics? What about big data, and how does it relate to data science? Isn't data mining the same as data science? Humans won't agree on all of these questions either.
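To see this empirically, here is a hedged Python sketch of the silhouette scan the question describes, run on random stand-in vectors; on heterogeneous document embeddings, several values of k often score about equally well, which is exactly the ambiguity described above:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 50))  # stand-in for doc2vec/word2vec vectors

    # Scan candidate cluster counts; no k stands out clearly.
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))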

What is the type or family of recsys algorithms for recommending similar users based on their interests?

I am learning recommendation systems from a Coursera MOOC. I see there are three major types of filtering methods (in the introductory course):
a. Content-based filtering
b. Item-Item collaborative filtering
c. User-User collaborative filtering
Having understood this, I am not sure where similar-user recommendation based on interests/preferences belongs. For example, suppose I have a User->TopicsOfInterest0..n relation. I want to recommend other similar users based on their respective TopicsOfInterest (vector).
I'm not sure that these three types are an exhaustive classification of all recommender systems.
In fact, any matrix-factorization-based algorithm (SVD, etc.) is both item-based and user-based at the same time, but the TopicsOfInterest (factors) are inferred automatically by the algorithm. For example, Apache Spark includes an implementation of the alternating least squares (ALS) algorithm. Spark's API has the userFeatures method, which returns (roughly) a matrix describing each user's affinity to each latent feature.
The only thing left to do is to compute the set of users most similar to a given one (e.g. find the vectors closest to the given one by cosine similarity).
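Here is a minimal NumPy sketch of that last step, assuming user_factors stands in for the user-by-factor matrix you would get from ALS (random numbers here for brevity):

    import numpy as np

    rng = np.random.default_rng(0)
    user_factors = rng.normal(size=(1000, 20))  # stand-in for userFeatures output

    def most_similar_users(user_id, factors, top_k=5):
        # Normalize rows so dot products equal cosine similarities.
        unit = factors / np.linalg.norm(factors, axis=1, keepdims=True)
        sims = unit @ unit[user_id]
        sims[user_id] = -np.inf  # exclude the query user itself
        return np.argsort(sims)[::-1][:top_k]

    print(most_similar_users(42, user_factors))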

choose the proper clustering method for Latent Semantic Analysis

I want to cluster some text documents to find documents with the same concept. I've computed semantic similarity using Latent Semantic Analysis (LSA), but I'm not sure which clustering method I should choose for my purpose.
Thank you
You can use hierarchical clustering. There is an R package called Rclusterpp which is very efficient for hierarchical clustering of large data (it computes in parallel). You can then cut the dendrogram at different numbers of clusters within the plausible range and check the cluster profiles using a cross-tab.
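If you work in Python rather than R, the same workflow can be sketched with scipy as a stand-in (random vectors replace your LSA output, and the Ward linkage is an illustrative choice):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    lsa_vectors = rng.normal(size=(100, 50))  # stand-in for your LSA document vectors

    Z = linkage(lsa_vectors, method="ward")   # build the dendrogram once
    for k in range(2, 6):                     # cut it at several cluster counts
        labels = fcluster(Z, t=k, criterion="maxclust")
        print(k, np.bincount(labels)[1:])     # cluster sizes for each cut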

strategies and clustering algorithms for topic detection

I want to know good strategies or algorithms to solve the following problem:
What I have is:
A set of news articles from different sources with a time-stamp and a weighted vector of news categories for each article.
What I want is:
Clusters of articles from different sources that deal with the same topic.
I basically want to replicate the key feature of Google News: presenting topics and listing different news sources for the same topic.
I already have nice features for the articles, like the above-mentioned vector of news categories; what I need to do now is choose the right strategy, clustering algorithm, and library to do the clustering.
Features the clustering algorithm should have:
No fixed number of clusters (I don't know in advance how many topics are present in my article set).
Efficiently map new articles to existing clusters, or create a new cluster if an article doesn't fit well enough into any existing one.
Take the time-stamps of articles into account for similarity.
Dissolve clusters as they become outdated and their articles are removed from the underlying article set.
I've never done any clustering, so I don't know if there is a clustering algorithm that provides the above features, or whether some of these features are too complicated or would make clustering way too slow, in which case I'd need a workaround.
Right now I'm looking at Mahout as a library for clustering. Are there any ready-to-use open-source implementations for topic detection with Mahout, or maybe with another library?
I think the following paper is one of the best approaches I have yet encountered for topic detection when you do not know the number of clusters in advance:
http://www.uni-weimar.de/medien/webis/research/events/tir-08/tir08-papers-final/wartena08-topic-detection-by-clustering-keywords.pdf
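For a feel of what such an approach involves, here is a hedged Python sketch of the incremental, threshold-based clustering the question describes: no fixed cluster count, and a new article joins the closest existing cluster if it is similar enough, otherwise it starts a new one. The 0.5 threshold and the running-mean centroid update are illustrative choices, and time-stamp weighting and cluster expiry are omitted for brevity:

    import numpy as np

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    class IncrementalClusterer:
        def __init__(self, threshold=0.5):
            self.threshold = threshold
            self.centroids = []  # one centroid vector per cluster
            self.members = []    # article indices per cluster

        def add(self, idx, vec):
            if self.centroids:
                sims = [cosine(vec, c) for c in self.centroids]
                best = int(np.argmax(sims))
                if sims[best] >= self.threshold:
                    n = len(self.members[best])
                    # Running-mean update of the cluster centroid.
                    self.centroids[best] = (self.centroids[best] * n + vec) / (n + 1)
                    self.members[best].append(idx)
                    return best
            self.centroids.append(vec.astype(float))  # start a new cluster
            self.members.append([idx])
            return len(self.centroids) - 1

    rng = np.random.default_rng(0)
    articles = rng.normal(size=(20, 10))  # stand-in category vectors
    clusterer = IncrementalClusterer()
    for i, v in enumerate(articles):
        clusterer.add(i, v)
    print(len(clusterer.members), "clusters")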