Find number of clusters in DBLP dataset - cluster-analysis

I am trying to find the number of clusters in the DBLP V11 dataset using the field-of-study information.
I've tried pretrained doc2vec embeddings and averaged pretrained word2vec embeddings, clustered the results with DBSCAN and hierarchical clustering, and estimated the number of clusters with the elbow method, the silhouette method, and gap statistics.
I get one or two clusters from this because all the articles are computer-science related, but I need to find the number of subfields within computer science.
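For concreteness, a minimal sketch of this kind of pipeline, assuming scikit-learn (the doc_vectors matrix below is only a placeholder for the real doc2vec / averaged word2vec embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(500, 100))  # placeholder for the real document embeddings

# Silhouette analysis over a range of candidate cluster counts.
for k in range(2, 21):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(doc_vectors)
    print(k, silhouette_score(doc_vectors, labels))
```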

There is no such thing as "the" number of clusters in such data.
Instead, many answers are correct. Or none.
Is machine learning part of artificial intelligence? Is deep learning a separate topic? And data science? How is data science different from statistics? Doesn't statistics have lots of subtopics? What about big data, and how does it relate to data science? Isn't data mining the same as data science? Humans won't all agree on all of these topics either.

Related

Clustering in Weka

I have some data collected using an online survey, so there are no classes/labels in the data to evaluate clustering results. I am clustering in order to group participants for another task.
In the data, I have 10 attributes (such as Age, Gender, etc.) and 111 examples or data points.
It's my first time performing clustering, and it has been difficult to find potential clusters in the data.
Here are the steps I have performed in Weka:
I have tried to cluster the data using all attributes, with every clusterer in Weka (Cobweb, EM, etc.) and with different numbers of clusters (1-10). When I visualise the clusters, they don't make any sense and the data are widely spread across the x and y axes.
I have applied PCA and selected different combinations of attributes according to the ranks obtained from PCA. The best clustering result was obtained using k-means with a combination of only 2 attributes, 3 clusters, and a seed of 7 (sorry, I have no idea what the seed is).
My Questions:
Are the steps I performed to cluster the data correct? If not, please give me advice.
Is this considered as a good clustering result?
How can I optimise or enhance my clusters?
What is meant by the seed in Weka clustering?
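As a hedged illustration of the last question (in scikit-learn rather than Weka, so treat it only as an analogy): the seed controls the random initialization of k-means, so different seeds can produce different clusterings of the same data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(111, 10))              # placeholder for the 111 survey responses

X2 = PCA(n_components=2).fit_transform(X)   # keep only the top-ranked components

# The seed fixes the random choice of initial centroids; with n_init=1 its
# effect on the final clustering is easy to see.
for seed in (1, 7, 42):
    km = KMeans(n_clusters=3, random_state=seed, n_init=1).fit(X2)
    print("seed", seed, "inertia", km.inertia_)
```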

How to cluster data using self-organising maps?

Suppose that we train a self-organising map (SOM) on a given dataset. Would it make sense to cluster the neurons of the SOM instead of the original data points? This question came to me after reading this paper, in which the following is stated:
The most important benefit of this procedure is that computational load decreases considerably, making it possible to cluster large data sets and to consider several different preprocessing strategies in a limited time. Naturally, the approach is valid only if the clusters found using the SOM are similar to those of the original data.
In this answer it is clearly stated that SOMs don't include clustering, but that some clustering procedure can be applied to the SOM after it has been trained. I took this to mean that the clustering is done on the neurons of the SOM, which are in some sense a mapping of the original data, but I'm not sure about this. So, what I want to know is:
Is it correct to cluster data performing the clustering algorithm on the trained neuron weights as datapoints? If not, how is clustering done using a SOM then?
What characteristics should a dataset have, in general, for this approach to be useful?
Yes, the usual approach seems to be either hierarchical clustering or k-means on the neurons (you'll need to dig up how it was originally done - as seen in the paper you linked, many variants, including two-level approaches, have been explored since). If you consider SOMs to be a quantization and projection technique, all of these approaches are valid to use.
It's cheaper because the SOM units are just 2-dimensional, Euclidean, and far fewer in number than the original points. So that is well in line with the source you have.
Note that a SOM neuron may be empty if it lies in between two extremely well-separated clusters.
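A rough sketch of this two-level procedure, assuming the MiniSom library (the grid size, data, and number of clusters are made up for illustration): cluster the neuron weight vectors, then let each original point inherit the cluster of its best-matching neuron.

```python
import numpy as np
from minisom import MiniSom          # assumption: using the MiniSom package
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(5000, 20))   # placeholder for the real data

# Train a 10x10 map, then cluster the 100 codebook (neuron weight) vectors.
som = MiniSom(10, 10, 20, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(data, 1000)
codebook = som.get_weights().reshape(-1, 20)
neuron_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(codebook)

# Each original point inherits the cluster of its best-matching neuron.
point_labels = np.array(
    [neuron_labels[np.ravel_multi_index(som.winner(x), (10, 10))] for x in data]
)
```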

What is the relation between topic modeling and document clustering?

Topic modeling identifies the distribution of topics in a document collection, which effectively identifies the clusters in the collection. So is it right to say that topic modeling is a technique for document clustering?
A topic is quite different from a cluster of docs; after all, a topic is not composed of docs.
However, the two techniques are indeed related. I believe topic modeling is a viable way of deciding how similar documents are, and hence a viable basis for document clustering.
By representing each document as a topic distribution (in effect, a vector), topic modeling techniques reduce the feature dimensionality from the number of distinct words appearing in the corpus to the number of topics. Similarity between documents' topic distributions can be calculated with cosine similarity and many other metrics, and it reflects how similar the documents themselves are in terms of the topics/themes they cover. Based on this quantified similarity measure, many clustering algorithms can be applied to group the documents.
And in this sense, I think it is right to say that topic modeling is a technique to do document clustering.
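A minimal sketch of that pipeline, assuming scikit-learn and a toy corpus: LDA produces the topic distributions, and k-means then groups the documents in that reduced space (an agglomerative method with cosine affinity would work the same way).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

docs = [
    "neural networks for image recognition",
    "convolutional deep learning models",
    "bayesian inference and probabilistic models",
    "markov chain monte carlo sampling",
]  # toy corpus for illustration

counts = CountVectorizer().fit_transform(docs)
# Each row of topic_dist is a document's distribution over the topics.
topic_dist = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topic_dist)
print(labels)
```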
The relation between clustering and classification is very similar to the relation between topic modeling and multi-label classification.
In single-label multi-class classification we assign just one label to each document, and in clustering we put each document in just one group. The difference is that we can't define the clusters in advance the way we define labels. If we ignore this, grouping and labeling are essentially the same thing.
However, in real-world problems flat classification is not sufficient; documents are often related to multiple categories/classes, so we turn to multi-label classification. In that light, we can see topic modeling as the unsupervised version of multi-label classification, since we can put each document under multiple groups/topics. Here again, I'm ignoring the fact that we can't decide in advance which topics to use as labels.

Clustering of data - Pre- processing of data

These days I am using some clustering algorithms, and I wanted to ask a question related to this field. Maybe those who work in this field already have the answer.
For clustering I have some training data which I am going to cluster. The number of iterations (e.g. in the k-means algorithm) depends on the amount of training data (the number of vectors). Is there any method to find the most important data in the training set? What I mean is: instead of training k-means on all the data, maybe there is a method to find just the important vectors (those that affect the clusters the most) and use only these "important" vectors from the training data to train the algorithm.
I hope you understood me.
Thank You for reading and trying to answer.
"Training" and "Test" data is a concept from classification, not from cluster analysis.
K-means is a statistical method. If you want to speed it up, running it on a large enough random sample should give you nearly the same result.
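A small sketch of the sampling idea, assuming scikit-learn: fit k-means on a random subsample, then assign every point to the nearest learned centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 16))   # placeholder for the full dataset

# Fit on a random subsample, then assign all points to the learned centroids.
sample = X[rng.choice(len(X), size=5_000, replace=False)]
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(sample)
labels = km.predict(X)
```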

Why we need training and test datasets in research?

I'm a newbie in the research area of data mining (text clustering) and I have a couple of questions regarding training and test datasets.
Does clustering need training and testing datasets?
Why do we need to separate data into training and test datasets?
Sorry for the rookie questions; I hope the experts in this group can help me.
As your question is on clustering:
In cluster analysis, there usually is no training or test data split.
You do cluster analysis when you do not have labels, so you cannot "train".
Training is a concept from machine learning, and train-test splitting is used to avoid overfitting.
But if you are not learning labels, you cannot overfit.
Properly used cluster analysis is a knowledge discovery method. You want to discover some new structure in your data, not rediscover something that is already labeled.
To train a model you need a set of relevant data that is similar, but not identical, to your testing data. For example, you could split your data so that 0.7 of it is used for training and the rest for testing. This allows your algorithm to get a feel for what it should be looking for. The remaining 0.3 can be used for testing, since it is (hopefully) a distinct set of information on which the algorithm can test itself.
Why split it up?
Well, if you train on data A and then test your algorithm on data A, it will be able to identify everything correctly, because that is exactly what it was trained on.
For example, if when learning addition you were given the sums 3+4, 4+5, and 6+9, which you correctly solved, it would be redundant to test your knowledge of addition using the same sums.
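A small sketch of the 0.7 / 0.3 split described above, assuming scikit-learn and labeled data (again, for clustering itself such a split is usually unnecessary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))        # placeholder features
y = rng.integers(0, 2, size=200)     # placeholder labels

# 0.7 of the data is used for training, the remaining 0.3 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)
# Fit a classifier on X_train / y_train, then evaluate it on X_test / y_test.
```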
Further information:
http://en.wikipedia.org/wiki/Natural_language_processing
http://www.nltk.org/book
Hope this helps.