I am using Mallet api to extract topic from twitter data and I have already
extracted topics which are seems good topic. But I am facing problem
to estimating K.
For example I fixed K value from 10 to 100.
So, I have taken different number of topics from the data.
But, now I would like to estimate which K is best.
There are some algorithm I know as
Perplexity
Empirical likelihood
Marginal likelihood (Harmonic mean method)
Silhouette
I found a method model.estimate() which may be used to estimate with different value of K.
But I am not getting any idea to show the value of K is best for the model.
Does anyone give some idea about it with some sample code? Thanks.
Related
I was reading through all (or most) previously asked questions, but couldn't find an answer to my problem...
I have 13 variables measured on an ordinal scale (thy represent knowledge transfer channels), which I want to cluster (HCA) for a following binary logistic regression analysis (including all 13 variables is not possible due to sample size of N=208). A Factor Analysis seems inappropriate due to the scale level. I am using SPSS (but tried R as well).
Questions:
1: Am I right in using the Chi-Squared measure for count data instead of the (squared) euclidian distance?
2. How can I justify a choice of method? I tried single, complete, Ward and average, but all give different results and I can't find a source to base my decision on.
Thanks a lot in advance!
Answer 1: Since the variables are on ordinal scale, the chi-square test is an appropriate measurement test. Because, "A Chi-square test is designed to analyze categorical data. That means that the data has been counted and divided into categories. It will not work with parametric or continuous data (such as height in inches)." Reference.
Again, ordinal scaled data is essentially count or frequency data you can use regular parametric statistics: mean, standard deviation, etc Or non-parametric tests like ANOVA or Mann-Whitney U test to compare 2 groups or Kruskal–Wallis H test to compare three or more groups.
Answer 2: In a clustering problem, the choice of distance method solely depends upon the type of variables. I recommend you to read these detailed posts 1, 2,3
I am working on a UCI data set about wine quality. I have applied multiple classifiers and k-nearest neighbor is one of them. I was wondering if there is a way to find the exact value of k for nearest neighbor using 5-fold cross validation. And if yes, how do I apply that? And how can I get the depth of a decision tree using 5-fold CV?
Thanks!
I assume here that you mean the value of k that returns the lowest error in your wine quality model.
I find that a good k can depend on your data. Sparse data might prefer a lower k whereas larger datasets might work well with a larger k. In most of my work, a k between 5 and 10 have been quite good for problems with a large number of cases.
Trial and Error can at times be the best tool here, but it shouldn't take too long to see a trend in the modelling error.
Hope this Helps!
These days I am using some clustering algorithm and I just wanted to ask a question related to this field. Maybe those who are working in this field already have this answer.
During clustering I need to have some training data which I am going to cluster. The number of iterations (e.x. K-Means algorithm) is depended on the number of training data(number of vectors). Is there any method to find the most important data from training data. What I mean is: Instead of training the K-Means with all the data maybe there is a method to find just the important vectors (those vectors who affect most the clusters) and use these "important" vectors(from training data) to traing the algorithm.
I hope you understood me.
Thank You for reading and trying to answer.
"Training" and "Test" data is a concept from classification, not from cluster analysis.
K-means is a statistical method. If you want to speed it up, running it on a large enough random sample should give you nearly the same result.
I'm currently using Matlab's k nearest neighbors classifier (knnclassify) to train and test binary attributes. The default value argument for k if none provided is 1 and one can choose other values of k. I've done research online and on stackoverflow but nothing relevant came up to address my question for what value of k would be of best use. Is there a built in function that can tell me that for my particular data or is it simply guess and wait to see what accuracy is derived? Any help will be greatly appreciated.
Here is the link to matlab's knnclassify documentation: knnclassify
What you have here is a typical model selection problem. What you want is to pick the k that gives you the lowest overall error on your data. Larger values of k generalize better, and smaller values may tend to overfit.
Hence, cross-validation is a good way to choose this parameter and I found the this article, which seems like a reasonable method.
I am trying to differentiate two populations. Each population is an NxM matrix in which N is fixed between the two and M is variable in length (N=column specific attributes of each run, M=run number). I have looked at PCA and K-means for differentiating the two, but I was curious of the best practice.
To my knowledge, in K-means, there is no initial 'calibration' in which the clusters are chosen such that known bimodal populations can be differentiated. It simply minimizes the distance and assigns the data to an arbitrary number of populations. I would like to tell the clustering algorithm that I want the best fit in which the two populations are separated. I can then use the fit I get from the initial clustering on future datasets. Any help, example code, or reading material would be appreciated.
-R
K-means and PCA are typically used in unsupervised learning problems, i.e. problems where you have a single batch of data and want to find some easier way to describe it. In principle, you could run K-means (with K=2) on your data, and then evaluate the degree to which your two classes of data match up with the data clusters found by this algorithm (note: you may want multiple starts).
It sounds to like you have a supervised learning problem: you have a training data set which has already been partitioned into two classes. In this case k-nearest neighbors (as mentioned by #amas) is probably the approach most like k-means; however Support Vector Machines can also be an attractive approach.
I frequently refer to The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) by Trevor Hastie (Author), Robert Tibshirani (Author), Jerome Friedman (Author).
It really depends on the data. But just to let you know K-means does get stuck at local minima so if you wanna use it try running it from different random starting points. PCA's might also be useful how ever like any other spectral clustering method you have much less control over the clustering procedure. I recommend that you cluster the data using k-means with multiple random starting points and c how it works then you can predict and learn for each the new samples with K-NN (I don't know if it is useful for your case).
Check Lazy learners and K-NN for prediction.