I'm currently learning clustering. I have perform k-mean cluster of average_duration_of_call of subscriber which I store on my database. On first run with 3 centers cluster1( 53.33369 sec)-367 subscriber, cluster2(121.67123 sec)-128 subscriber, cluster3(369.09000 sec)-8 subscriber.
Again I rerun the clustering with center 6 and center obtained are as cluster1(904.66670 sec) -1 subscriber, cluster2(27.7 sec) - 108 subscriber, cluster3(151.58)-43 subscriber, cluster4(95 sec) - 135 subscriber, cluster5(59.5 sec) - 207 subscriber, cluster6(278 sec)-9 subscriber.
Now my question is which is the best cluster and how to find best cluster. Any help from experience are expected (I'm currently using R language)
If you are beginner then I recommend you to start density based clustering so that initial value of K isn't required. You can initially start dbscan clustering using epsilon=10 and minpts= 5 and then check the number of generated clusters. After that, start a smooth increase of epsilon (11, 12, ... 15) and decrease of minpt (4, 3, ..1) and check the number of generated clusters each time. Then the average of these numbers are supposed to reflect the average number of real clusters.
But if you need to apply k-mean clustering then you might find Selection of K in K-means clustering paper useful.
Well, k-means already computes a score for your, the sum-of-squares.
Choose the result that achieved the better score.
However, when you increase k it is natural that the score improves. Obviously, if you set k to the data set size, it will be 0. You then may want to use the BIC or the Silhouette Coefficient (look it up on Wikipedia).
Oh, and consider using a book. This is a classic question, and it should be covered in any good book.
Related
I tried to use K-mean with a high-dimension dataset (CDR data).
After clustering, I would like to represent each cluster with the most informative features which can show the unique/representative characteristic of customers in that cluster.
For example,
Cluster 1: [High: call_duration], [Low: number_of_friends], [High: call_at_night]
Cluster 2: [Low: call_duration], [High: use_promotion]
Cluster 3: [High: internet_usage]
I would like to know that ...
Question 1: How I can find those informative features which can represent each cluster?
Question 2: If there are many informative features, how to measure which one is more representative?
Another problem is "how to measure whether the value is high or low?"
My current solution is applying z-normalization to every feature in every cluster centroids, then I assume that
<-2σ or >2σ is outlier
(-2σ to -1σ) or (1σ to 2σ) is low/high
-1σ to 1σ is medium
Question 3: Is this measurement make sense? Please give me your suggestion.
Train a decision tree to discriminate the clusters.
Or any other feature selection method for classification, because this is now a classification problem.
Hi I am trying to cluster using linkage(). Here is the code I am trying..
Y = pdist(data);
Z = linkage(Y);
T = cluster(Z,'maxclust',4096);
I am getting error as follows
The number of elements exceeds the maximum allowed size in
MATLAB.
Error in ==> linkage at 135
Z = linkagemex(Y,method);
data size is 56710*128. How can I apply the code on small chunks of data and then merge those clusters optimally?? Or any other solution to the problem.
Matlab probably cannot cluster this many objects with this algorithm.
Most likely they use distance matrixes in their implementation. A pairwise distance matrix for 56710 objects needs 56710*56709/2=1,607,983,695 entries, or some 12 GB of RAM; most likely also a working copy of this is needed. Chances are that the default Matlab data structures are not prepared to handle this amount of data (and you won't want to wait for the algorithm to finish either; probably that is why they "allow" only a certain amount).
Try using a subset, and see how well it scales. If you use 1000 instances, does it work? How long does the computation take? If you increase to 2000, how much longer does it take?
I'm working on a feed forward artificial neural network (ffann) that will take input in form of a simple calculation and return the result (acting as a pocket calculator). The outcome wont be exact.
The artificial network is trained using genetic algorithm on the weights.
Currently my program gets stuck at a local maximum at:
5-6% correct answers, with 1% error margin
30 % correct answers, with 10% error margin
40 % correct answers, with 20% error margin
45 % correct answers, with 30% error margin
60 % correct answers, with 40% error margin
I currently use two different genetic algorithms:
The first is a basic selection, picking two random from my population, naming the one with best fitness the winner, and the other the loser. The loser receives one of the weights from the winner.
The second is mutation, where the loser from the selection receives a slight modification based on the amount of resulting errors. (the fitness is decided by correct answers and incorrect answers).
So if the network outputs a lot of errors, it will receive a big modification, where as if it has many correct answers, we are close to a acceptable goal and the modification will be smaller.
So to the question: What are ways I can prevent my ffann from getting stuck at local maxima?
Should I modify my current genetic algorithm to something more advanced with more variables?
Should I create additional mutation or crossover?
Or Should I maybe try and modify my mutation variables to something bigger/smaller?
This is a big topic so if I missed any information that could be needed, please leave a comment
Edit:
Tweaking the numbers of the mutation to a more suited value has gotten be a better answer rate but far from approved:
10% correct answers, with 1% error margin
33 % correct answers, with 10% error margin
43 % correct answers, with 20% error margin
65 % correct answers, with 30% error margin
73 % correct answers, with 40% error margin
The network is currently a very simple 3 layered structure with 3 inputs, 2 neurons in the only hidden layer, and a single neuron in the output layer.
The activation function used is Tanh, placing values in between -1 and 1.
The selection type crossover is very simple working like the following:
[a1, b1, c1, d1] // Selected as winner due to most correct answers
[a2, b2, c2, d2] // Loser
The loser will end up receiving one of the values from the winner, moving the value straight down since I believe the position in the array (of weights) matters to how it performs.
The mutation is very simple, adding a very small value (currently somewhere between about 0.01 and 0.001) to a random weight in the losers array of weights, with a 50/50 chance of being a negative value.
Here are a few examples of training data:
1, 8, -7 // the -7 represents + (1+8)
3, 7, -3 // -3 represents - (3-7)
7, 7, 3 // 3 represents * (7*7)
3, 8, 7 // 7 represents / (3/8)
Use a niching techniche in the GA. A useful alternative is niching. The score of every solution (some form of quadratic error, I think) is changed in taking account similarity of the entire population. This maintains diversity inside the population and avoid premature convergence an traps into local optimum.
Take a look here:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.100.7342
A common problem when using GAs to train ANNs is that the population becomes highly correlated
as training progresses.
You could try increasing mutation chance and/or effect as the error-change decreases.
In English. The population becomes genetically similar due to crossover and fitness selection as a local minim is approached. You can introduce variation by increasing the chance of mutation.
You can do a simple modification to the selection scheme: the population can be viewed as having a 1-dimensional spatial structure - a circle (consider the first and last locations to be adjacent).
The production of an individual for location i is permitted to involve only parents from i's local neighborhood, where the neighborhood is defined as all individuals within distance R of i. Aside from this restriction no changes are made to the genetic system.
It's only one or a few lines of code and it can help to avoid premature convergence.
References:
TRIVIAL GEOGRAPHY IN GENETIC PROGRAMMING (2005) - Lee Spector, Jon Klein
I have to check performance of various clustering algos using different performance operators in rapidminer. For that I want to know the following things:
what does cluster number index value shows which is output of cluster count performance operator?
what does small and large value of avg within cluster distance and avg. within centroid distance mean in terms of good and bad clustering?
I also want to check other indexes value like Dunn index,Jaccard index, Fowlkes–Mallows for various clustering algos. but rapidminer don't have any operator for this, what to do for that. I don't have experience with R.
I have copied part of the answer I gave on the Rapid-I forum
The cluster number index is the count of clusters - pointless you might say but when used with DBSCAN, it can be quite interesting http://rapidminernotes.blogspot.co.uk/2010/12/counting-clusters.html
The avg within cluster and centroid distances are hard to interpret - one thing to search for is "elbow criterion" in this context. As the number of clusters varies, note how the validity measure changes and look for an "elbow" that marks the point where the natural progression of the measure dominates the structure.
R has many validity measures and it's worth investing some time because you can always call the R process from RapidMiner which makes it easier to work out what is going on.
Is there a online version of the k-Means clustering algorithm?
By online I mean that every data point is processed in serial, one at a time as they enter the system, hence saving computing time when used in real time.
I have wrote one my self with good results, but I would really prefer to have something "standardized" to refer to, since it is to be used in my master thesis.
Also, does anyone have advice for other online clustering algorithms?
(lmgtfy failed ;))
Yes there is. Google failed to find it because it's more commonly known as "sequential k-means".
You can find two pseudo-code implementations of sequential K-means in this section of some Princeton CS class notes by Richard Duda. I've reproduced one of the two implementations below:
Make initial guesses for the means m1, m2, ..., mk
Set the counts n1, n2, ..., nk to zero
Until interrupted
Acquire the next example, x
If mi is closest to x
Increment ni
Replace mi by mi + (1/ni)*( x - mi)
end_if
end_until
The beautiful thing about it is that you only need to remember the mean of each cluster and the count of the number of data points assigned to the cluster. Once you update those two variables, you can throw away the data point.
I'm not sure where you would be able to find a citation for it. I would start looking in Duda's classic text Pattern Classification and Scene Analysis or the newer edition Pattern Classification. If it's not there, you could try Chris Bishop's newest book or Daphne Koller and Nir Friedman's recent text.