Heavily unbalanced/skewed data clusters - cluster-analysis

I am facing some issues with my k-means clustering results on Alteryx. I am trying to conduct topic modelling on my data set of around 5000 text descriptions. After data cleaning, parsing and removing stop words and common words, I created a Document Term Matrix of 20 words and around 5000 documents.
After running K-Means clustering in Alteryx, no matter how many clusters I specify, there is always only 1 document in every cluster except one, which holds all the rest. For example:
2 Clusters
Cluster 1: 19 words
Cluster 2: 1 word
3 Clusters
Cluster 1: 18 words
Cluster 2: 1 word
Cluster 3: 1 word
5 Clusters
Cluster 1: 16 words
Cluster 2: 1 word
Cluster 3: 1 word
Cluster 4: 1 word
Cluster 5: 1 word
This happens no matter how many clusters I specify. Could someone help me work out whether these results mean my data has problems, or whether I did not use the correct settings?
Thanks in advance!

Did you look at your data after preprocessing?
Probably many documents are now empty, or contain just one word.
There is not much left to find except the common words.
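A minimal sketch of that check: count how many distinct terms survive in each row of the document-term matrix. The matrix below is a made-up toy example (rows are documents, columns are term counts), not the asker's data, and `summarize_rows` is a hypothetical helper name.

```python
from collections import Counter


def summarize_rows(dtm):
    """Count how many documents have 0, 1, 2, ... distinct terms left."""
    return Counter(sum(1 for count in row if count > 0) for row in dtm)


# Hypothetical toy document-term matrix: 5 documents x 4 terms.
dtm = [
    [0, 0, 0, 0],  # empty after preprocessing
    [1, 0, 0, 0],  # a single term left
    [0, 0, 2, 0],  # a single term left
    [1, 1, 0, 3],  # three distinct terms left
    [0, 0, 0, 0],  # empty after preprocessing
]

summary = summarize_rows(dtm)
for n_terms in sorted(summary):
    print(f"{summary[n_terms]} document(s) with {n_terms} distinct term(s)")
```

If most documents land in the 0 or 1 bucket, there is little structure left for k-means to separate.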

Related

Legal Hierarchical Quorums in Zookeeper

I am trying to understand hierarchical quorums in Zookeeper. I may not understand the example shown in the documentation (here). Are votes [from at least two servers from each of two different groups] enough to form a legal quorum?
In my opinion, the example here does not gain the majority of all the weight; it only gains more than 4 ballots. A legal quorum should earn more than 5 ballots (9/2+1).
I also read the source code. The algorithm implementation is shown from line 352 to line 371. Zookeeper only checks if all groups have a majority and if the number of selected groups is larger than half of the group number.
I may have found the answer in the documentation:
A different construction that uses weights and is useful in wide-area deployments (co-locations) is a hierarchical one. With this construction, we split the servers into disjoint groups and assign weights to processes. To form a quorum, we have to get a hold of enough servers from a majority of groups G, such that for each group g in G, the sum of votes from g is larger than half of the sum of weights in g. Interestingly, this construction enables smaller quorums. If we have, for example, 9 servers, we split them into 3 groups, and assign a weight of 1 to each server, then we are able to form quorums of size 4.
Note that two subsets of processes composed each of a majority of servers from each of a majority of groups necessarily have a non-empty intersection. It is reasonable to expect that a majority of co-locations will have a majority of servers available with high probability.
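The rule quoted above can be sketched as follows. This is an illustration of the description in the documentation, not the ZooKeeper source; `contains_quorum` and its parameters are hypothetical names, and the 9-server, 3-group layout with weight 1 is the example from the quote.

```python
def contains_quorum(acks, groups, weights):
    """Return True if the acknowledging servers form a hierarchical quorum.

    acks    -- set of server ids that voted
    groups  -- dict: group id -> list of server ids (groups are disjoint)
    weights -- dict: server id -> weight
    """
    majority_groups = 0
    for members in groups.values():
        group_weight = sum(weights[s] for s in members)
        acked_weight = sum(weights[s] for s in members if s in acks)
        # A group counts only if the votes exceed half of its total weight.
        if 2 * acked_weight > group_weight:
            majority_groups += 1
    # The groups that reached a majority must themselves be a majority.
    return 2 * majority_groups > len(groups)


groups = {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9]}
weights = {server: 1 for server in range(1, 10)}

two_from_two_groups = contains_quorum({1, 2, 4, 5}, groups, weights)
four_but_one_group = contains_quorum({1, 2, 3, 4}, groups, weights)
print(two_from_two_groups, four_but_one_group)
```

Under this rule, two servers from each of two groups form a quorum of size 4, while four servers that give only one group a majority do not.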

Find "complemented" bit vectors clusters

I have a huge list of bit vectors (BV) that I want to group in clusters.
The idea behind these clusters is to be able to pick BVs from each cluster later and combine them to generate a BV that is (almost) all ones (the number of ones must be maximized).
For example, imagine that 1 means an app is up and 0 means it is down on node X at a specific moment in time. We want to find the minimal list of nodes that keeps the app up:
App BV for node X in cluster 1: 1 0 0 1 0 0
App BV for node Y in cluster 2: 0 1 1 0 1 0
Combined BV for App (X+Y): 1 1 1 1 1 0
I have been checking the different clustering algorithms, but I did not find one that takes this "complementary" behavior into account, because in this case each column of the BV does not refer to a feature (it only means up or down in a specific timeframe).
Regarding other algorithms like k-means or hierarchical clustering, it is not clear to me whether I can include this consideration in the clustering algorithm for the later grouping.
Finally, I am using the Hamming distance to measure the intra-cluster and inter-cluster distances, since it seems to be the most appropriate metric for binary data. However, the results show that the clusters are neither tight nor well separated, so I wonder whether I am using the most suitable grouping/approximation method, or whether I should filter the input data before grouping.
Any clue or idea regarding the grouping/clustering method or filtering the data is welcome.
This does not at all sound like a clustering problem.
None of these algorithms will help you.
Instead, I would rather call this a match making algorithm. But I'd assume it is at least NP-hard (it resembles set cover) to find the true optimum, so you'll need to come up with a fast approximation. Best something specific to your use case.
Also, you haven't specified how to combine two 1s (you wrote +, but that likely isn't what you want). Is it XOR or OR? Nor have you said whether it is possible to combine more than two vectors, and what the cost of doing so is. One strategy would be to find the nearest neighbor of the inverse bit vector for each vector and always combine the best pair.
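As a rough sketch of such an approximation, here is the standard greedy set-cover heuristic. It assumes the vectors are combined with OR and that any number of them may be combined, neither of which the question confirms. Bit vectors are stored as Python integers, and `greedy_cover` and the node Z are hypothetical additions for illustration.

```python
def greedy_cover(vectors):
    """Greedily pick vectors whose bitwise OR covers as many positions as possible.

    vectors -- dict: name -> int used as a bit vector
    Returns (names in pick order, combined bit vector).
    """
    combined = 0
    chosen = []
    remaining = dict(vectors)
    while remaining:
        # Gain = number of 1 bits a vector adds that are not covered yet.
        gains = {
            name: bin(bits & ~combined).count("1")
            for name, bits in remaining.items()
        }
        best = max(gains, key=gains.get)
        if gains[best] == 0:
            break  # nothing left adds a new 1
        combined |= remaining.pop(best)
        chosen.append(best)
    return chosen, combined


# X and Y are the vectors from the question; Z is a made-up extra node.
vectors = {"X": 0b100100, "Y": 0b011010, "Z": 0b100000}
chosen, combined = greedy_cover(vectors)
print(chosen, format(combined, "06b"))
```

The heuristic does not guarantee the smallest possible list of nodes; it only avoids picking vectors that add nothing new.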

Why does Weka XMeans systematically underestimate the number of clusters?

I'm having some trouble with Weka's XMeans clusterer. I've talked to a couple of other humans and we all agree that there are six clusters in the screenshot below, or at least a minimum of two if you squint really hard. Either way, xMeans does not seem to agree.
XMeans seems to systematically underestimate the number of clusters, based on whatever I have the minimum cluster count set to. With the maximum number of clusters held at 100, here are the results I get:
-L 1 // 1 cluster
-L 2 // 2 clusters
-L 3 // 3 clusters
-L 4 // 5 clusters
-L 5 // 6 clusters
-L 6 // 6 clusters
Most egregiously, with -L 1 (and -H 100) only a single cluster is found. Only by getting the minimum cluster count to five do I actually see the six clusters. Cranking the improve-structure parameter way up (to 100,000) does not seem to have any effect. (I've also played with other options without seeing any difference.) Here are the options that generated the above screenshot, which found one center:
private static final String[] XMEANS_OPTIONS = {
    "-H", "100",     // max number of clusters (default 4)
    "-L", "1",       // min number of clusters (default 2)
    "-I", "100",     // max overall iterations (default 1)
    "-M", "1000000", // max improve-structure iterations (default 1000)
    "-J", "1000000", // max improve-parameter iterations (default 1000)
};
Obviously I'm missing something here. How do I make XMeans behave as expected?
I think this is what I was afraid of: a psychological problem (if you search for Gestalt theory, "Gestalttheorie", you might find some explanation of the perception aspect).
The human eye grasps the form of the clusters and finds six circles. However, k-means, and therefore x-means, only looks at the distance. Hence, the clusters look rather awkward. Also, after multiple restarts with 6-means I almost always obtained clusters like:
This one might be a local minimum, which might be solved by x-means
or for 3-means
Which is quite interesting as some points clearly violate the expectations.
If you use k-means in R, you can analyse the withinss of the cluster result. These values show that the awkward-looking clusters typically perform quite well. Hence, no convergence to a different result is to be expected.
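For readers without R at hand, the quantity that withinss reports (the sum of squared Euclidean distances from each point to its cluster centroid) can be sketched in a few lines. The four points and both labelings are made-up toy data, and `withinss` here is a hypothetical Python helper named after the R field.

```python
def withinss(points, labels):
    """Within-cluster sum of squared Euclidean distances to each centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    result = {}
    for label, members in clusters.items():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        result[label] = sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members
        )
    return result


# Toy data: two tight pairs of points, far apart along the x axis.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
by_x = withinss(points, [0, 0, 1, 1])  # split left / right
by_y = withinss(points, [0, 1, 0, 1])  # split bottom / top
print(sum(by_x.values()), sum(by_y.values()))
```

K-means prefers whichever labeling has the lower total, regardless of how the result looks to the eye.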
I think this could be resolved by using a different distance measure, for example the squared Euclidean distance, which enforces circle-like shapes,
or by using some kernel-based clustering technique with an RBF kernel.
===============================================
EDIT1:
However, one aspect of the Weka results is still rather awkward. I used RWeka to run a few experiments. Basically, I ran 100 cluster runs for each initial cluster size between 2 and 7. My expectation was that for 2 and 6 the clusters would be rather stable, while for the other cluster sizes I would have expected them to grow.
However, the results are different.
So basically 2 and 6 are fairly stable; however, the number of clusters is always expanded by at most 1.
So let's have a look at the BIC.
What we can observe is that the BIC does not increase when the cluster size increases; it does, however, depend strongly on the initial cluster size.
Let's drill down a bit further and look at an initial cluster size of 3. Running multiple restarts with different initial sizes reproducibly generates the following two situations:
Nevertheless, the results on the BIC seem to suggest that there is a bug in the BIC calculation.

Hierarchical quorums in Zookeeper

I am trying to understand hierarchical quorums in Zookeeper. The documentation here gives an example, but I am still not quite sure I understand it. My question is: if I have a two-node Zookeeper cluster (I know it is not recommended, but let's consider it for the sake of this example)
server.1 and
server.2,
can I have hierarchical quorums as follows:
group.1=1:2
weight.1=2
weight.2=2
With the above configuration:
Even if one node goes down, do I still have enough votes to maintain a quorum? Is this a correct statement?
What is the Zookeeper quorum value here (2, for two nodes, or 3, for 4 votes)?
In a second example, say I have:
group.1=1:2
weight.1=2
weight.2=1
In this case, if server.2 goes down, do I still have sufficient votes (2) to maintain a quorum?
As far as I understand from the documentation, once nodes are given weights, the majority is computed over the weights rather than over the number of nodes. For example, if there are 10 nodes and 3 of them hold 70 percent of the total weight, it is enough to have those three nodes active in the network. Hence:
No, you don't have a majority, since both nodes have an equal weight of 2. If one node goes down, only 50 percent of the weight is active, and a quorum needs more than half, so the quorum is not achieved.
The total weight is 4, and a quorum needs more than half of it, i.e. at least 3 votes. Since each node carries a weight of 2, both nodes need to be active to meet the quorum.
In the second example, server.1 alone holds 2 of the 3 votes (about 67 percent), which is more than half, so the quorum is still reached with only the node that has weight 2.
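The arithmetic for both configurations can be checked with a small sketch. It assumes the rule from the ZooKeeper documentation that the votes in a group must be larger than half of the sum of weights in that group, and it models only the single group used in the question; `has_quorum` is a hypothetical helper, not a ZooKeeper API.

```python
def has_quorum(alive, weights):
    """True if the alive servers hold strictly more than half of the total weight."""
    total = sum(weights.values())
    alive_weight = sum(w for server, w in weights.items() if server in alive)
    return 2 * alive_weight > total


first = {1: 2, 2: 2}   # weight.1=2, weight.2=2
second = {1: 2, 2: 1}  # weight.1=2, weight.2=1

first_one_down = has_quorum({1}, first)       # 2 of 4 votes
second_s2_down = has_quorum({1}, second)      # 2 of 3 votes
second_s1_down = has_quorum({2}, second)      # 1 of 3 votes
print(first_one_down, second_s2_down, second_s1_down)
```

With equal weights, exactly half of the votes is not a majority, so either node failing loses the quorum. With weights 2 and 1, only the failure of server.1 does.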

Fuzzy clustering without number of clusters

I'm looking for a fuzzy clustering algorithm that does not need the number of clusters to be specified. I used hierarchical clustering, but it produces "hard" clusters. I need something similar, but with the possibility that one element can belong to more than one cluster.
I searched Google for this some days ago and came across HFRECCA. I know about FRECCA, but I don't know the details of HFRECCA. You can read about it here: http://www.researchgate.net/publication/261081980_Text_Clustering_Using_HFRECCA_and_Rough_K-Means_Clustering_Algorithm