Clustering/Grouping Challenge - clustering pairs into groups - cluster-analysis

I have a clustering challenge ...
I have many pairs of data (e.g. A<-->B, C<-->D, E<-->F, A<-->F and so on)
I need to group/cluster them into N groups, e.g. Group #1: A, B, F; Group #2: C, D.
The clustering should be based on the given pair associations (i.e. A and B are paired).
Any idea? I'm rather sure there are algorithms for that, but not sure how to look for them.
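One way to read this, if the number of groups can simply be whatever the pair associations imply, is as the classic connected-components problem: each item is a node, each pair is an edge, and every connected component is one group. A minimal sketch in Python using a union-find (disjoint-set) structure, with the pairs from the question, might look like this:

from collections import defaultdict

def group_pairs(pairs):
    # Union-find: parent[x] points toward the representative of x's group.
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    groups = defaultdict(set)
    for item in parent:
        groups[find(item)].add(item)
    return list(groups.values())

pairs = [("A", "B"), ("C", "D"), ("E", "F"), ("A", "F")]
print(group_pairs(pairs))  # two groups: {A, B, E, F} and {C, D}

Note that with these pairs A, B, E and F all land in one group, because A-F and E-F chain them together; if you need to force exactly N groups regardless of the associations, that becomes a different (graph partitioning) problem.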

Related

Consistent hashing, why are Vnodes a thing?

My understanding of consistent hashing is that you take a key space, hash the key and then mod by say 360, and place the values in a ring. Then you equally space nodes on that ring. You pick the node to handle this key by looking clockwise from where your hashed key landed.
Then many explanations go on to describe Vnodes. The Riak docs, which refer to the Dynamo paper, say:
The basic consistent hashing algorithm presents some challenges. First, the random position assignment of each node on the ring leads to non-uniform data and load distribution.
Then they go on to propose Vnodes as a way to ensure uniform distribution of the input key space around the ring. The gist, as I understand it, is that Vnodes divide up the ranges many more times than you have machines. So if you have 10 machines you might have 100 Vnodes, and an individual machine's Vnodes would be scattered randomly around the ring.
Now my question is: why is this extra Vnode step required? Hash functions are supposed to provide a uniform distribution of their output, so it would seem this is unneeded. According to this answer, even the modulo of a hash function is still uniformly distributed.
Imo, the missing piece of key information with most explanations of consistent hashing is that they don't detail the part about "multiple hash functions."
For whatever reason, most "consistent hashing for dummies" articles gloss over the implementation detail that makes the virtual nodes work with random distribution.
Before talking about how it does work, let me clarify the above with an example of how it does not work.
How it does not work
A naive implementation of vnodes looks like this:
(diagram omitted: a naive vnode ring in which each physical node's vnodes appear in the same repeating order)
This is naive because you'll notice that, for example, the green vnode always precedes the blue vnode. This means that if the green vnode goes offline then it will be replaced solely by the blue vnode, which defeats the entire purpose of moving from single-token nodes to distributed virtual nodes.
The article quickly mentions that practically, Vnodes are randomly distributed across the cluster. It then shows a separate picture indicating this but without explaining the mechanics of how this is achieved.
How it does work
Random distribution of vnodes is achieved via the use of multiple, unique hash functions. These multiple functions are where the random distribution comes from.
This makes the implementation look roughly like this:
A) Ring Formation
You have a ring consisting of n physical nodes via physical_nodes = ['192.168.1.1', '192.168.1.2', '192.168.1.3', '192.168.1.4']; (think of this as B/R/P/G in the prior picture's left-side)
You decide to distribute each physical node into k "virtual slices," i.e. a single physical node is sliced into k pieces
In this example we use k = 4, but in practice we should use k ≈ log2(num_items) to obtain reasonably balanced loads when storing a total of num_items in the entire datastore (e.g. for a million items, k ≈ 20)
This means that num_virtual_nodes == n * k; (this corresponds to the 16 pieces in the prior picture's right-side)
Assign a unique hashing algorithm to each of the k slices via hash_funcs = [md5, sha, crc, etc]
(You can also use a single function that is recursively called k times)
Divvy up the ring as follows:
virtual_physical_map = {}   # vnode position on the ring -> owning physical node
virtual_node_ids = []       # sorted vnode positions, i.e. the ring itself

for hash_func in hash_funcs:
    for physical_node in physical_nodes:
        # Each (hash function, physical node) combination yields one vnode.
        virtual_hash = hash_func(physical_node)
        virtual_node_ids.append(virtual_hash)
        virtual_physical_map[virtual_hash] = physical_node

virtual_node_ids.sort()
You now have a ring composed of n * k virtual nodes, whose positions are scattered randomly around the ring and which map back to the n physical nodes.
B) Partition Request Flow
A partition-request is made with a provided key_tuple to key off of
The key_tuple is hashed to get key_hash
Find the next clockwise vnode via virtual_node = binary_search(key_hash, virtual_node_ids)
Lookup the real node via physical_id = virtual_physical_map[virtual_node]
Page 6 of this Stanford Lecture was very helpful to me in understanding this.
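Putting A) and B) together, here is a small self-contained sketch of the flow above. The node addresses are the ones from the example; using k salted variants of SHA-256 in place of distinct hash functions (the "single function called k times" variant mentioned above), and bisect for the clockwise search, are illustrative choices rather than a fixed recipe:

import hashlib
from bisect import bisect_left

physical_nodes = ['192.168.1.1', '192.168.1.2', '192.168.1.3', '192.168.1.4']

# k = 4 hash functions: salted variants of one digest, one per virtual slice.
hash_funcs = [
    (lambda s, i=i: int(hashlib.sha256(f"{i}:{s}".encode()).hexdigest(), 16))
    for i in range(4)
]

# A) Ring formation: n * k virtual nodes scattered around the ring.
virtual_physical_map = {}
for hash_func in hash_funcs:
    for physical_node in physical_nodes:
        virtual_hash = hash_func(physical_node)
        virtual_physical_map[virtual_hash] = physical_node
virtual_node_ids = sorted(virtual_physical_map)

# B) Partition request flow: hash the key, walk clockwise to the next vnode.
def lookup(key_tuple):
    key_hash = int(hashlib.sha256(repr(key_tuple).encode()).hexdigest(), 16)
    idx = bisect_left(virtual_node_ids, key_hash) % len(virtual_node_ids)
    return virtual_physical_map[virtual_node_ids[idx]]

print(lookup(("user", 42)))  # resolves to one of the four physical node addresses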
The net effect is that the distribution of vnodes across the ring looks like this:
(diagram omitted: each physical node's vnodes scattered randomly around the ring)
First, the random position assignment of each node on the ring leads to non-uniform data and load distribution.
Good hash functions provide a uniform distribution, but the inputs also have to be sufficiently numerous for the outputs to appear spread out. The keys are numerous, but the servers aren't. So a million keys that are hashed and taken modulo 360 will be evenly distributed around the ring, but if you only use, say, 3 servers S1 through S3 to hold the key-value pairs, there is no guarantee that they will be hashed (with the same hash function used for the keys) uniformly on the ring at positions 0, 120 and 240. S1 might hash to 10, S2 to 12 and S3 to 50, so S2 will hold far fewer KV pairs than the other two. By having virtual servers, you increase the chances of the servers being spread uniformly around the ring.
The other benefit is the even re-distribution of keys when a server is added or removed as mentioned in the doc.
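To make the imbalance concrete, a rough simulation (server names and counts are illustrative, not measurements of any real system) can hash 100,000 keys onto a ring owned by 3 servers, once with a single ring position per server and once with 100 virtual positions per server, and compare the per-server key counts:

import hashlib
from bisect import bisect_left
from collections import Counter

def ring_position(label):
    # Use the full md5 range rather than "mod 360" so positions rarely collide.
    return int(hashlib.md5(label.encode()).hexdigest(), 16)

def assign(keys, positions):  # positions: list of (ring_position, server)
    ring = sorted(positions)
    counts = Counter()
    for key in keys:
        idx = bisect_left(ring, (ring_position(key),)) % len(ring)
        counts[ring[idx][1]] += 1
    return counts

keys = [f"key-{i}" for i in range(100_000)]
servers = ["S1", "S2", "S3"]

single = [(ring_position(s), s) for s in servers]  # one ring position per server
virtual = [(ring_position(f"{s}#{v}"), s) for s in servers for v in range(100)]  # 100 vnodes each

print(assign(keys, single))   # with only 3 positions the split is usually far from even
print(assign(keys, virtual))  # with 300 positions each server ends up near a third of the keys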

Determine number of clusters for different datasets

I performed a clustering analysis of the media usage of different users in order to find groups that use a specific set of media (e.g. group 1 uses media A, B and C and group 2 uses media B, C and D). Then I divided the dataset into different groups, since each user belongs to a specific group (as a consequence the original dataset and the new datasets have different sizes). Within these groups I would like to cluster again to see which different media sets are used.
How can I determine the number of clusters to guarantee that the results are comparable?
Thank you in advance!
Don't rely on clustering to be stable.
It's a hypothesis generation tool.
You clustered, and now you have the hypothesis that there are groups ABCD of media usage. You should first evaluate if this hypothesis is adequate. Now what you want to do in your next step is to assign the labels to subsets of the data. First of all, you should be able to simply subset this from the previous labels. But if this really is different data, you can label new data, for example using the most similar record (nearest neighbor classification). But that is classification now, because your classes are fixed.
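A minimal sketch of that last step, assuming the original clustering produced labels for a feature matrix X and the new rows X_new should inherit the label of their most similar original record (1-nearest-neighbour classification with scikit-learn; the data here is random placeholder data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 4))      # original media-usage features (placeholder)
X_new = rng.random((50, 4))   # new / subset data that should receive labels

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Treat the cluster labels as fixed classes and classify each new row
# by its single most similar original record.
knn = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
new_labels = knn.predict(X_new)
print(new_labels[:10])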

How to cluster sets (users/documents) with distributed MinHash using the banding technique?

I have a big doubt about the way I should cluster sets using MinHash together with the banding technique.
I assume everyone reading has a good knowledge of MinHash so I won't define most of the terms I'm using.
My goal is to use MinHash to cluster users according to the similarity of their signatures. In a local, non-banded setting this would be trivial: if their signature hash is the same, they go in the same cluster.
If we split signatures into bands and process them independently, I can treat a band as I said before and generate a group of clusters for every band. My question is: how should I aggregate these clusters? Just merge them if they have at least one element in common? Or should I do something different?
Thanks
MinHash is not really meant as a standalone clustering algorithm. It is meant as a candidate filter for near-duplicate detection.
When looking for similar documents, you compute the minhashes to retrieve candidates. You then still need to check these candidates - they could be false positives!
The more signatures agree, the more likely they really match.
So consider the near-duplicate scenario again: if a is a near duplicate of b and b is a near duplicate of c, then a should also be a near duplicate of c. If this holds, you can throw all these matches (after verification) together. If it doesn't, consider a hierarchical-clustering-like strategy to merge (or not merge) candidates.
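As a rough sketch of that flow, assuming each user already has a MinHash signature stored as a list of integers: collect candidate pairs from the bands, then verify each candidate by the fraction of signature positions that agree (an estimate of Jaccard similarity; ideally you would re-check against the original sets) before merging anything:

from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, bands, rows_per_band):
    # signatures: {user_id: [int, ...]} with len(signature) == bands * rows_per_band
    buckets = defaultdict(set)
    for user, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows_per_band:(b + 1) * rows_per_band])
            buckets[(b, band)].add(user)
    pairs = set()
    for users in buckets.values():
        pairs.update(combinations(sorted(users), 2))
    return pairs

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def verified_matches(signatures, bands, rows_per_band, threshold):
    return [(u, v) for u, v in candidate_pairs(signatures, bands, rows_per_band)
            if estimated_jaccard(signatures[u], signatures[v]) >= threshold]

The verified pairs can then be merged transitively if near-duplication really is transitive for your data, or fed into a hierarchical merging step as described above.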

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use access logging data from a hospital about users accessing patient information and try to detect abnormal access behaviors. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip code) and a patient (e.g. age, sex, zip code). There are about 13 - 15 variables under consideration.
I was using R before and now I am using Python. I am able to use either depending on any suitable tools/libraries you guys suggest.
Before I ask any question, I do want to mention that a lot of the data fields have undergone an anonymization process when handed to me, as required in the healthcare industry for the protection of personal information. Specifically, a lot of VARCHAR values are turned into random integer values, only maintaining referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how clustering would work with this type of data.
I've read that one could expand the categorical data and let each category in a variable be either 0 or 1 in order to do the clustering, but then how would R/Python handle such high-dimensional data for me? (Simply expanding employee role would bring in ~100 more variables.)
How would the result of clustering be interpreted?
Using a clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I supposed to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more, or whether the proximity of data points still indicates similar behavior. Does expanding each category into a dummy column with true/false values help? What is the distance then?
Faced with the challenges of cluster analysis, I also started to try slicing the data up and just looking at two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and use the quartiles and inter-quartile range to define outliers. For categorical variables, for instance employee role and the types of events triggered, I would just look at the frequency of each event.
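(For concreteness, that slicing approach amounts to roughly the following; the file and column names are hypothetical:)

import pandas as pd

logs = pd.read_csv("access_log.csv")   # hypothetical log extract

# Slice: ages of patients accessed by one employee role, outliers via the 1.5*IQR rule.
ages = logs.loc[logs["employee_role"] == "nurse", "patient_age"]
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = logs[(logs["employee_role"] == "nurse")
                & ~logs["patient_age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(outliers.head())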
Can someone explain to me the problem of using quartiles with data that's not normally distributed? And what would be the remedy for this?
And in the end, which of the two approaches (or some other approaches) would you suggest? And what's the best way to use such an approach?
Thanks a lot.
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
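As a rough sketch of that combination (the column names, random placeholder data and choice of k are illustrative): compute a Gower-style distance matrix by hand with numpy/pandas, then score each row by its distance to its k-th nearest neighbour, so that rows far from everything else float to the top:

import numpy as np
import pandas as pd

def gower_matrix(df, numeric_cols, categorical_cols):
    n = len(df)
    dist = np.zeros((n, n))
    for col in numeric_cols:
        vals = df[col].to_numpy(dtype=float)
        col_range = vals.max() - vals.min() or 1.0        # avoid division by zero
        dist += np.abs(vals[:, None] - vals[None, :]) / col_range
    for col in categorical_cols:
        vals = df[col].to_numpy()
        dist += (vals[:, None] != vals[None, :]).astype(float)  # simple mismatch
    return dist / (len(numeric_cols) + len(categorical_cols))

def knn_outlier_scores(dist, k=5):
    # Score = distance to the k-th nearest neighbour; larger means more isolated.
    return np.sort(dist, axis=1)[:, k]

df = pd.DataFrame({
    "patient_age": np.random.randint(0, 90, 300),
    "employee_role": np.random.choice(["nurse", "doctor", "admin"], 300),
    "department": np.random.choice(list("ABCDE"), 300),
})
scores = knn_outlier_scores(gower_matrix(df, ["patient_age"], ["employee_role", "department"]))
print(df.assign(score=scores).nlargest(5, "score"))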
You can use the k-prototypes algorithm for mixed numeric and categorical attributes.
Here you can find a Python implementation.
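If you go the k-prototypes route, a sketch along these lines (assuming the kmodes package and its KPrototypes interface, which takes the indices of the categorical columns; the data here is random placeholder data) could be a starting point, with unusually small clusters serving as a first hint of anomalous access patterns:

import numpy as np
from kmodes.kprototypes import KPrototypes

# Mixed-type matrix: column 0 numeric (age), columns 1-2 categorical (role, department).
X = np.empty((300, 3), dtype=object)
X[:, 0] = np.random.randint(0, 90, 300).astype(float)
X[:, 1] = np.random.choice(["nurse", "doctor", "admin"], 300)
X[:, 2] = np.random.choice(list("ABCDE"), 300)

kp = KPrototypes(n_clusters=4, init="Cao", random_state=0)
clusters = kp.fit_predict(X, categorical=[1, 2])

# Cluster sizes: very small clusters are worth inspecting first.
print(np.bincount(clusters))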

Clustering/Nearest Neighbor

I have thousands to tens of thousands of data points (x, y) coming from 5 to 6 different sources. I need to uniquely group them based on a distance criterion, in such a way that each formed group contains exactly one input from each source and all members of a group are within a certain distance d of each other. The groups formed should be the best possible match.
Is this a combination of clustering and nearest neighbor?
What algorithms would you recommend?
Are there any open-source implementations available?
I see many references to KD-tree implementations, k-clustering, etc., but I am not sure how to tailor them to this specific need.