Determine number of clusters for different datasets - cluster-analysis

I performed a cluster analysis of the media usage of different users in order to find groups that use a specific set of media (e.g. group 1 uses media A, B and C, and group 2 uses media B, C and D). I then divided the dataset into these groups, since each user belongs to a specific group (as a consequence, the original dataset and the new datasets have different sizes). Within these groups I would like to cluster again to see which different media sets are used.
How can I determine the number of clusters to guarantee that the results are comparable?
Thank you in advance!

Don't rely on clustering to be stable.
It's a hypothesis generation tool.
You clustered, and now you have the hypothesis that there are groups ABCD of media usage. You should first evaluate whether this hypothesis is adequate. What you want to do in the next step is assign the labels to subsets of the data. First of all, you should be able to simply carry the labels over from the previous clustering when you subset. But if it really is different data, you can label the new data, for example using the most similar record (nearest-neighbor classification). Note that this is classification now, not clustering, because your classes are fixed.
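If you take the nearest-neighbor route, here is a minimal sketch with scikit-learn (the array names, shapes and random data are hypothetical; `X_original` stands for the features that were clustered and `labels` for the resulting cluster assignments):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: media-usage feature vectors that were already clustered,
# plus the cluster labels produced by that first clustering run.
X_original = np.random.rand(200, 4)          # 200 users, 4 media-usage features
labels = np.random.randint(0, 4, size=200)   # stand-in for the original cluster labels

# 1-NN "classification": each new record inherits the label
# of its most similar previously clustered record.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_original, labels)

X_new = np.random.rand(50, 4)                # new/subset data with the same features
new_labels = knn.predict(X_new)
```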

Related

Legal Hierarchical Quorums in Zookeeper

I am trying to understand hierarchical quorums in Zookeeper. I may not understand the example shown in the documentation (here). Are votes [from at least two servers from each of two different groups] enough to form a legal quorum?
In my opinion, the example in the documentation does not gain a majority of all the weight; it only gains 4 ballots, whereas a simple majority quorum would need at least 5 ballots (more than 9/2).
I also read the source code. The algorithm is implemented from line 352 to line 371. ZooKeeper only checks whether each covered group has a majority within itself and whether the number of such groups is larger than half of the total number of groups.
Maybe I have found the answer.
A different construction that uses weights and is useful in wide-area deployments (co-locations) is a hierarchical one. With this construction, we split the servers into disjoint groups and assign weights to processes. To form a quorum, we have to get a hold of enough servers from a majority of groups G, such that for each group g in G, the sum of votes from g is larger than half of the sum of weights in g. Interestingly, this construction enables smaller quorums. If we have, for example, 9 servers, we split them into 3 groups, and assign a weight of 1 to each server, then we are able to form quorums of size 4.
Note that two subsets of processes composed each of a majority of servers from each of a majority of groups necessarily have a non-empty intersection. It is reasonable to expect that a majority of co-locations will have a majority of servers available with high probability.
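To make the rule concrete, here is a small sketch of that quorum check in plain Python (the group/weight/vote structures are made up for illustration; this mirrors the description above, not ZooKeeper's actual implementation):

```python
# Hierarchical quorum check: a vote set is a quorum if, for more than half
# of the groups, the votes received from that group exceed half of the
# group's total weight.

def is_quorum(groups, weights, votes):
    """groups: {group_id: [server_id, ...]}
       weights: {server_id: weight}
       votes: set of server_ids that voted for the proposal"""
    groups_with_majority = 0
    for members in groups.values():
        total = sum(weights[s] for s in members)
        received = sum(weights[s] for s in members if s in votes)
        if received > total / 2:
            groups_with_majority += 1
    return groups_with_majority > len(groups) / 2

# 9 servers in 3 groups of 3, weight 1 each: 2 servers from each of
# 2 groups (4 votes total) already form a legal quorum.
groups = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}
weights = {s: 1 for s in range(1, 10)}
print(is_quorum(groups, weights, {1, 2, 4, 5}))  # True
```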

Building propensity score for a cluster

I am working on an exercise to build an influencer score for each user in my data set. What that means is that a user with higher engagement should get a higher score and vice versa. However, I have many different types of engagement variables and I am not sure which one should be weighted higher.
So I first did a cluster analysis to divide users into different groups based on engagement activity, using 5 different types of engagement. Based on this, I found that one of the clusters has a high level of engagement across all the engagement variables. This is the group I am interested in. However, it is possible that the group size I get is smaller than the number of users I want to use in the future. So I now want to use these clusters to create a propensity score.
E.g. in the cluster analysis, say I get 5 clusters c1, c2, c3, c4, c5, and c5 is my cluster of interest. So I give all users in c5 a value of 1 (= influencer) and all users in c1 to c4 a value of 0 (= not influencer). Now I use this binary variable to build a logistic regression model (using the same engagement variables as used for clustering) to get each user's propensity to be an influencer. This way, I can change the threshold to reduce or increase the number of users I want to select.
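For concreteness, a minimal sketch of this cluster-then-regress pipeline with scikit-learn (the data, the scaling step and the index of the cluster of interest are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical engagement matrix: one row per user, 5 engagement variables.
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.0, size=(1000, 5))
X_scaled = StandardScaler().fit_transform(X)

# Step 1: unsupervised clustering into 5 groups.
clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_scaled)

# Step 2: binary label -- 1 for the cluster of interest, 0 otherwise.
# Here we pretend the high-engagement cluster turned out to be cluster 4.
cluster_of_interest = 4
y = (clusters == cluster_of_interest).astype(int)

# Step 3: logistic regression on the same variables gives a 0-1 propensity.
model = LogisticRegression().fit(X_scaled, y)
propensity = model.predict_proba(X_scaled)[:, 1]

# Vary the threshold to select more or fewer users.
selected = propensity > 0.7
print(selected.sum(), "users selected at threshold 0.7")
```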
Now, the issue I am running into is that one of the engagement variables predicts the influencer label very well, and hence my propensity scores are very close to either 1 or 0, which defeats the purpose of the propensity score in the first place.
So, 2 questions:
1) Is this approach of building an unsupervised classification and then using it to build a supervised classification a sound approach for what I am trying to do?
2) How do I reduce the contribution of the variable that predicts influencer really well, to ensure that I get a much smoother curve instead of values near 0 or 1? I don't want to remove this variable from the model, as it is important from a business perspective.

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use access-log data from a hospital, recording users accessing patient information, and to try to detect abnormal access behavior. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip code) and a patient (e.g. age, sex, zip code). There are about 13-15 variables under consideration.
I was using R before and now I am using Python. I am able to use either depending on any suitable tools/libraries you guys suggest.
Before I ask any questions, I do want to mention that a lot of the data fields underwent an anonymization process before being handed to me, as required in the healthcare industry for the protection of personal information. Specifically, many VARCHAR values were turned into random integer values, maintaining only referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how clustering would work with this type of data.
I've read that one could expand the categorical data, letting each category in a variable be either 0 or 1 in order to do the clustering, but then how would R/Python handle such high-dimensional data for me? (Simply expanding employee role would bring in ~100 more variables.)
How would the result of clustering be interpreted?
Using a clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I supposed to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more, or whether the proximity of data points still indicates similar behavior. Does expanding each category into a dummy column with true/false values help? What is the distance then?
Faced with the challenges of cluster analysis, I also started trying to slice the data up and just look at two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and use the quartiles and inter-quartile range to define outliers. For categorical variables, for instance employee role and the types of events being triggered, I would just look at the frequency of each event being triggered.
Can someone explain to me the problem with using quartiles on data that's not normally distributed? And what would the remedy be?
And in the end, which of the two approaches (or some other approaches) would you suggest? And what's the best way to use such an approach?
Thanks a lot.
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
Alternatively, you can use the k-prototypes algorithm for mixed numeric and categorical attributes.
Here you can find a Python implementation.
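As an illustration of the first suggestion, here is a self-contained sketch of Gower distance plus a simple kNN-distance outlier score (pure NumPy; the toy data and the choice of k are assumptions):

```python
import numpy as np

def gower_matrix(num, cat):
    """Pairwise Gower distances.
    num: (n, p) numeric features; cat: (n, q) categorical codes."""
    # Numeric part: range-normalized Manhattan distance per feature.
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / ranges   # (n, n, p)
    # Categorical part: simple mismatch (0 if equal, 1 otherwise).
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)   # (n, n, q)
    # Gower distance: average over all features.
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)

# Toy mixed data: 2 numeric columns (e.g. patient age, hour of access)
# and 2 categorical columns (e.g. employee role, department), as integer codes.
rng = np.random.default_rng(1)
num = rng.normal(size=(100, 2))
cat = rng.integers(0, 5, size=(100, 2))

D = gower_matrix(num, cat)

# kNN-distance outlier score: distance to the k-th nearest neighbor.
k = 5
knn_dist = np.sort(D, axis=1)[:, k]   # column 0 is the self-distance (0)
outliers = np.argsort(knn_dist)[-5:]  # the 5 most isolated records
print(outliers)
```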

Clustering/Nearest Neighbor

I have thousands to tens of thousands of data points (x, y) coming from 5 to 6 different sources. I need to uniquely group them based on a distance criterion, in such a way that each formed group contains exactly one input from each source and all points in a group are within a certain distance d of each other. The groups formed should be the best possible match.
Is this a combination of clustering and nearest neighbor?
What are the recommendation for the algorithms?
Are there any open source available for it?
I have seen many references mentioning KD-tree implementations, k-clustering, etc., but I am not sure how to tailor them to this specific need.
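For what it's worth, the KD-tree those references mention gives you the within-distance-d lookup as a building block; here is a small sketch with SciPy (toy data and a hypothetical d; forming the actual one-per-source groups on top of this is the harder matching step):

```python
import numpy as np
from scipy.spatial import cKDTree

# Toy data: (x, y) points from 5 sources, with a source id per point.
rng = np.random.default_rng(0)
points = rng.uniform(0, 100, size=(5000, 2))
source = rng.integers(0, 5, size=5000)

tree = cKDTree(points)
d = 2.0  # assumed distance threshold

# For one query point, find all candidates within distance d,
# then bucket them by source.
idx = tree.query_ball_point(points[0], r=d)
by_source = {}
for i in idx:
    by_source.setdefault(int(source[i]), []).append(i)
print(by_source)
```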

Clustering/Grouping Challenge - clustering pairs into groups

I have a clustering challenge ...
I have many pairs of data (e.g. A<-->B, C<-->D, E<-->F, A<-->F and so on).
I need to group/cluster them into N groups, e.g. Group #1: A, B, F; Group #2: C, D.
The clustering should be done using the given pair associations (i.e. A and B are paired).
Any idea? I'm rather sure there are algorithms for that, but not sure how to look for them.
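One common way to read this kind of requirement (an interpretation, since the question does not name it): treat each pair as an edge in a graph; the groups are then the connected components. A minimal sketch using networkx:

```python
import networkx as nx

# Treat each pair as an undirected edge; groups are then the
# connected components of the resulting graph.
pairs = [("A", "B"), ("C", "D"), ("E", "F"), ("A", "F")]

G = nx.Graph()
G.add_edges_from(pairs)

groups = list(nx.connected_components(G))
print(groups)  # e.g. [{'A', 'B', 'E', 'F'}, {'C', 'D'}]
```

Note that with these example pairs, E ends up in the first group because A<-->F and E<-->F chain everything together. If a fixed number of groups N must come out regardless of the associations, that is a different (graph-partitioning) problem.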