Clustering/Nearest Neighbor - cluster-analysis

I have thousands to tens of thousands of data points (x, y) coming from 5 to 6 different sources. I need to group them uniquely under a distance criterion, such that each group contains exactly one point from each source and every point in the group is within a certain distance d. The groups formed should be the best possible match.
Is this a combination of clustering and nearest neighbor?
Which algorithms would you recommend?
Are there any open-source implementations available for it?
I see many references to KD-tree implementations, k-clustering, etc. I am not sure how I can tailor them to this specific need.

Related

Determine number of clusters for different datasets

I performed a clustering analysis of the media usage of different users in order to find groups that use a specific set of media (e.g. group 1 uses media A, B and C and group 2 uses media B, C and D). Then I divided the dataset into different groups, since the users belong to a specific group (as a consequence, the original dataset and the new datasets have different sizes). Within these groups I would like to cluster again to see which media sets are used.
How can I determine the number of clusters to guarantee that the results are comparable?
Thank you in advance!
Don't rely on clustering to be stable.
It's a hypothesis generation tool.
You clustered, and now you have the hypothesis that there are groups ABCD of media usage. You should first evaluate if this hypothesis is adequate. Now what you want to do in your next step is to assign the labels to subsets of the data. First of all, you should be able to simply subset this from the previous labels. But if this really is different data, you can label new data, for example using the most similar record (nearest neighbor classification). But that is classification now, because your classes are fixed.
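The nearest neighbor labeling mentioned above can be sketched as follows. This is a minimal illustration, not a recommended pipeline: the feature vectors (share of usage per medium A, B, C, D), the group names and the function name `nearest_neighbor_label` are all assumptions made up for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor_label(labeled, point):
    """Return the label of the labeled record closest to `point` (1-NN)."""
    _, label = min(labeled, key=lambda record: dist(record[0], point))
    return label


# Hypothetical records: (usage share of media A, B, C, D), label from the
# earlier clustering step. The labels are fixed, so this is classification.
labeled = [
    ((0.5, 0.3, 0.2, 0.0), "group 1"),
    ((0.4, 0.4, 0.2, 0.0), "group 1"),
    ((0.0, 0.3, 0.3, 0.4), "group 2"),
]

new_user = (0.1, 0.2, 0.3, 0.4)
print(nearest_neighbor_label(labeled, new_user))
```

Because the classes come from the first clustering and are not re-estimated, the number of groups stays the same across subsets, which is what makes the results comparable.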

Data mining methods

I would like to know which data mining method (regression, association, clustering or classification) I have to use if I want to find the highest number of reviews among several app categories.
Thanks in advance for any support.
None.
Finding the maximum is not data mining. It's a simple for loop.
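As a sketch of that loop, assuming the data is a list of (category, review count) rows and that "highest number of reviews" means the largest total per category (both assumptions, since the question does not show its data):

```python
def category_with_most_reviews(apps):
    """Sum reviews per category, then scan for the largest total."""
    totals = {}
    for category, reviews in apps:
        totals[category] = totals.get(category, 0) + reviews

    best_category, best_total = None, -1
    for category, total in totals.items():
        if total > best_total:
            best_category, best_total = category, total
    return best_category, best_total


# Hypothetical rows: (app category, number of reviews for one app)
apps = [("Games", 1200), ("Tools", 300), ("Games", 800), ("Health", 1500)]
print(category_with_most_reviews(apps))
```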
Clustering is the task of dividing the population or data points into a number of groups such that data points in the same groups are more similar to other data points in the same group than those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters.

How can I write a logical process for finding the area of a point on a graph?

I have the following graph with 2 different parameters called p and t. 
Their relationship was found experimentally. Knowing (t,p), you can manually find the area (group) number of a point from where it is located. For example, point M(t,p) is located in area 3 and belongs to group 3. However, I would like to write code or a logical procedure that finds the group number automatically: when it reads (t,p), it should locate the point and return the group/area number it belongs to.
Is there any solution in Matlab for this purpose?
If you have the Image Processing Toolbox and your contours are closed, you can use imfill to fill them up (a bit like the bucket tool in Paint) and assign different values to each filled up region. Does this make sense to you? Let me know if you would like more detail.
Marta
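The fill-and-label idea can also be sketched without any toolbox. The following is plain Python, not MATLAB's imfill: it assumes the graph has already been rasterized into a small grid in which 1 marks a contour pixel and 0 marks free space, and the grid, the function name `label_regions` and the label numbering are all hypothetical.

```python
from collections import deque


def label_regions(grid):
    """Flood-fill every enclosed region of 0-cells with its own integer label.

    Contour cells (value 1) keep label 0. Labels are assigned in scan order,
    so they must still be mapped to the area numbers used on the graph.
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1 or labels[r][c] != 0:
                continue  # contour cell, or already filled
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and grid[nr][nc] == 0 and labels[nr][nc] == 0:
                        labels[nr][nc] = next_label
                        queue.append((nr, nc))
    return labels


# Hypothetical rasterized graph: the 1s are contour lines separating three areas.
grid = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
labels = label_regions(grid)
print(labels[1][1], labels[0][4], labels[3][2])
```

Once the label image exists, finding the group of a point reduces to converting (t,p) to a row and column index and reading `labels[row][col]`; that conversion depends on the axis ranges of the graph and is not shown here.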

Unsupervised Anomaly Detection with Mixed Numeric and Categorical Data

I am working on a data analysis project over the summer. The main goal is to use hospital access logs, which record users accessing patient information, to detect abnormal access behavior. Several attributes have been chosen to characterize a user (e.g. employee role, department, zip-code) and a patient (e.g. age, sex, zip-code). There are about 13 - 15 variables under consideration.
I was using R before and now I am using Python. I am able to use either depending on any suitable tools/libraries you guys suggest.
Before I ask any question, I do want to mention that a lot of the data fields have undergone an anonymization process when handed to me, as required in the healthcare industry for the protection of personal information. Specifically, a lot of VARCHAR values are turned into random integer values, only maintaining referential integrity across the dataset.
Questions:
An exact definition of an outlier was not given (it's defined based on the behavior of most of the data, if there's a general behavior) and there's no labeled training set telling me which rows of the dataset are considered abnormal. I believe the project belongs to the area of unsupervised learning so I was looking into clustering.
Since the data is mixed (numeric and categorical), I am not sure how clustering would work with this type of data.
I've read that one could expand the categorical data so that each category of a variable becomes a 0/1 column in order to do the clustering, but then how would R/Python handle such high-dimensional data? (Simply expanding employee role would bring in ~100 more variables.)
How would the result of clustering be interpreted?
With a clustering algorithm, wouldn't the potential "outliers" be grouped into clusters as well? And how am I supposed to detect them?
Also, with categorical data involved, I am not sure how "distance between points" is defined any more and does the proximity of data points indicate similar behaviors? Does expanding each category into a dummy column with true/false values help? What's the distance then?
Faced with the challenges of cluster analysis, I also started to try slicing the data up and just look at two variables at a time. For example, I would look at the age range of patients accessed by a certain employee role, and I use the quartiles and inter-quartile range to define outliers. For categorical variables, for instance, employee role and types of events being triggered, I would just look at the frequency of each event being triggered.
Can someone explain to me the problem with using quartiles on data that is not normally distributed? And what would be the remedy for this?
And in the end, which of the two approaches (or some other approaches) would you suggest? And what's the best way to use such an approach?
Thanks a lot.
You can decide upon a similarity measure for mixed data (e.g. Gower distance).
Then you can use any of the distance-based outlier detection methods.
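A minimal sketch of this combination, assuming the simplest form of Gower distance (range-scaled absolute difference for numeric columns, 0/1 mismatch for categorical columns, averaged with equal weights) and using the distance to the k-th nearest neighbor as the outlier score. The column layout, the sample rows and the function names are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-column dissimilarity between two mixed-type rows.

    `numeric_ranges` maps the index of each numeric column to its value range;
    every other column is treated as categorical.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if i in numeric_ranges:
            value_range = numeric_ranges[i]
            total += abs(x - y) / value_range if value_range else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)


def knn_outlier_scores(rows, numeric_columns, k=2):
    """Score each row by its Gower distance to its k-th nearest neighbor."""
    numeric_ranges = {
        i: max(row[i] for row in rows) - min(row[i] for row in rows)
        for i in numeric_columns
    }
    scores = []
    for i, row in enumerate(rows):
        distances = sorted(
            gower_distance(row, other, numeric_ranges)
            for j, other in enumerate(rows)
            if j != i
        )
        scores.append(distances[k - 1])
    return scores


# Hypothetical access records: (employee role, department, patient age)
rows = [
    ("nurse", "cardiology", 64),
    ("nurse", "cardiology", 70),
    ("nurse", "cardiology", 58),
    ("doctor", "cardiology", 66),
    ("clerk", "billing", 5),
]
scores = knn_outlier_scores(rows, numeric_columns=[2], k=2)
print(scores.index(max(scores)))  # index of the most isolated row
```

Anonymized fields that were turned into random integers should be left out of `numeric_columns`, so that they are compared by equality rather than by magnitude. The pairwise loop is quadratic in the number of rows, so this form only suits small samples.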
You can use the k-prototypes algorithm for mixed numeric and categorical attributes.
Here you can find a Python implementation.

Clustering\Grouping Challenge - clustering pairs into groups

I have a clustering challenge ...
I have many pairs of data (e.g. A<-->B, C<-->D, E<-->F, A<-->F and so on)
I need to group\cluster them into N groups, e.g. Group#1: A,B,F Group #2: C,D.
The clustering shall be done using the given pairs association (i.e. A and B are paired)
Any idea? I'm rather sure there are algorithms for that, but not sure how to look for them.
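One common reading of this problem is finding the connected components of a graph whose edges are the given pairs. Below is a minimal union-find sketch under that assumption; the function name `group_pairs` is made up for the example. Note that under this reading the pairs E<-->F and A<-->F pull E into the same group as A, B and F.

```python
def group_pairs(pairs):
    """Group items so that any two items linked by a chain of pairs share a group."""
    parent = {}

    def find(x):
        # Follow parent links to the group representative, halving paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


pairs = [("A", "B"), ("C", "D"), ("E", "F"), ("A", "F")]
print(group_pairs(pairs))
```

The number of groups N is an output of this approach rather than an input; if N must be fixed in advance, a graph clustering or partitioning method would be needed instead.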