Nearest Neigborood using a confidence region - cluster-analysis

I have more than 1M data points and 32 of them (Orange in the pic) are my true class.
I would like to find similar blue points to the orange ones.
Feature vectors are just embeddings.
The approach that I took is to build a pseudo 95 confidence region and then flag the points within that area as my true label.
I think I cannot use a KNN algorithm for the following reasons:
I only know beforehand what points belong to the positive class.
KNN would be highly overfitted as I only have 32 positive data points over more than 1M dat points.
Is there any other algorithm or approach that suits better this problem?

Clustering very large data sets tend to grind to a halt. Here's a crazy idea. Can you take a random sample of the data set and work with that? If the selection process is totally random, it's just a subset of your full data set, and the smaller piece should be very representative of the full thing. It should be as simple as this.
subset = df.sample(frac=0.5)
See this link for more info.
https://towardsdatascience.com/how-to-sample-a-dataframe-in-python-pandas-d18a3187139b

Related

Looking for a suggested Clustering technique

I have a series (let's say 1000) of images of a biological sample...living cells. Over this series, the data for each pixel will describe a time variant "wave", if you will, giving the measure of light intensity vs time. After performing an FFT for this wave, I'll have the frequency content and phase for each pixel.
My goal is to be able to find all the pixels that are measuring a single cell, and was wondering if some sort of clustering technique would give me what I'm looking for. After some research (I know almost nothing of cluster analysis) looking at KMeans, DBSCAN, and a few others, I'm unsure how to proceed.
Here's my criteria:
a cluster should consist of connected pixels, with a maximum size of
around 9-12 pixels (this is defined by the actual size of the cell in
the field of view). Putting more pixels in a cluster likely means
that the cluster contains more than one cell, and I'd prefer each
cluster to represent a single cell.
the cells are signalling (glowing) with some frequency/phase. These are not necessarily in sync, so I think that this might be useful in segregating the cells/clusters.
there is an unknown number of cells in each image, so an unknown number of clusters.
the images are segmented into smaller, sub-images for analysis (the reason for this is not relevant here). These sub-images are to be analyzed separately for clusters. The sub-images are about 100 x 100 pixels.
Any suggestions would be greatly appreciated. I'm just looking for help getting pointed in the right direction.
Probably the most flexible is the classic old hierarchical agglomerative clustering (HAC). For some reason, people always overlook this powerful method, and prefer the much more limited kmeans.
HAC is very nice to parameterize. It needs a distance or similarity (little requirements here - probably should be symmetric, but no triangle inequality necessary). And with the linkage you can control the cluster shape or diameters nicely. For example, with complete linkage you can control the maximum diameter of a cluster. This is probably useful here, and my suggestion.
The main drawbacks of HAC are (1) scalability: at 50.000 instances it will be slow and use too much memory, and of course that (2) you need to know what you want to do: you need to choose distance, linkage, and cut the dendrogram. With k-means, you only need to choose k to get a (bad) result.
DBSCAN is a great algorithm, but in your case it is likely to form clusters with multiple cells. So I'd rather try OPTICS instead which may be able to discover substructures where DBSCAN only sees a large blob.

3D SIFT for human activity classification in videos. NOT GETTING GOOD ACCURACY.

I am trying to classify human activities in videos(six classes and almost 100 videos per class, 6*100=600 videos). I am using 3D SIFT(both xy and t scale=1) from UCF.
for f= 1:20
f
offset = 0;
c=strcat('running',num2str(f),'.mat');
load(c)
pix=video3Dm;
% Generate descriptors at locations given by subs matrix
for i=1:100
reRun = 1;
while reRun == 1
loc = subs(i+offset,:);
fprintf(1,'Calculating keypoint at location (%d, %d, %d)\n',loc);
% Create a 3DSIFT descriptor at the given location
[keys{i} reRun] = Create_Descriptor(pix,1,1,loc(1),loc(2),loc(3));
if reRun == 1
offset = offset + 1;
end
end
end
fprintf(1,'\nFinished...\n%d points thrown out do to poor descriptive ability.\n',offset);
for t1=1:20
des(t1+((f-1)*100),:)=keys{1,t1}.ivec;
end
f
end
My approach is to first get 50 descriptors(of 640 dimension) for one video, and then perform bag of words with all descriptors(on 50*600= 30000 descriptors). After performing Kmeans(with 1000 k value)
idx1000=kmeans(double(total_des),1000);
I am getting 30k of length index vector. Then I am creating histogram signature of each video based on their index values in clusters. Then perform svmtrain(sum in matlab) on signetures(dim-600*1000).
Some potential problems-
1-I am generating random 300 points in 3D to calculate 50 descriptors on any 50 points from those points 300 points.
2- xy, and time scale values, by default they are "1".
3-Cluster numbers, I am not sure that k=1000 is enough for 30000x640 data.
4-svmtrain, I am using this matlab library.
NOTE: Everything is on MATLAB.
Your basic setup seems correct especially given that you are getting 85-95% accuracy. Now, it's just a matter of tuning your procedure. Unfortunately, there is no way to do this other than testing a variety of parameters examining the results and repeating. I going to break this answer into two parts. Advice about bag of words features, and advice about SVM classifiers.
Tuning Bag of Words Features
You are using 50 3D SIFT Features per video from randomly selected points with a vocabulary of 1000 visual words. As you've already mentioned, the size of the vocabulary is one parameter you can adjust. So is the number of descriptors per video.
Let's say that each video is 60 frames long, (at 30 fps only 2 sec, but let's assume you are sampling at 1fps for a 1 minute video). That means you are capturing less than one descriptor per frame. That seems very low to me even with 3D descriptors especially if the locations are randomly chosen.
I would manually examine the points for which you are generating features. Do they appear be well distributed in both space and time? Are you capturing too much background? Ask yourself, would I be able to distinguish between actions given these features?
If you find that many of the selected points are uninformative, increasing the number of points may help. The kmeans clustering can make a few groups for uninformative outliers, and more points means you hopefully capture a few more informative points. You can also try other methods for selecting points. For example, you could use corner points.
You can also manually examine the points that are clustered together. What sorts of structures do the groups have in common? Are the clusters too mixed? That's usually a sign that you need a larger vocabulary.
Tuning SVMs
Using the Matlab SVM implementation or the Libsvm implementation should not make a difference. They are both the same method and have similar tuning options.
First off, you should really be using cross-validation to tune the SVM to avoid overfitting on your test set.
The most powerful parameter for the SVM is the kernel choice. In Matlab, there are five built in kernel options, and you can also define your own. The kernels also have parameters of their own. For example, the gaussian kernel has a scaling factor, sigma. Typically, you start off with a simple kernel and compare to more complex kernels. For example, start with linear, then test quadratic, cubic and gaussian. To compare, you can simply look at your mean cross-validation accuracy.
At this point, the last option is to look at individual instances that are misclassified and try to identify reasons that they may be more difficult than others. Are there commonalities such as occlusion? Also look directly at the visual words that were selected for these instances. You may find something you overlooked when you were tuning your features.
Good luck!

When to use k means clustering algorithm?

Can I use k-means algorithm for a single attribute?
Is there any relationship between the attributes and the number of clusters?
I have one attribute's performance, and I want to classify the data into 3 clusters: poor, medium, and good.
Is it possible to create 3 clusters with one attribute?
K-Means is useful when you have an idea of how many clusters actually exists in your space. Its main benefit is its speed. There is a relationship between attributes and the number of observations in your dataset.
Sometimes a dataset can suffer from The Curse of Dimensionality where your number of variables/attributes is much greater than your number of observations. Basically, in high dimensional spaces with few observations, it becomes difficult to separate observations in hyper dimensions.
You can certainly have three clusters with one attribute. Consider the quantitative attribute in which you have 7 observations
1
2
100
101
500
499
501
Notice there are three clusters in this sample centered: 1.5, 100.5, and 500.
If you have one dimensional data, search stackoverflow for better approaches than k-means.
K-means and other clustering algorithms shine when you have multivariate data. They will "work" with 1-dimensional data, but they are not very smart anymore.
One-dimensional data is ordered. If you sort your data (or it even is already sorted), it can be processed much more efficiently than with k-means. Complexity of k-means is "just" O(n*k*i), but if your data is sorted and 1-dimensional you can actually improve k-means to O(k*i). Sorting comes at a cost, but there are very good sort implementations everywhere...
Plus, for 1-dimensional data there is a lot of statistics you can use that are not very well researched or tractable on higher dimensions. One statistic you really should try is kernel density estimation. Maybe also try Jenks Natural Breaks Optimization.
However, if you want to just split your data into poor/medium/high, why don't you just use two thresholds?
As others have answered already, k-means requires prior information about the count of clusters. This may appear to be not very helpful at the start. But, I will cite the following scenario which I worked with and found to be very helpful.
Color segmentation
Think of a picture with 3 channels of information. (Red, Green Blue) You want to quantize the colors into 20 different bands for the purpose of dimensional reduction. We call this as vector quantization.
Every pixel is a 3 dimensional vector with Red, Green and Blue components. If the image is 100 pixels by 100 pixels then you have 10,000 vectors.
R,G,B
128,100,20
120,9,30
255,255,255
128,100,20
120,9,30
.
.
.
Depending on the type of analysis you intend to perform, you may not need all the R,G,B values. It might be simpler to deal with an ordinal representation.
In the above example, the RGB values might be assigned a flat integral representation
R,G,B
128,100,20 => 1
120,9,30 => 2
255,255,255=> 3
128,100,20 => 1
120,9,30 => 2
You run the k-Means algorithm on these 10,000 vectors and specify 20 clusters. Result - you have reduced your image colors to 20 broad buckets. Obviously some information is lost. However, the intuition for this loss being acceptable is that when the human eyes is gazing out over a patch of green meadow, we are unlikely to register all the 16 million RGB colours.
YouTube video
https://www.youtube.com/watch?v=yR7k19YBqiw
I have embedded key pictures from this video for your understanding. Attention! I am not the author of this video.
Original image
After segmentation using K means
Yes it is possible to use clustering with single attribute.
No there is no known relation between number of cluster and the attributes. However there have been some study that suggest taking number of clusters (k)=n\sqrt{2}, where n is the total number of items. This is just one study, different study have suggested different cluster numbers. The best way to determine cluster number is to select that cluster number that minimizes intra-cluster distance and maximizes inter-cluster distance. Also having background knowledge is important.
The problem you are looking with performance attribute is more a classification problem than a clustering problem
Difference between classification and clustering in data mining?
With only one attribute, you don't need to do k-means. First, I would like to know if your attribute is numerical or categorical.
If it's numerical, it would be easier to set up two thresholds. And if it's categorical, things are getting much easier. Just specify which classes belong to poor, medium or good. Then simple data frame operations would be working.
Feel free to send me comments if you are still confused.
Rowen

Clustering on non-numeric dimensions

I recently started working on clustering and k-means algorithm and was trying to come up with a good use case and solve it.
I have the following data about the items sold in different cities.
Item City
Item1 New York
Item2 Charlotte
Item1 San Francisco
...
I would like to cluster the data based on variables city and item to find groups of cities that might have similar patterns for the items sold.The problem is the k-means I use do not accept non-numeric input. Any idea how should I proceed with this to find a meaningful solution.
Thanks
SV
Clustering requires a distance definition. A cluster is only a cluster if the items are "closer" according to some distance function. The closer they are, the more likely they belong to the same cluster.
In your case, you can try to cluster based on various data related to the cities, like their geographical coordinates, or demographic informations, and see if the clusters overlap in the various cases !
In order for k-means to produce usable results, the means must be meaningful.
Even if you would e.g. use binary vectors, k-means on these would not make a lot of sense IMHO.
Probably the best use case to get started with k-means is color quantization. Take a picture, and use the RGB values of every pixel as 3d vectors. Then run k-means with k as the desired number of colors. The color centers are your final palette, and every pixel will be mapped to the closest center for color reduction.
The reason why this works well with k-means are twofold:
the mean actually makes sense for finding the mean color of multiple pixels
the axes R, G and B have a similar meaning and scale, so there is no bias
If you want to step beyond, try to do the same e.g. in HSB space. And you'll run into difficulties if you want it to be really good. Because the hue value is cyclic, which is inconcistent with the mean. Assuming the hue is on 0-360 degrees, then the "mean" hue of "1" and "359" is not 180 degrees, but 0. So on this data, k-means results will be suboptimal.
See e.g. https://en.wikipedia.org/wiki/Color_quantization for details as well as the two dozen k-means questions here with respect to sparse and binary data.
You may still need to abstractly represent your data in numerical form. This May Help
http://www.analyticbridge.com/forum/topics/clustering-with-non-numeric?commentId=2004291%3AComment%3A40805
Try to re-analyze the problem again and Understand if there is any relationship that you can take advantage of and represent in numerical form.
I worked on a Project where I had to represent Colors by their RGB values. It worked preety good.
Hope this helps

Measuring density for three dimensional data (in Matlab)

I have a dataset consisting of a large collection of points in three dimensional euclidian space. In this collection of points, i am trying to find the point that is nearest to the area with the highest density of points.
So my problem consists of two steps:
1: Determine where density of the distribution of points is at its highest
2: Determine which point is nearest to the point found in 1
Point 2 i can manage, but i'm not sure how to solve point 1. I know there are a lot of functions for density estimation in Matlab, but i'm not sure which one would be the most suitable, or straightforward to use.
Does anyone know?
My command of statistics is a little bit rusty, but as far as i can tell, this type of problem calls for multivariate analysis. Someone suggested i use multivariate kernel density estimation, but i'm not really sure if that's the best solution.
Density is a measure of mass per unit volume. On the assumption that your points all have the same mass then you are, I suppose, trying to measure the number of points per unit volume. So one approach is to divide your subset of Euclidean space into lots of little unit volumes (let's call them voxels like everyone does) and count how many points there are in each one. The voxel with the most points is where the density of points is at its highest. This is, of course, numerical integration of a sort. If your points were distributed according to some analytic function (and I guess they are not) you could solve the problem with pencil and paper.
You might make this approach as sophisticated as you like, perhaps initially dividing your space into 2 x 2 x 2 voxels, then choosing the voxel with most points and sub-dividing that in turn until your criteria are satisfied.
I hope this will get you started on your point 1; you seem to be OK with point 2 so I'll stop now.
EDIT
It looks as if triplequad might be what you are looking for.