K-modes clusters evolvings - cluster-analysis

After performing a clustering of a dataset using k-modes, I have to evolve the clusters in time so, is there a way to automatically adjust the centroids as long as the data points changes its property values?.
I mean. I am clustering a big set of data with categorical values. However, these data points change in time (its categorical values) so I want to know if there is any way to make adjustments on the K centroids (or even in the K number) as long as the data points are slightly changing over time. I can recalculate the distance from each data point with the centroids and move the data point to another cluster but this would consider the centroids as fixed and I guess they could also change as data points change.
Re-Clustering is a very heavy task in time so there is a need to make this adjustments in a more efficient way. I am searching on the literature but I havent found any information about it.
Anybody knows if this is possible or any study related to this?

Rather than reclustering, use the previously found centers.
They shouldn't change much, and converge quickly.

Related

Looking for a suggested Clustering technique

I have a series (let's say 1000) of images of a biological sample...living cells. Over this series, the data for each pixel will describe a time variant "wave", if you will, giving the measure of light intensity vs time. After performing an FFT for this wave, I'll have the frequency content and phase for each pixel.
My goal is to be able to find all the pixels that are measuring a single cell, and was wondering if some sort of clustering technique would give me what I'm looking for. After some research (I know almost nothing of cluster analysis) looking at KMeans, DBSCAN, and a few others, I'm unsure how to proceed.
Here's my criteria:
a cluster should consist of connected pixels, with a maximum size of
around 9-12 pixels (this is defined by the actual size of the cell in
the field of view). Putting more pixels in a cluster likely means
that the cluster contains more than one cell, and I'd prefer each
cluster to represent a single cell.
the cells are signalling (glowing) with some frequency/phase. These are not necessarily in sync, so I think that this might be useful in segregating the cells/clusters.
there is an unknown number of cells in each image, so an unknown number of clusters.
the images are segmented into smaller, sub-images for analysis (the reason for this is not relevant here). These sub-images are to be analyzed separately for clusters. The sub-images are about 100 x 100 pixels.
Any suggestions would be greatly appreciated. I'm just looking for help getting pointed in the right direction.
Probably the most flexible is the classic old hierarchical agglomerative clustering (HAC). For some reason, people always overlook this powerful method, and prefer the much more limited kmeans.
HAC is very nice to parameterize. It needs a distance or similarity (little requirements here - probably should be symmetric, but no triangle inequality necessary). And with the linkage you can control the cluster shape or diameters nicely. For example, with complete linkage you can control the maximum diameter of a cluster. This is probably useful here, and my suggestion.
The main drawbacks of HAC are (1) scalability: at 50.000 instances it will be slow and use too much memory, and of course that (2) you need to know what you want to do: you need to choose distance, linkage, and cut the dendrogram. With k-means, you only need to choose k to get a (bad) result.
DBSCAN is a great algorithm, but in your case it is likely to form clusters with multiple cells. So I'd rather try OPTICS instead which may be able to discover substructures where DBSCAN only sees a large blob.

Is there maximum number of noise/ outliers in DBSCAN algorithm?

I did clustering on spatial datasets using DBSCAN algorithm and generating a lot of noise 193000 of 250000 data. is that a reasonable amount?
Depends on your data and problem.
If I generate random coordinates, 100% noise would be appropriate because the data is random noise.
First, to address the question in your title. By making eps
very large, it is easy to get no noise points and all the points are
in one big cluster. By making eps very small, you can easily
make all points be noise points. In general, somewhere in between
is what you are looking for. Your job is to find a value that produces
a meaningful clustering. That is where the remark of
#Anony-Mousse comes into play.
Depends on your data and problem
As he suggested, if you have uniform random data, maybe all
noise is the best answer. If you have Gaussian random data,
maybe one big cluster with a few outliers is good. But this is
supposed to help you understand the structure of your data.
What happens as you change eps? From your current clustering
with many noise points, what happens as you gradually increase eps?
Does it gradually add a few noise points into the existing clusters?
Is there some place where two clusters get merged into one? Is there
someplace that there is a sudden change in the number of clusters?
Also, can you interpret the clusters in terms of your variables?
Perhaps the difference between two clusters is that in one all the
values of some variable are low and in another cluster they are high. Considering whatever problem you are trying to solve,
do the clusters divide the data into meaningful groups? Try to use
the clusterings to find meaning in your data.

Automatically truncating a curve to discard outliers in matlab

I am generation some data whose plots are as shown below
In all the plots i get some outliers at the beginning and at the end. Currently i am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
This is a fairly general problem with lots of approaches, usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. Perhaps it would help to bin the data first.
If your pattern is not so easy to define but you are expecting a linear trend you might fit the data to an appropriate class of curve using fit and then detect outliers as those whose error from the fit exceeds a given threshold.
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).

Clusters based on distance

Here is my problem: I have a list of villages. For each village I computed the path distance between them and prepared a distance matrix. Now I want to identify clusters of villages which are close to each other.
I use Python 2.7 and I already used hierarchical clustering (provided by scypy) to cluster the distance matrix. By looking at it as a human being, I can identify the nearest villages, but I need to automate it. I need to get the elements which belong to each cluster.
I was also wondering how to retrieve the clusters once I had created and cut the dendrogram. Since this is unanswered and may come up for others with a similar question, I'll answer according to what I was looking for, making some assumptions since this is an old question.
The first step is that you need to determine where to cut the dendrogram. You can do this a variety of ways, but I'll assume you already know how to do this, since you're looking at the dendrogram and seem to have satisfied yourself that you have clustered the data. If you don't know where to cut, you could start with something simple like cutting at the max distance. But really, where to cut is a different, very long discussion which I will assume you have figured out how to do (since I had done so at this point in my search).
Now I assume you have a dendrogram, and you know where to cut it, and maybe you even have it plotted with the cut line. But you want to do something more with the clusters, so you need to label the points you clustered. This can be done using the flat cluster (fcluster()) function in scipy.
from scipy.cluster.hierarchy import fcluster
clusters=fcluster(Z,distance,criterion='distance')
print(clusters)
Z is the hierarchical linkage matrix (as from scipy's linkage() function) which I assume you had already created. distance is the distance at which you are cutting the dendrogram (but there are other ways to cut the dendrogram, see source for how to do this with fcluster).
This returns a numpy array denoting which observation is in which cluster. Now you can append this to your data as a new column and go to town (or village) with it.

Mapping Vision Outputs To Neural Network Inputs

I'm fairly new to MATLAB, but have acquainted myself with Simulink and Computer Vision over the past few days. My problem statement involves taking a traffic/highway video input and detecting if an accident has occurred.
I plan to do this by extracting the values of centroid to plot trajectory, velocity difference (between frames) and distance between two vehicles. I can successfully track the centroids, and aim to derive the rest of the features.
What I don't know is how to map these to ANN. I mean, every image has more than one vehicle blobs, which means, there are multiple centroids in a single frame/image. So, how does NN act on multiple inputs (the extracted features per vehicle) simultaneously? I am obviously missing the link. Help me figure it out please.
Also, am I looking at time series data?
I am not exactly sure about your question. The problem can be both time series data and not. You might be able to transform the time series version of the problem, such that it can be solved using ANN, but it is sort of a Maslow's hammer :). Also, Could you rephrase the problem.
As you said, you could give it features from two or three frames and then use the classifier to detect accident or not, but it might be difficult to train such a classifier. The problem is really difficult and the so you might need tons of training samples to get it right, esp really good negative samples (for examples cars travelling close to each other) etc.
There are multiple ways you can try to solve this problem of accident detection. For example : Build a classifier (ANN/SVM etc) to detect accidents without time series data. In which case your input would be accident images and non accident images or some sort of positive and negative samples for training and later images for test. In this specific case, you are not looking at the time series data. But here you might need lots of features to detect the same (this in some sense a single frame version of the problem).
The second method would be to use time series data, in which case you will have to detect the features, track the features (say using Lucas Kanade/Horn and Schunck) and then use the information about velocity and centroid to detect the accident. You might even be able to formulate it for HMMs.