Cluster rectangles on a grid - cluster-analysis

I'm trying to cluster web page content based on visual proximity.
You can see a visual display of blocks on link below
http://i.stack.imgur.com/qzGKE.png
I tried to use a DBSCAN clustering with sckikit-learn with features below with not much success :
- left X coordinate of block (because content are frequently left aligned)
- right X coordinate of block (because content are frequently right aligned)
- top Y coordinate of block (to further close blocks)
Do you have any idea of better features

Have a look at Generalized DBSCAN (not available in scipy, though).
How about clustering objects together when they overlap or almost overlap (by 1 pixel)?
See: DBSCAN doesn't really use the distance. It is based on a binary "is close enough to" decision only.
Also note that DBSCAN is not restricted to vectors. DBSCAN can work with anything where you can define the "similar enough" predicate for.
So you might not need to "extract features", instead consider when you want two objects to be in the same cluster.

Related

Analyze 3D objects by Voxels

I plan to use OpenVDB to analyze 3D objects/meshes. The objective is:
To detect object surface regions with a certain criterion, like slope
Then manipulate those regions
The manipulation might be adding other 3D objects to those regions, for example
OpenVDB has some tools available:
Conversion Tools
Filters
Topological Operations
Level Set Tools
Morphological Operations
Geometric Transforms
Compositing Tools
...
It is a large set of confusing tools to choose from. Does anybody with OpenVDB experience know:
Is OpenVDB the proper library to achieve my objective
If so, which OpenVDB tool best suits my needs
Answer provided by OpenVDB community:
An important question is what you mean by "3D objects/meshes."
OpenVDB is very good at performing those regions with surfaces by
representing them as signed distance fields. But the word "mesh"
raises some alarm bells that you may want to maintain topology. In
this case another library may be more effective.
It also sounds like you have a problem domain you are trying to
explore. For that, I would not go straight to code but instead
explore solutions using 3d applications first. My own biased first
choice would be Houdini, whose apprentice version you can get for
free. This provides most of the VDB code as separate nodes. So, for
example, you can use a File SOP to load a mesh from disk, a VDB From
Polygons to convert it to a Signed Distance FIeld, and then VDB
Analysis to compute the Gradient. The gradient I think matches what
you are looking for as slope, but it is also possible you are looking
for curvature...
To return to mesh land, you can use a VDB Convert. Finally a ROP
Geometry can save it out.
Attached is a file showing a network to compute an approximate Y-slope
as a volume, apply it back to a mesh, and save to disk.
Attached file

Looking for a suggested Clustering technique

I have a series (let's say 1000) of images of a biological sample...living cells. Over this series, the data for each pixel will describe a time variant "wave", if you will, giving the measure of light intensity vs time. After performing an FFT for this wave, I'll have the frequency content and phase for each pixel.
My goal is to be able to find all the pixels that are measuring a single cell, and was wondering if some sort of clustering technique would give me what I'm looking for. After some research (I know almost nothing of cluster analysis) looking at KMeans, DBSCAN, and a few others, I'm unsure how to proceed.
Here's my criteria:
a cluster should consist of connected pixels, with a maximum size of
around 9-12 pixels (this is defined by the actual size of the cell in
the field of view). Putting more pixels in a cluster likely means
that the cluster contains more than one cell, and I'd prefer each
cluster to represent a single cell.
the cells are signalling (glowing) with some frequency/phase. These are not necessarily in sync, so I think that this might be useful in segregating the cells/clusters.
there is an unknown number of cells in each image, so an unknown number of clusters.
the images are segmented into smaller, sub-images for analysis (the reason for this is not relevant here). These sub-images are to be analyzed separately for clusters. The sub-images are about 100 x 100 pixels.
Any suggestions would be greatly appreciated. I'm just looking for help getting pointed in the right direction.
Probably the most flexible is the classic old hierarchical agglomerative clustering (HAC). For some reason, people always overlook this powerful method, and prefer the much more limited kmeans.
HAC is very nice to parameterize. It needs a distance or similarity (little requirements here - probably should be symmetric, but no triangle inequality necessary). And with the linkage you can control the cluster shape or diameters nicely. For example, with complete linkage you can control the maximum diameter of a cluster. This is probably useful here, and my suggestion.
The main drawbacks of HAC are (1) scalability: at 50.000 instances it will be slow and use too much memory, and of course that (2) you need to know what you want to do: you need to choose distance, linkage, and cut the dendrogram. With k-means, you only need to choose k to get a (bad) result.
DBSCAN is a great algorithm, but in your case it is likely to form clusters with multiple cells. So I'd rather try OPTICS instead which may be able to discover substructures where DBSCAN only sees a large blob.

Clustering in matlab

I have a 3d box with some points in it (1800).
Like this:
Now I have to cluster these points and it can't be done with k-means because you don't now the number of clusters. An other problem is that the box is periodic. So the points at the side top and bottom can belong to eacht other. Like in this image:
The right en left belong to each other.
How can I define these clusters with a specific distance as threshold, and implement that the box is periodic (so when you are one the end of one axis look at the beginning if these distances are below the threshold)?
Kind regards,
Glenn
The Wikipedia article on cluster analysis will answer your question.
Look for density based clustering algorithms, as your data looks very much like the design scenario of density based clustering to me.
Well, first things first, you can indeed use K-Means. Of course you will need to use a cluster validity index (google Silhouette width index, Calinski-Harabasz index, Dunn's index, etc.).
If you really don't want to use K-Means for some other reason, you may wish to use a hierarchical clustering algorithm such as the Ward Method (description in Wikipedia). You won't need to know the number of clusters a priori (however, can you truly claim that you are creating a taxonomy without being able to answer the most basic of questions: how many taxons are there?).
The fact that your box is periodic raises an interesting challenge. My first thought here is that the best way to approach the problem is not by changing the distance measure (which you could do), but by transforming the data (feature extraction).
Your box has 6 sides, but because its periodic its like if it had 3 sides. So, the left side and right side are "the same" (as are the top and bottom, and the front and back).
How about redefining each object over three features? each feature is the distance between the object and one of the "three" sides.
Best of luck!

Python Clustering Algorithms

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily know, and in addition to this, no a priori linking lengths are known (similar to this question).
I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.
Here is an example of what dbscan find:
You can see that there really are two separate populations here, though adjusting the epsilon factor (the max. distance between neighboring clusters parameter), I simply cannot get it to see those two populations of particles.
Is there any other algorithms which would work here? I'm looking for minimal information upfront - in other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.
I've found one that requires NO a priori information/guesses and does very well for what I'm asking it to do. It's called Mean Shift and is located in SciKit-Learn. It's also relatively quick (compared to other algorithms like Affinity Propagation).
Here's an example of what it gives:
I also want to point out that in the documentation is states that it may not scale well.
When using DBSCAN it can be helpful to scale/normalize data or
distances beforehand, so that estimation of epsilon will be relative.
There is a implementation of DBSCAN - I think its the one
Anony-Mousse somewhere denoted as 'floating around' - , which comes
with a epsilon estimator function. It works, as long as its not fed
with large datasets.
There are several incomplete versions of OPTICS at github. Maybe
you can find one to adapt it for your purpose. Still
trying to figure out myself, which effect minPts has, using one and
the same extraction method.
You can try a minimum spanning tree (zahn algorithm) and then remove the longest edge similar to alpha shapes. I used it with a delaunay triangulation and a concave hull:http://www.phpdevpad.de/geofence. You can also try a hierarchical cluster for example clusterfck.
Your plot indicates that you chose the minPts parameter way too small.
Have a look at OPTICS, which does no longer need the epsilon parameter of DBSCAN.

Lucas Kanade Optical Flow, Direction Vector

I am working on optical flow, and based on the lecture notes here and some samples on the Internet, I wrote this Python code.
All code and sample images are there as well. For small displacements of around 4-5 pixels, the direction of vector calculated seems to be fine, but the magnitude of the vector is too small (that's why I had to multiply u,v by 3 before plotting them).
Is this because of the limitation of the algorithm, or error in the code? The lecture note shared above also says that motion needs to be small "u, v are less than 1 pixel", maybe that's why. What is the reason for this limitation?
#belisarius says "LK uses a first order approximation, and so (u,v) should be ideally << 1, if not, higher order terms dominate the behavior and you are toast. ".
A standard conclusion from the optical flow constraint equation (OFCE, slide 5 of your reference), is that "your motion should be less than a pixel, less higher order terms kill you". While technically true, you can overcome this in practice using larger averaging windows. This requires that you do sane statistics, i.e. not pure least square means, as suggested in the slides. Equally fast computations, and far superior results can be achieved by Tikhonov regularization. This necessitates setting a tuning value(the Tikhonov constant). This can be done as a global constant, or letting it be adjusted to local information in the image (such as the Shi-Tomasi confidence, aka structure tensor determinant).
Note that this does not replace the need for multi-scale approaches in order to deal with larger motions. It may extend the range a bit for what any single scale can deal with.
Implementations, visualizations and code is available in tutorial format here, albeit in Matlab not Python.