How do I visualise clusters of users? - cluster-analysis

I have an application in which users interact with each other. I want to visualize these interactions so that I can determine whether clusters of users exist (within which interactions are more frequent).
I've assigned a 2D point to each user (where each coordinate is between 0 and 1). My idea is that two users' points move closer together when they interact (an "attractive force"), and I just loop through my interaction logs repeatedly.
Of course, I need a "repulsive force" that will push users apart too, otherwise they will all just collapse into a single point.
First I tried monitoring the lowest and highest of each of the XY coordinates and normalizing the positions, but this didn't work: a few users with a small number of interactions stayed at the edges, and the rest all collapsed into the middle.
Does anyone know what equations I should use to move the points, both for the "attractive" force between users when they interact, and a "repulsive" force to stop them all collapsing into a single point?
Edit: In response to a question, I should point out that I'm dealing with about 1 million users, and about 10 million interactions between users. If anyone can recommend a tool that could do this for me, I'm all ears :-)

In the past, when I've tried this kind of thing, I've used a spring model to pull linked nodes together, something like: dx = -k*(x-l). Here dx is the change in position, x is the current separation between the two nodes, l is the desired separation, and k is the spring coefficient. Tweak k until you get a nice balance between spring strength and stability; it will typically be less than 0.1. Having l > 0 ensures that everything doesn't end up in the middle.
In addition to that, a general "repulsive" force between all nodes will spread them out, something like: dx = k / x^2. This grows larger the closer two nodes are; tweak k to get a reasonable effect.
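To make the two formulas above concrete, here is a minimal sketch of one update pass in Python. The function name layout_step, the default constants, and the max_step clamp are my own assumptions for illustration, not part of the answer. The all-pairs repulsion loop is quadratic in the number of nodes, so this is only meant for trying the idea on a small sample; a million users would need some approximation of the repulsion term.

```python
import math

def layout_step(pos, edges, k_spring=0.05, rest_len=0.1, k_repel=1e-4, max_step=0.05):
    """Return new positions after one round of spring and repulsion moves.

    pos maps node -> (x, y); edges is a list of (node_a, node_b) interactions.
    All parameter names and defaults are illustrative assumptions.
    """
    move = {n: [0.0, 0.0] for n in pos}

    def dist(a, b):
        return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])

    def push(a, b, amount):
        # Move a away from b by `amount` (negative pulls a toward b);
        # b receives the opposite move.
        d = dist(a, b)
        if d == 0.0:
            return  # coincident points have no direction; skipped in this sketch
        ux = (pos[a][0] - pos[b][0]) / d
        uy = (pos[a][1] - pos[b][1]) / d
        move[a][0] += amount * ux
        move[a][1] += amount * uy
        move[b][0] -= amount * ux
        move[b][1] -= amount * uy

    nodes = list(pos)
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            d = dist(a, b)
            if d > 0.0:
                push(a, b, k_repel / (d * d))  # repulsion: k / x^2

    for a, b in edges:
        push(a, b, -k_spring * (dist(a, b) - rest_len))  # spring: -k * (x - l)

    new_pos = {}
    for n, (mx, my) in move.items():
        length = math.hypot(mx, my)
        # Clamp the move so very close nodes do not get flung across the layout.
        scale = min(1.0, max_step / length) if length > 0.0 else 0.0
        new_pos[n] = (pos[n][0] + mx * scale, pos[n][1] + my * scale)
    return new_pos
```

Calling layout_step repeatedly on its own output should let linked nodes settle near rest_len apart while unlinked nodes drift away from each other.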

I can recommend some possibilities: first, try log-scaling the interactions or running them through a sigmoidal function to squash the range. This will give you a smoother visual distribution of spacing.
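As a rough illustration of the squashing idea, here are two small Python helpers. The names squash_log and squash_sigmoid, and the midpoint and steepness defaults, are assumptions made up for this sketch; you would tune them to your own interaction counts.

```python
import math

def squash_log(count):
    """Log-scale an interaction count; log1p keeps a count of 0 at weight 0."""
    return math.log1p(count)

def squash_sigmoid(count, midpoint=10.0, steepness=0.3):
    """Map a non-negative count into (0, 1); counts near `midpoint` land near 0.5."""
    return 1.0 / (1.0 + math.exp(-steepness * (count - midpoint)))
```

Either value could then be used as the strength of the attractive force between two users instead of the raw count.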
Independent of this scaling issue: look at some of the rendering strategies in graphviz, particularly the programs "neato" and "fdp". From the man page:
neato draws undirected graphs using ``spring'' models (see Kamada and
Kawai, Information Processing Letters 31:1, April 1989). Input files
must be formatted in the dot attributed graph language. By default,
the output of neato is the input graph with layout coordinates
appended.
fdp draws undirected graphs using a ``spring'' model. It relies on a
force-directed approach in the spirit of Fruchterman and Reingold (cf.
Software-Practice & Experience 21(11), 1991, pp. 1129-1164).
Finally, consider one of the scaling strategies, an attractive force, and some sort of drag coefficient instead of a repulsive force. Actually moving things closer and then possibly farther later on may just get you cyclic behavior.
Consider a model in which everything will collapse eventually, but slowly. Then just run until some condition is met (a node crosses the center of the layout region or some such).
Drag or momentum can just be encoded as a basic resistance to motion and amount to throttling the movements; it can be applied differentially (things can move slower based on how far they've gone, where they are in space, how many other nodes are close, etc.).
Hope this helps.

The spring model is the traditional way to do this: make an attractive force between each node based on the interaction, and a repulsive force between all nodes based on the inverse square of their distance. Then solve, minimizing the energy. You may need some fairly high powered programming to get an efficient solution to this if you have more than a few nodes. Make sure the start positions are random, and run the program several times: a case like this almost always has several local energy minima in it, and you want to make sure you've got a good one.
Also, unless you have only a few nodes, I would do this in 3D. An extra dimension of freedom allows for better solutions, and you should be able to visualize clusters in 3D as well as, if not better than, in 2D.

Related

Is there any regularity-detection tool for regions inside an image?

I'm working in MATLAB on some regions inside an image. I'm at a point in which I would like to be able to separate regions which exhibit some kind of regularity (e.g., being circle-ish or square-ish) from regions which do not resemble any known figure and which for my application are mere noise. I'll illustrate this using a descriptive MS Paint image:
Is there any tool that, most of the time (or even less often; I know this can't be 100/100), will recognize the red thing as being different?
I'll deal with many shapes in a single image, so I don't mind if I carry on some red monsters along the way, as long as the majority of them is kicked out. Of course I know the indices of these regions, so I can manipulate them in MATLAB.
Many algorithms come to mind, e.g., getting the boundary and checking for its regularity/the number of times it changes curvature/..., checking for variations in vertical length through different columns (nearly 0 for the linear feature, really high for the red stuff), ...
However, I was hoping for some help from a tool out there. It doesn't matter if this tool won't cover all cases (for example, if it kicks out circles). I've been very broad to get the maximum number of inputs from you guys: any tool will be inspiring and helpful (and, in any case, we can't expect a perfect answer for the deeper question, recognizing regular shapes, which seems more like an AI field of research). I also think that, while being broad, this is totally non-subjective so should fit in SO. Thank you.
Side note 1: I'll deal mostly with elongated, extended features like the top-right one, so circles are not that relevant.
Side note 2: To be 100% clear, I would need something (be it an already existing tool, or some ideas pointed out by you) that acts on the indices of the shapes, in terms of rows-columns into the original image, or on the boundary of the shape itself.
Side note 3: Apart from tools/suggestions/ideas, you are welcomed to write down some lines of code ;) I'm getting the regions as connected components from bwconncomp.
I had to solve a similar problem recently that involved counting the number of indentations on blobs within an image (basically, the connected components returned by bwconncomp). The method I used was to look at curvature changes along the boundary calculated via the FFT. In your case, the red blobs would have a large number of curvature variations, whereas the black regions would not. It's a pretty easy calculation and relatively fast. The code is on github here:
https://github.com/mjsottile/blobdents
The file of interest is src/countindents.m. A short description of the approach is here:
http://arxiv.org/abs/1501.07692
I went for the easier road as suggested by @Mikhail in comments.
I found out regionprops has a really helpful tool called Solidity. Quoting docs,
Returns a scalar specifying the proportion of the pixels in the convex hull that are also in the region. Computed as Area/ConvexArea.
Convex hull is defined as the smallest convex polygon that can contain the region. So Solidity goes up to 1 if the shape is kind of regular and has no convexity changes; down to 0 for my red shape, which leaves space between itself and the convex polygon.
Of course it never reaches 0; the lowest value should belong to a kind of +-shaped sign.
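To show what the Area/ConvexArea ratio is doing, here is a small pure-Python sketch. It is not the regionprops implementation: the helper names are my own, and as a simplifying assumption it measures the hull as a polygon around the pixel corners rather than counting pixels inside the hull, so the numbers can differ slightly from what MATLAB reports.

```python
def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(poly):
    """Shoelace formula."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def solidity(pixels):
    """pixels: iterable of (row, col) indices belonging to one region."""
    pixels = set(pixels)
    # Treat each pixel as a unit square and take the hull of all its corners.
    corners = [(c + dc, r + dr) for r, c in pixels for dr in (0, 1) for dc in (0, 1)]
    return len(pixels) / polygon_area(convex_hull(corners))
```

A filled 3x3 square gives 1.0 here, while a 3x3 plus sign gives 5/7, so thresholding on this value would separate compact regions from spiky ones.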

Hashing a graph to find duplicates (including rotated and reflected versions)

I am making a game that involves solving a path through graphs. Depending on the size of the graph this can take a little while so I want to cache my results.
This has me looking for an algorithm to hash a graph to find duplicates.
This is straightforward for exact copies of a graph, I simply use the node positions relative to the top corner. It becomes quite a bit more complicated for rotated or even reflected graphs. I suspect this isn't a new problem, but I'm unsure of what the terminology for it is?
My specific case is on a grid, so a node (if present) will always be connected to its four neighbors, north, south, east and west. In my current implementation each node stores an array of its adjacent nodes.
Suggestions for further reading or even complete algorithms are much appreciated.
My current hashing implementation starts at the first found node in the graph, which depends on how I iterate over the playfield, then notes the position of all nodes relative to it. The base graph will have a hash that might be something like: 0:1,0:2,1:2,1:3,-1:1,
I suggest you do this:
Make a function to generate a hash for any graph, position-independent. It sounds like you already have this.
When you first generate the pathfinding solution for a graph, cache it by the hash for that graph...
...Then also generate the 7 other unique forms of that graph (rotated 90deg; rotated 270deg; flipped x; flipped y; flipped x & y; flipped along one diagonal axis; flipped along the other diagonal axis). You can of course generate these using simple vector/matrix transformations. For each of these 7 transformed graphs, you also generate that graph's hash, and cache the same pathfinding solution (which you first apply the same transform to, so the solution maps appropriately to the new graph configuration).
You're done. Later your code will look up the pathfinding solution for a graph, and even if it's an alternate (rotated, flipped) form of the graph you found the earlier solution for, the cache already contains the correct solution.
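Here is a minimal sketch of that caching scheme in Python, assuming the graph is given as a list of (x, y) grid cells and the solution as a list of cells along the path. The names TRANSFORMS, transformed and cache_all_variants are hypothetical, and a frozenset of normalized cells stands in for your position-independent hash.

```python
# The identity plus the 7 other rotations/reflections of the square grid.
TRANSFORMS = [
    lambda x, y: (x, y),     # identity
    lambda x, y: (-y, x),    # rotated 90deg
    lambda x, y: (-x, -y),   # flipped x & y (same as rotated 180deg)
    lambda x, y: (y, -x),    # rotated 270deg
    lambda x, y: (-x, y),    # flipped x
    lambda x, y: (x, -y),    # flipped y
    lambda x, y: (y, x),     # flipped along one diagonal
    lambda x, y: (-y, -x),   # flipped along the other diagonal
]

def transformed(cells, path, t):
    """Apply t to the graph and its path, then shift both so the minimum is (0, 0)."""
    moved = [t(x, y) for x, y in cells]
    min_x = min(x for x, _ in moved)
    min_y = min(y for _, y in moved)
    key = frozenset((x - min_x, y - min_y) for x, y in moved)
    new_path = [(px - min_x, py - min_y) for px, py in (t(x, y) for x, y in path)]
    return key, new_path

def cache_all_variants(cache, cells, path):
    """Store the solved path under every rotated/reflected form of the graph."""
    for t in TRANSFORMS:
        key, new_path = transformed(cells, path, t)
        cache.setdefault(key, new_path)
```

A lookup then only needs to normalize the incoming graph to a frozenset key; symmetric graphs simply produce fewer than 8 distinct keys.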
I spent some time this morning thinking about this and I think this is probably the most optimal solution. But I'll share the other over-analyzed versions of the solution that I was also thinking about...
I was considering the fact that what you really needed was a function that would take a graph G, and return the "canonical version" of G (which I'll call G'), AND the transform matrix required to convert G to G'. (It seemed like you would need the transform so you could apply it to the pathfinding data and get the correct path for G, since you would have just stored the pathfinding data for G'.) You could, of course, look up pathfinding data for G', apply the transform matrix to it, and have your pathfinding solution.
The problem is that I don't think there's any unambiguous and performant way to determine a "canonical version" of G, because it means you have to recognize all 8 variants of G and always pick the same one as G' based on some criteria. I thought I could do something clever by looking at each axis of the graph, counting the number of points along each row/column in that axis, and then rotating/flipping to put the more imbalanced half of the axis always in the top-or-left... in other words, if you pass in "d", "q", "b", "d", "p", etc. shapes, you would always get back the "p" shape (where the imbalance is towards the top-left). This would have the nice property that it should recognize when the graph was symmetrical along a given axis, and not bother to distinguish between the flipped versions on that axis, since they were the same.
So basically I just took the row-by-row/column-by-column point counts, counting the points in each half of the shape, and then rotating/flipping until the count is higher in the top-left. (Note that it doesn't matter that the count would sometimes be the same for different shapes, because all the function was concerned with was transforming the shape into a single canonical version out of all the different possible permutations.)
Where it fell down for me was deciding which axis was which in the canonical case - basically handling the case of whether to invert along the diagonal axis. Once again, for shapes that are symmetrical about a diagonal axis, the function should recognize this and not care; for any other case, it should have a criterion for saying "the axis of the shape that has the property [???] is, in the canonical version, the x axis of the shape, while the other axis will be the y axis". And without this kind of criterion, you can't distinguish two graphs that are flipped about the diagonal axis (e.g. "p" versus "σ"/sigma). The criterion I was trying to use was again "imbalance", but this turned out to be harder and harder to determine, at least the way I was approaching it. (Maybe I should have just applied the technique I was using for the x/y axes to the diagonal axes? I haven't thought through how that would work.) If you wanted to go with such a solution, you'd either need to solve this problem I failed to solve, or else give up on worrying about treating versions that are flipped about the diagonal axis as equivalent.
Although I was trying to focus on solutions that just involved calculating simple sums, I realized that even this kind of summing is going to end up being somewhat expensive to do (especially on large graphs) at runtime in pathfinding code (which needs to be as performant as possible, and which is the real point of your problem). In other words I realized that we were probably both overthinking it. You're much better off just taking a slight hit on the initial caching side and then having lightning-fast lookups based on the graph's position-independent hash, which also seems like a pretty foolproof solution.
Based on the twitter conversation, let me rephrase the problem (I hope I got it right):
How to compare graphs (planar, on a grid) that are treated as invariant under 90deg rotations and reflection. Bonus points if it uses hashes.
I don't have a full answer for you, but a few ideas that might be helpful:
Divide the problem into subproblems that are independently solvable. That would give:
1. How to compare the graphs given the invariance conditions
2. How to transform them into a canonical basis
3. How to hash this canonical basis subject to tradeoffs (speed, size, collisions, ...)
You could try to solve 1 and 2 in a single step. A naive geometric approach could be as follows:
For rotation invariance, you could try to count the edges in each direction and rotate the graph so that the major direction always points to the right. If there is no main direction, you could see the graph as a point cloud of its vertices and use eigenvectors and Principal Component Analysis (PCA) to obtain the main direction and rotate it accordingly.
I don't have a smart solution for the reflection problem. My brute force way would be to just create the reflected graph all the time. Say you have a graph g and the reflected graph r(g). If you want to know if some other graph h == g you have to answer h == g || h == r(g).
Now onto the hashing:
For the hashing you probably have to trade off speed, size and collisions. If you just use the string of edges, you are high on speed and size and low on collisions. If you just take this string and apply some generic string hasher to it, you get different results.
If you use a short hash, with more frequent collisions, you can achieve a rather small cost for comparing non-matching graphs. The cost for matching graphs is a bit higher then, as you have to do a full comparison to see if they actually match.
Hope this makes some kind of sense...
best, Simon
update: another thought on the rotation problem if the edges don't give a clear winner: Compute the center of mass of the vertices and see to which side of the center of the bounding box it falls. Rotate accordingly.

Simulating physics for voxel constructions (Minecraft, Dwarf Fortress, etc)?

I'm hoping to prototype some very basic physics/statics simulations for "voxel-based" games like Minecraft and Dwarf Fortress, so that the game can detect when a player has constructed a structure that should not be able to stand up on its own. Obviously this is a very fuzzy definition -- whether a structure is impossible depends upon a multitude of material and environmental properties -- but the general idea is to motivate players to build structures that resemble the buildings we see in the real world. I'll describe what I mean in a bit more detail below, but I generally want to know if anyone could suggest either a potential approach to the problem or a resource that I could use.
Here's some examples of buildings that could be impossible if the material was not strong enough.
Here's some example situations. My understanding of this subject is not great but bear with me.
If this structure were to be made of concrete with dimensions of, say, 4m by 200m, it would probably not be able to stand up. Because the center of mass is not over its connection to the ground, I think it would either tip over or crack at the base.
The center of gravity of this arch lies between the columns holding it up, but if it was very big and made of a weak, heavy material, it would crumble under its own weight.
This tower has its center of gravity right over its base, but if it is sufficiently tall then it only takes a bit of force for the wind to topple it over.
Now, I expect that a full-scale real-time simulation of these physics isn't really possible... but there's a lot of ways that I could simplify the simulation. For example:
Tests for physics-defying structures could be infrequently and randomly performed, so a bad building doesn't crumble right as soon as it is built, but as much as a few minutes later.
Minecraft and Dwarf Fortress hardly perform rigid- or soft-body physics. For this reason, any piece of a building that is deemed to be physically impossible can simply "pop" into rubble instead of spawning a bunch of accurate physics props.
Have you considered taking an existing 3D physics engine and "rounding off" the orientations of objects? In the case of your first object (the L-shaped thing), you could run a simulation of a continuous, non-voxelized object of similar shape behind the scenes and then monitor that object for orientation changes. In a simple case, if the continuous representation of the object hits the ground, the object in the voxelized gameplay world could move its blocks to the ground.
I don't think there is a feasible way to do this. Minecraft has no notion of physical structure. So you will have to look at each block individually to determine if it should fall (there are other considerations but this is the minimum). You would therefore need a way to distinguish between ground and "not ground". This is a modeling problem first and foremost, not a programming problem (not even simulation design). I think this question is out of scope for SO.
For instance consider the following model, that may give you an indication of the complexities involved:
each block above height = 0 experiences a "down pull" = P, P may be any of the following:
0 if the box is supported by another box
m*g (where m is its mass which depends on material density * voxel volume) otherwise if it is free
F represents some "friction" or "glue" between vertical faces of boxes, it counteracts P.
This friction should have a threshold beyond which it "breaks" and the block then has a net pull downwards.
if m*g < sum F, box stays where it is. Otherwise, box falls.
F depends on the pairs of materials in contact
for n=2, so you can form a line of blocks between two towers
F is what causes the net pull of a box to be larger than m*g. For instance if you have three blocks a-b-c with c being on d, then a pulls on b, so b should be "heavier" than m*g where it contacts c. If this net is > F, then the pair a-b should fall.
You might be able to simulate the above and get interesting results, but you will find it really challenging to handle the case where there are two towers with a line of blocks between them: the towers are coupled together by the line of blocks, and there is no longer a "tip" to the line of blocks. At this stage you might as well get out your physics books to create a system of boxes and springs and come up with equations that you might be able to solve numerically, but in a full 3D system you will have a 3D mesh of springs to navigate iteratively to converge to force values on each box and determine which ones move.
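For what it's worth, the simplest per-block part of the rule above (supported from below, or held by friction on its vertical faces) can be sketched in a few lines of Python. The function name, the uniform mass, and the single side_friction value per face are assumptions for illustration. This is a single local pass: it does not propagate the extra load described for the a-b-c case, and it does not check whether the block underneath is itself stable.

```python
def falling_blocks(blocks, mass=1.0, g=9.81, side_friction=6.0):
    """Return the blocks that fall under a purely local support test.

    blocks is a set of (x, y, z) voxel coordinates with z == 0 on the ground.
    side_friction is the assumed maximum holding force F per touching side face.
    """
    sides = ((1, 0), (-1, 0), (0, 1), (0, -1))
    falling = set()
    for (x, y, z) in blocks:
        if z == 0 or (x, y, z - 1) in blocks:
            continue  # down pull P is 0: resting on the ground or on another box
        touching = sum((x + dx, y + dy, z) in blocks for dx, dy in sides)
        # The box stays only if m*g < sum of F over its touching side faces.
        if mass * g >= touching * side_friction:
            falling.add((x, y, z))
    return falling
```

With these made-up numbers a block hanging off the side of a tower by one face falls, while a single block bridging two towers is held by its two faces.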
A professor of mine suggested that I look at this paper.
Additionally, I found the keyword for what it is I'm looking for. "Structural Analysis." I bought a textbook and I have a long road ahead of me.

fractal microscope simulator

I've done work on software used for controlling imaging hardware, such as microscopes, that are sometimes hard to get time on. This means it is difficult to test out new/different algorithms which would require access to the instrument. I'd like to create a synthetic instrument that could be used for some of these testing purposes, and I was thinking of using some kind of fractal image generation to create the synthetic images. The key would be to be able to generate features at many different 'magnifications' and locations in some sort of deterministic manner. This is because some of the algorithms being tested may need to pan/zoom and relocate previously 'imaged' areas. Onto these base images I can then apply whatever instrument 'defects' are appropriate (focus, noise, saturation, etc.).
I'm at a bit of a loss on how to select/implement a good fractal algorithm for the base image. Any help would be appreciated. Preferably it would have the following qualities:
1. Be fast at rendering new image areas.
2. Fairly wide 'feature' coverage at as many locations and scales as possible.
3. Be deterministic (but initialized from random starting parameters).
4. Ability to tune to make images look more like 'real' images.
Item 2 is important. For example, a Mandelbrot set, with its large smooth/empty regions, might not be good, since the software controlling the synthetic scope might fall into one of these areas.
So far I've thought of using something like a Mandelbrot set, but randomly shifting/rotating/scaling and merging two or more fractal sets to get more complete 'feature' coverage.
I've also seen images of the fractal flame algorithms and they seem to generate images that might be useful (and nice to look at).
Finally, I've thought of using some sort of paused particle simulation run to generate images that are more cell-like (my current imaging target), but I'm not sure if this approach can be made to work with the other requirements.
Edit:
@Jeffrey - So it sounds like some kind of terrain generation might be the way to go, as long as I have complete control over the PRNG. Perhaps I can use some stored initial seed + x position + y position to generate my random numbers? But then I am unsure of how to consistently generate the terrains across scales, except, as you mentioned, to create the base terrain at the coarsest scale, and at certain pre-determined 'magnifications' add new deterministic pseudo-random variations to this base. I'd also have to be careful about when to generate the next level of terrain, since if I'm too aggressive I'd have to generate and integrate the results appropriately for display at the coarser level... This is why I initially was leaning toward a more 'traditional' fractal, since this integration from finer scales would be handled more implicitly (I think).
The idea behind a fractal terrain creation algorithm is to build the image at each scale separately. For a landscape it's easy: just make a small array of height values, and set them randomly. Then scale it up to a larger array, averaging the values so that the contour is smooth, and then add small random amounts to those values. Then scale it up, etc. The original small bumps have become mountains, and they are filled with complex terrain.
There are two particular difficulties with the problem posed here, though. First, you don't want to store any of these values, since it would be potentially huge. Secondly, the features at each scale are of a different kind than the features at other scales.
These problems are not insurmountable.
Basically, you would divide the image up into a grid, and using deterministic pseudorandom numbers establish the key features of each square in the grid. For example, each square could have a certain density of cell types.
At the next level of magnification, subdivide each square into another grid, and apply a gradient of values across the grid that is based on the values of the containing square and its surrounding squares. Then apply pseudorandom variations to that, seeded with the containing square's grid coordinates. For the random seed, always use the coordinates of the immediately containing square of the subdivision under consideration regardless of where the image is cropped, in order to ensure that it is recreated correctly across multiple runs.
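One way to get that kind of repeatable per-square randomness, sketched in Python: derive a generator from a global seed plus the magnification level and grid coordinates. The function names and the use of SHA-256 to mix the values are my own assumptions; any stable mixing function would do.

```python
import hashlib
import random

def cell_rng(seed, level, ix, iy):
    """Return a random.Random that depends only on (seed, level, ix, iy)."""
    text = f"{seed}:{level}:{ix}:{iy}".encode()
    digest = hashlib.sha256(text).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def cell_density(seed, level, ix, iy):
    """Example feature: a density in [0, 1) for one grid square at one level."""
    return cell_rng(seed, level, ix, iy).random()
```

Because nothing is stored, panning away and coming back to the same square at the same level should reproduce the same values, regardless of how the view is cropped.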
At some level of magnification the random values go from being densities of particle types to particle locations. Then for each particle, there are particle features. Then features on those features.
Although arbitrary left/right and up/down scrolling will be desired, the image at all levels of magnification above the current scene will have to be calculated each time the frame is shifted to ensure that all necessary features are included. This way the image can be scrolled from one cell to another without loss of consistency. Particle simulations can be used to ensure that cells or cell features don't overlap. This could be done in a repeatable, deterministic manner.
And don't forget to apply a smoothing gradient based on averages of surrounding squares at higher levels before adding in the random variations. Otherwise, the abrupt changes will make the squares themselves appear in the images!
This answer is somewhat rambling and probably confusing, but that is best I can explain it right now. I hope it helps!

Jelly physics 3d

I want to ask about jelly physics ( http://www.youtube.com/watch?v=I74rJFB_W1k ): where can I find a good place to start making things like that? I want to make a simulation of car crashes using this jelly physics, but I can't find much about it. I don't want to use an existing physics engine; I want to write my own :)
Something like what you see in the video you linked to could be accomplished with a mass-spring system. However, as you vary the number of masses and springs, keeping your spring constants the same, you will get wildly varying results. In short, mass-spring systems are not good approximations of a continuum of matter.
Typically, these sorts of animations are created using what is called the Finite Element Method (FEM). The FEM does converge to a continuum, which is nice. And although it does require a bit more know-how than a mass-spring system, it really isn't too bad. The basic idea, derived from the study of continuum mechanics, can be put this way:
1. Break the volume of your object up into many small pieces (elements), usually tetrahedra. Let's call the entire collection of these elements the mesh. You'll actually want to make two copies of this mesh. Label one the "rest" mesh, and the other the "world" mesh. I'll tell you why next.
2. For each tetrahedron in your world mesh, measure how deformed it is relative to its corresponding rest tetrahedron. This measure of deformation is called "strain". It is typically computed by first measuring what is known as the deformation gradient (often denoted F). There are several good papers that describe how to do this. Once you have F, one very typical way to define the strain (e) is:
   e = 1/2 * (F^T * F - I). This is known as Green's strain. It is invariant to rotations, which makes it very convenient.
3. Using the properties of the material you are trying to simulate (gelatin, rubber, steel, etc.), and using the strain you measured in the step above, derive the "stress" of each tetrahedron.
4. For each tetrahedron, visit each node (vertex, corner, point (these all mean the same thing)) and average the area-weighted normal vectors (in the rest shape) of the three triangular faces that share that node. Multiply the tetrahedron's stress by that averaged vector, and there's the elastic force acting on that node due to the stress of that tetrahedron. Of course, each node could potentially belong to multiple tetrahedra, so you'll want to be able to sum up these forces.
5. Integrate! There are easy ways to do this, and hard ways. Either way, you'll want to loop over every node in your world mesh and divide its forces by its mass to determine its acceleration. The easy way to proceed from here is to:
   a. Multiply its acceleration by some small time value dt. This gives you a change in velocity, dv.
   b. Add dv to the node's current velocity to get a new total velocity.
   c. Multiply that velocity by dt to get a change in position, dx.
   d. Add dx to the node's current position to get a new position.
   This approach is known as explicit forward Euler integration. You will have to use very small values of dt to get it to work without blowing up, but it is so easy to implement that it works well as a starting point.
6. Repeat steps 2 through 5 for as long as you want.
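Two of the steps above are small enough to sketch directly in Python: the Green's strain formula and the per-node time step. The function names and the plain nested-list matrix representation are assumptions for illustration; computing F itself and the stress are left out. Note that, following the steps as written, the position update uses the already-updated velocity, a variant that is often called semi-implicit (symplectic) Euler.

```python
def green_strain(F):
    """Green's strain e = 1/2 * (F^T * F - I) for a 3x3 matrix given as nested lists."""
    return [
        [
            0.5 * (sum(F[k][i] * F[k][j] for k in range(3)) - (1.0 if i == j else 0.0))
            for j in range(3)
        ]
        for i in range(3)
    ]

def euler_step(pos, vel, force, mass, dt):
    """Advance one node by dt; pos, vel and force are 3-element lists."""
    acc = [f / mass for f in force]                       # a = F / m
    new_vel = [v + a * dt for v, a in zip(vel, acc)]      # v += a * dt
    new_pos = [p + v * dt for p, v in zip(pos, new_vel)]  # x += v * dt
    return new_pos, new_vel
```

For an undeformed or merely rotated tetrahedron the strain comes out as all zeros, which is the rotation invariance mentioned above; a stretch by a factor of 2 along x gives e[0][0] = 1.5.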
I've left out a lot of details and fancy extras, but hopefully you can infer a lot of what I've left out. Here is a link to some instructions I used the first time I did this. The webpage contains some useful pseudocode, as well as links to some relevant material.
http://sealab.cs.utah.edu/Courses/CS6967-F08/Project-2/
The following link is also very useful:
http://sealab.cs.utah.edu/Courses/CS6967-F08/FE-notes.pdf
This is a really fun topic, and I wish you the best of luck! If you get stuck, just drop me a comment.
That rolling jelly cube video was made with Blender, which uses the Bullet physics engine for soft body simulation. The Bullet documentation in general is very sparse and for soft body dynamics almost nonexistent. Your best bet would be to read the source code.
Then write your own version ;)
Here is a page with some pretty good tutorials on it. The one you are looking for is probably in the (inverse) Kinematics and Mass & Spring Models sections.
Hint: A jelly can be seen as a 3 dimensional cloth ;-)
Also, try having a look at the search results for spring pressure soft body model - they might get you going in the right direction :-)
See this guy's page Maciej Matyka, topic of soft body
Unfortunately they are 2D only, but JellyPhysics and JellyCar might be something to start with.