Interpretation of 'ufactor' on a toy graph clustering

I am trying to do an imbalanced partition with METIS. I do not need an equal number of vertices in each cluster (which METIS enforces by default). My graph has no constraints; it is an undirected, unweighted graph. Here is an example toy graph clustered by METIS without any ufactor parameter.
Then I tried different ufactor values, and at 143 METIS starts to
produce the expected clustering, like the following.
Can anybody interpret this? Eventually, I want to find a way to guess a ufactor, for any unbalanced undirected graph, that minimizes the normalized cut without necessarily enforcing balance.

Imbalance = 1 + (ufactor/1000). By default, imbalance = 1. The number of vertices allowed in the largest cluster is
imbalance * (number of vertices / number of clusters)
For the first picture (default clustering), the largest cluster may hold
1 * (14/2) = 7 vertices, so the second cluster also gets 14 - 7 = 7.
For the second picture (ufactor = 143):
imbalance = 1 + 143/1000 = 1.143
so 1.143 * (14/2) = 8.001,
which allows the largest cluster to hold 8 vertices.
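To guess a ufactor for a target imbalance, you can invert the formula above. A minimal sketch (the function name min_ufactor and the ceiling-based rounding are my own, not part of METIS):

```python
import math

def min_ufactor(n_vertices, n_parts, max_cluster_size):
    """Smallest integer ufactor that permits a largest cluster of
    max_cluster_size, given imbalance = 1 + ufactor/1000 and the
    size bound imbalance * (n_vertices / n_parts)."""
    imbalance = max_cluster_size * n_parts / n_vertices
    return max(0, math.ceil((imbalance - 1) * 1000))

print(min_ufactor(14, 2, 8))  # 143, matching the toy example above
print(min_ufactor(14, 2, 7))  # 0, the default balanced case
```

For the 14-vertex toy graph, allowing 8 vertices in one cluster gives imbalance 8*2/14 ≈ 1.1429, i.e. the smallest admissible ufactor is 143 — exactly the value at which the expected clustering appeared.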


Measuring the "remoteness" of a node in a graph

I mapped out all the edges of a graph in the ancient BBS game 'TradeWars 2002'. There are 1000 nodes. The graph is officially a directed graph, although most edges between nodes are undirected. The graph is strongly connected.
I modelled the universe in networkx. I'd like to use networkx methods to identify the "most remote" nodes in the network. I don't know how to articulate "most remote" in graph-theory terminology, though. The idea I have is nodes that would be bumped into very rarely when someone is transiting between two other arbitrary nodes, and that on the edge of the well-connected nodes there might be a string of nodes that extends out along a single path and terminates.
A visualization of what I imagine is node 733. It is pretty unlikely that someone accidentally stumbles onto that one, compared to other, better-connected nodes.
What could I use from networkx library to quantify some measure of 'remoteness'?
This is the entire universe:
But the idea I have is nodes that would be bumped into very rarely when someone is transiting between two other arbitrary nodes.
As @Joel mentioned, there are many centrality measures available, and there are often strong correlations between them, such that many of them will probably give you more or less what you want.
That being said, I think the class of centrality measures that most closely reflect your intuition are based on random walks. Some of these are pretty costly to compute (although see this paper for some recent improvements on that front) but luckily there is a strong correspondence between the Eigenvector centrality and the frequency with which nodes are visited by a random walker.
The implementation in networkx is available via networkx.algorithms.centrality.eigenvector_centrality.
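As a sketch of that approach (using the built-in karate-club graph as a stand-in for the TradeWars map, which is an assumption on my part):

```python
import networkx as nx

# Stand-in graph; substitute your TradeWars universe here
G = nx.karate_club_graph()

# Eigenvector centrality approximates how often a random walker
# visits each node; the lowest-scoring nodes are the most "remote".
ec = nx.eigenvector_centrality(G, max_iter=1000)
remote = sorted(ec, key=ec.get)[:5]
print(remote)
```

Nodes at the end of the sorted list would be the well-connected hubs; the first few are the candidates for remoteness.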
networkx has a collection of algorithms for this kind of problem: centrality. For example, you can use the simplest one, closeness_centrality:
# Create a random graph
import networkx as nx
G = nx.gnp_random_graph(50, 0.1)
nx.closeness_centrality(G)
{0: 0.3888888888888889,
 1: 0.45794392523364486,
 2: 0.35507246376811596,
 3: 0.4375,
 4: 0.4083333333333333,
 5: 0.3684210526315789,
 ...}
# Draw the graph
labels = {n: n for n in G.nodes}
nx.draw(G, with_labels=True, labels=labels)
And the most remote (least central) nodes can be listed by returning the nodes with the lowest closeness_centrality (note the node IDs and the nodes in blue circles in the picture above):
c = nx.closeness_centrality(G)
sorted(c.items(), key=lambda x: x[1])[:5]
[(48, 0.28823529411764703),
(7, 0.33793103448275863),
(11, 0.35251798561151076),
(2, 0.35507246376811596),
(46, 0.362962962962963)]

How to merge clustering results for different clustering approaches?

Problem: It appears to me that a fundamental property of a clustering method c() is whether we can combine the results c(A) and c(B) by some function f() of the two clusterings, so that we do not have to run the full clustering c(A+B) again but can instead compute f(c(A), c(B)) and still end up with the same result:
c(A+B) == f(c(A),c(B))
I suppose that a necessary condition for some c() to have this property is that it is deterministic, that is, the order of its internal processing is irrelevant to the result. However, this might not be sufficient.
It would be really nice to have some reference where to look up which cluster methods support this and what a good f() looks like in the respective case.
Example: At the moment I am thinking about DBSCAN which should be deterministic if I allow border points to belong to multiple clusters at the same time (without connecting them):
One point is reachable from another point if it is in its eps-neighborhood
A core point is a point with at least minPts reachable points
An edge goes from every core point to all points reachable from it
Every point with incoming edge from a core point is in the same cluster as the latter
If you are missing the noise points, assume that each core point reaches itself (reflexivity), and then define noise points to be clusters of size one. Border points are non-core points. Afterwards, if we want a partitioning, we can randomly assign each border point that lies in multiple clusters to one of them. I do not consider this relevant to the method itself.
Supposedly the only clustering where this is efficiently possible is single-linkage hierarchical clustering, because the edges removed within A x A and B x B are not necessary for finding the MST of the joined set.
For DBSCAN specifically, you have the problem that the core-point property can change when you add data. So c(A+B) likely has core points that were core in neither A nor B. This can cause clusters to merge. f() supposedly needs to re-check all data points, i.e., rerun DBSCAN. While you can exploit the fact that core points of a subset must be core points of the entire set, you will still need to find neighbors and the missing core points.
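The single-linkage claim can be checked numerically: the MST of A+B has the same total weight whether it is built from the full distance graph or only from MST(A), MST(B), and the cross edges. A toy demonstration (the four points and helper names are mine, for illustration only):

```python
import itertools
import math

import networkx as nx

# Toy data: two well-separated pairs of points
points = {0: (0, 0), 1: (0, 1), 2: (5, 0), 3: (5, 1)}
A, B = [0, 1], [2, 3]

def dist_graph(nodes):
    """Complete graph on `nodes` with Euclidean edge weights."""
    G = nx.Graph()
    for u, v in itertools.combinations(nodes, 2):
        (x1, y1), (x2, y2) = points[u], points[v]
        G.add_edge(u, v, weight=math.hypot(x1 - x2, y1 - y2))
    return G

# MST over the full joined distance graph
mst_full = nx.minimum_spanning_tree(dist_graph(list(points)))

# Reduced graph: MST(A) + MST(B) + all cross edges between A and B
reduced = nx.Graph()
reduced.add_edges_from(nx.minimum_spanning_tree(dist_graph(A)).edges(data=True))
reduced.add_edges_from(nx.minimum_spanning_tree(dist_graph(B)).edges(data=True))
for u in A:
    for v in B:
        (x1, y1), (x2, y2) = points[u], points[v]
        reduced.add_edge(u, v, weight=math.hypot(x1 - x2, y1 - y2))
mst_reduced = nx.minimum_spanning_tree(reduced)

total = lambda T: sum(d["weight"] for _, _, d in T.edges(data=True))
assert abs(total(mst_full) - total(mst_reduced)) < 1e-9
```

Edges discarded inside A x A or B x B when computing the partial MSTs can never re-enter the MST of the union, which is why f() is cheap for single linkage.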

checking if a directed graph has only single one topological sort

I'm trying to write pseudo-code for an algorithm that is supposed to check whether a directed graph has exactly one topological ordering. I've already come up with pseudo-code for a topological sort (using DFS), but it does not seem to help me much. I wonder: if there are no sinks in the graph, then there is not a single topological ordering (might that help?).
This is an improvement of this answer, as the best possible runtime is improved when it starts at a vertex with no incoming edges.
If your directed graph has N vertices, and you have exactly one starting vertex with indegree 0,
Do DFS (but only on the starting vertex) to get a topological sort L.
If L doesn't have N vertices, either some vertices are unreachable (those are part of a cycle) or you need another starting vertex (so there are multiple toposorts).
For i in [0,N-2],
If there is no edge from L[i] to L[i+1],
Return false.
Return true.
Alternatively, modify Kahn’s algorithm: if more than one vertex is in the set (of vertices of indegree 0) while the algorithm is running, return false. Otherwise, return true.
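The Kahn variant can be sketched as follows (a minimal implementation, assuming the graph is given as vertex count plus an edge list):

```python
from collections import deque

def has_unique_toposort(n, edges):
    """True iff the DAG on vertices 0..n-1 has exactly one topological
    ordering. Kahn's algorithm; the ordering is unique iff the queue of
    indegree-0 vertices never holds more than one vertex."""
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    q = deque(i for i in range(n) if indeg[i] == 0)
    seen = 0
    while q:
        if len(q) > 1:          # two candidates => at least two orderings
            return False
        u = q.popleft()
        seen += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    return seen == n            # False if a cycle left vertices unprocessed

print(has_unique_toposort(3, [(0, 1), (1, 2)]))  # True: the path 0->1->2
print(has_unique_toposort(3, [(0, 2), (1, 2)]))  # False: 0 and 1 can swap
```

Equivalently, a unique ordering exists exactly when consecutive vertices of one topological sort are always joined by an edge, which is what the DFS-based check above verifies.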

How can I write a logical process for finding the area of a point on a graph?

I have the following graph with two different parameters, called p and t.
Their relationship was found experimentally. Manually, knowing (t, p), you can simply find the area number (group) of the point based on where it is located. For example, a point M(t, p) located in area 3 belongs to group number 3. However, I would like to write code / a logical approach that finds the group numbers automatically, so that when it reads (t, p) it finds the location of the point and returns the group/area number it belongs to.
Is there any solution in MATLAB for this? Graph
If you have the Image Processing Toolbox and your contours are closed, you can use imfill to fill them up (a bit like the bucket tool in Paint) and assign different values to each filled up region. Does this make sense to you? Let me know if you would like more detail.
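If Python is also an option, the same fill-and-label idea can be sketched with scipy.ndimage (the toy mask and the group_of helper are my own illustration, assuming the contours have been rasterized into a closed binary mask):

```python
import numpy as np
from scipy import ndimage

# Toy binary image: two closed square contours (True = contour pixel)
img = np.zeros((12, 12), dtype=bool)
img[1:6, 1:6] = True;   img[2:5, 2:5] = False    # closed ring 1
img[6:11, 6:11] = True; img[7:10, 7:10] = False  # closed ring 2

filled = ndimage.binary_fill_holes(img)  # fill each closed contour
regions, n_regions = ndimage.label(filled)  # one integer label per area

def group_of(row, col):
    """Area number of a rasterized (t, p) point; 0 means outside all areas."""
    return int(regions[row, col])

print(n_regions)      # 2 areas found
print(group_of(3, 3)) # inside the first contour -> group 1
```

For real (t, p) data you would first map the physical coordinates onto pixel indices of the rasterized graph; that mapping depends on your axes and resolution.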

buffer of clusters in a sparse matrix

I work with MATLAB.
I have a sparse matrix where I identified different clusters. The value within each cluster is equal, while each cluster has its own unique value. I have 0s as background (outside clusters). Here is an example with clusters 1 and 2:
A = (0 0 0 0 0 2 0 0 2 0 0 0
     1 1 0 0 0 2 2 2 2 0 0 0
     1 1 1 0 0 0 2 2 2 2 0 0
     1 1 0 0 0 0 0 2 2 0 0 0
     1 1 1 0 0 0 0 0 0 0 0 0)
I'd like to use each cluster as "a polygon" and study the values of the outside neighboring pixels (a sort of buffer, as in vector data). Obviously, in the example it would output 0 as the mean every time, but the point is understanding how to do it, as I have to apply this to another matrix (I work with geolocated data, so I would use the buffer area to find mean values in specific rasters). Is there a way to do that? Also, if so, can I specify the width of this buffer (as a number of pixels)?
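One way to build such a buffer is morphological dilation: dilate the cluster mask by the desired width, then subtract the cluster itself to keep only the surrounding ring. A Python sketch with scipy.ndimage (the buffer_mask helper is mine; in MATLAB the analogous tool would be imdilate):

```python
import numpy as np
from scipy import ndimage

A = np.array([
    [0, 0, 0, 0, 0, 2, 0, 0, 2, 0, 0, 0],
    [1, 1, 0, 0, 0, 2, 2, 2, 2, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 2, 2, 2, 2, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
])

def buffer_mask(A, cluster_value, width=1):
    """Boolean mask of the pixels within `width` pixels of the cluster,
    excluding the cluster itself (its outside buffer ring)."""
    mask = A == cluster_value
    dilated = ndimage.binary_dilation(mask, iterations=width)
    return dilated & ~mask

ring = buffer_mask(A, 2, width=1)
print(A[ring].mean())  # mean of the neighbors of cluster 2 (0.0 here)
```

The width parameter is the number of dilation iterations, i.e., the buffer width in pixels; to sample a different raster, index it with the same ring mask (raster[ring].mean()) instead of A.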