Measuring the "remoteness" of a node in a graph - networkx

I mapped out all the edges of a graph in the ancient BBS game 'TradeWars 2002'. There are 1000 nodes. The graph is officially a directed graph, although most edges between nodes are undirected. The graph is strongly connected.
I modelled the universe in networkx. I'd like to use networkx methods to identify the "most remote" nodes in the network. I don't know how to articulate "most remote" in graph-theory terminology, though. The idea I have is nodes that would be bumped into very rarely when someone is transiting between two other arbitrary nodes, and that at the edge of the well-connected nodes there might be a string of nodes extending out along a single path that terminates.
Below is a visualization of what I imagine: node 733. It's pretty unlikely someone accidentally stumbles onto that one, compared to other, better-connected nodes.
What could I use from networkx library to quantify some measure of 'remoteness'?
This is the entire universe:

But the idea I have is nodes that would be bumped into very rarely when someone is transiting between two other arbitrary nodes.
As @Joel mentioned, there are many centrality measures available, and there are often strong correlations between them, such that many of them will probably give you more or less what you want.
That being said, I think the class of centrality measures that most closely reflects your intuition is the one based on random walks. Some of these are pretty costly to compute (although see this paper for some recent improvements on that front), but luckily there is a strong correspondence between eigenvector centrality and the frequency with which nodes are visited by a random walker.
The implementation in networkx is available via networkx.algorithms.centrality.eigenvector_centrality.
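For example, a minimal sketch, assuming a random directed graph as a stand-in for the 1000-sector TradeWars map (all parameters here are made up):

import networkx as nx

# Stand-in for the universe: 1000 sectors, sparse directed edges
G = nx.gnp_random_graph(1000, 0.01, directed=True, seed=42)

# Nodes a random walker rarely visits get low eigenvector centrality
centrality = nx.eigenvector_centrality(G, max_iter=1000)

# The five "most remote" sectors under this measure
print(sorted(centrality, key=centrality.get)[:5])

If the power iteration fails to converge on your graph, networkx also offers eigenvector_centrality_numpy as a drop-in alternative.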

networkx has a collection of algorithms for this kind of problem: centrality. For example, you can use the simplest of them, closeness_centrality:
import networkx as nx

# Create a random graph
G = nx.gnp_random_graph(50, 0.1)
nx.closeness_centrality(G)
{0: 0.3888888888888889,
1: 0.45794392523364486,
2: 0.35507246376811596,
3: 0.4375,
4: 0.4083333333333333,
5: 0.3684210526315789,
...
# Draw the graph
labels = {n: n for n in G.nodes}
nx.draw(G, with_labels=True, labels=labels)
The most remote (least central) nodes can then be listed by returning the nodes with the lowest closeness_centrality (note the node IDs and the nodes in blue circles in the picture above):
c = nx.closeness_centrality(G)
sorted(c.items(), key=lambda x: x[1])[:5]
[(48, 0.28823529411764703),
(7, 0.33793103448275863),
(11, 0.35251798561151076),
(2, 0.35507246376811596),
(46, 0.362962962962963)]

Related

Finding "bubbles" in a graph

In a game, we have a universe described as a strongly-connected graph full of sectors and edges. Occasionally there are small pockets, players call them 'bubbles', where a small group of nodes all access the rest of the network through a single node. In the graph below, sectors 769, 195, 733, 918, and 451 are all only reachable via node 855. If you can guard 855 effectively, then those other sectors are safe. Other nodes on the chart have more edges (purple lines) and aren't 'bubbles' in the player nomenclature.
In a 1000- or 5000-node network, it's not easy to find these sorts of sub-structures by eye. But I suspect this idea is described in graph theory somehow, and so is probably searchable for in networkx.
Could anyone suggest a graph-theory approach to systematically find structures like this? To be clear, the graph is a directed graph, but almost all edges end up being bidirectional in practice. Edges are unweighted.
Graph theory has no definition for your "bubbles", but it has a similar concept: bridges. A bridge is an edge whose removal increases the number of connected components. As you can see, it is exactly what you need. networkx has a function for finding bridges. Curiously enough, it is called bridges :)
Example:
import networkx as nx
G = nx.Graph()
G.add_edges_from([
    (1, 2), (1, 3), (2, 3),
    (1, 4), (4, 5), (5, 6),
    (6, 7), (7, 4),
])
nx.draw(G)
list(nx.bridges(G))
[(1, 4)]
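To get from the bridges back to the "bubbles" themselves, one option is to remove each bridge and keep the smaller of the two components it leaves behind. A sketch, reusing G from the example above:

# For each bridge, the smaller component left after removing it is the "bubble"
for u, v in nx.bridges(G):
    H = G.copy()
    H.remove_edge(u, v)
    bubble = min(nx.connected_components(H), key=len)
    print((u, v), sorted(bubble))

On the toy graph this prints (1, 4) [1, 2, 3], the smaller side of the single bridge; on the game map, the small side of each bridge would be the guarded pocket.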

How to merge clustering results for different clustering approaches?

Problem: It appears to me that a fundamental property of a clustering method c() is whether we can combine the results c(A) and c(B) by some function f() of the two clusterings, in such a way that we do not have to apply the full clustering c(A+B) again but can instead compute f(c(A), c(B)) and still end up with the same result:
c(A+B) == f(c(A),c(B))
I suppose that a necessary condition for some c() to have this property is that it is deterministic, that is, the order of its internal processing is irrelevant for the result. However, this might not be sufficient.
It would be really nice to have some reference where to look up which cluster methods support this and what a good f() looks like in the respective case.
Example: At the moment I am thinking about DBSCAN, which should be deterministic if I allow border points to belong to multiple clusters at the same time (without connecting them):
One point is reachable from another point if it is in its eps-neighborhood
A core point is a point with at least minPts points reachable
An edge goes from every core point to all points reachable from it
Every point with incoming edge from a core point is in the same cluster as the latter
If you miss the noise points, then assume that each core point reaches itself (reflexivity), and afterwards we define noise points to be clusters of size one. Border points are non-core points. Afterwards, if we want a partitioning, we can randomly assign the border points that are in multiple clusters to one of them. I do not consider this relevant for the method itself.
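As a concrete sketch of that formulation (a hypothetical helper for 1-D points, using plain absolute distance):

def dbscan_edges(X, eps, min_pts):
    n = len(X)
    # eps-neighborhood of each point (note it includes the point itself)
    reach = [{j for j in range(n) if abs(X[i] - X[j]) <= eps} for i in range(n)]
    # core points have at least min_pts reachable points
    cores = [i for i in range(n) if len(reach[i]) >= min_pts]
    # an edge goes from every core point to all points reachable from it
    return [(i, j) for i in cores for j in reach[i]]

Clusters are then the groups connected by these edges, with border points allowed to show up in more than one of them.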
Supposedly the only clustering where this is efficiently possible is single linkage hierarchical clustering, because edges removed from A x A and B x B are not necessary for finding the MST of the joined set.
For DBSCAN precisely, you have the problem that the core-point property can change when you add data. So c(A+B) likely has core points that were not core in either A nor B. This can cause clusters to merge. f() presumably needs to re-check all data points, i.e., rerun DBSCAN. While you can exploit the fact that core points of a subset must be core points of the entire set, you'll still need to find the new neighbors and the missing core points.
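A tiny made-up 1-D illustration with scikit-learn: with eps = 1.5 and min_samples = 3, the boundary points of A and B are not core in either subset alone, but become core in A+B, merging everything into one cluster:

import numpy as np
from sklearn.cluster import DBSCAN

A = np.array([[0.0], [1.0], [2.0]])
B = np.array([[3.0], [4.0], [5.0]])

def cores(X):
    db = DBSCAN(eps=1.5, min_samples=3).fit(X)
    return sorted(float(x) for x in X[db.core_sample_indices_].ravel())

print(cores(A))                  # [1.0] -- only the middle point is core
print(cores(B))                  # [4.0]
print(cores(np.vstack([A, B])))  # [1.0, 2.0, 3.0, 4.0] -- new core points appear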

Interpretation of 'ufactor' on a toy graph clustering

I am trying to do an imbalanced partition with METIS. I do not need an equal number of vertices in each cluster (which is what METIS does by default). My graph has no constraints; it's an undirected, unweighted graph. Here is an example toy graph clustered by METIS with no ufactor parameter.
Then I tried different ufactor values, and at 143 METIS starts to produce the expected clustering, like the following:
Can anybody interpret this? Eventually, I want to find a way to guess a ufactor for any unbalanced, undirected graph that will minimize the normalized cut without necessarily enforcing balance.
imbalance = 1 + (ufactor / 1000), and by default imbalance = 1. The number of vertices allowed in the largest cluster is
imbalance * (number of vertices / number of clusters)
For the first picture (default clustering), the largest cluster may have 1 * (14 / 2) = 7 vertices, so the second cluster also gets 14 - 7 = 7.
In the second picture (ufactor = 143), imbalance = 1 + 143/1000 = 1.143, so 1.143 * (14 / 2) = 8.001, which allows the largest cluster to have 8 vertices.
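If you know how large you are willing to let the biggest cluster get, you can invert this formula to guess a ufactor. A small hypothetical helper (the function name is made up):

import math

# max_cluster_size = (1 + ufactor/1000) * (n_vertices / n_clusters)
# => the smallest ufactor that permits a given largest-cluster size:
def ufactor_for(max_cluster_size, n_vertices, n_clusters):
    imbalance = max_cluster_size * n_clusters / n_vertices
    return max(0, math.ceil((imbalance - 1) * 1000))

print(ufactor_for(8, 14, 2))  # 143, matching the toy example above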

Understanding Titan Traversals

I am trying to write a highly scalable system with titandb. I have a situation where some nodes are highly connected.
Imagine the following example at much larger scale.
Now I have the following situations:
I want to find all the friends of node X.
I want to find a specific friend of node X for example 5.
For scenario 1 I do g.V(X).out('friend').toList(); for scenario 2 I do g.V(X).out('friend').hasId(5).next(). Both of these traversals work, but they scale poorly as X gets more friends. Can I optimise this situation by putting more information on the edge label? For example, if on the edge between X and 5 I change the label to friend_with_5, will the following be faster:
`g.V(X).out('friend_with_5').next()`
From my understanding this would be faster, as only one edge would be traversed. However, if I make such a change to my edge labels, how would I find all the friends of X?
You could encode data into your edge label, but doing so complicates your graph schema considerably and, as you note, makes it hard to do simple things like "find all my friends". I don't think you should take that approach.
The preferred method for dealing with this is vertex-centric indices. If you denormalize any data onto your edges, you should do it with those indices in mind (and not by encoding that data into the edge label). Put some unique identifier for the friend on the "friend" edge and index that.
If your supernodes are especially large (millions+ edges) you should also consider Titan's vertex partitioning feature.
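For what it's worth, a sketch of how that might look in the Gremlin console (the label, property, and index names here are made up; check Titan's documentation on vertex-centric indices for the exact API):

mgmt = graph.openManagement()
friend = mgmt.makeEdgeLabel('friend').make()
uid = mgmt.makePropertyKey('uid').dataType(Long.class).make()
// index 'friend' edges by the uid property stored on each edge
mgmt.buildEdgeIndex(friend, 'friendByUid', Direction.BOTH, uid)
mgmt.commit()

Scenario 2 then becomes an indexed lookup instead of a scan over all of X's edges:

g.V(X).outE('friend').has('uid', 5L).inV().next()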

Heuristics to find "surprising" mutual friends

I have the undirected friendship graph of my Facebook friends, G, i.e. G[i][j] = G[j][i] = true iff my friends i and j are friends with each other. I want to find "surprising" mutual friends, i.e. pairs of my friends I would not normally expect to know each other. What are some good heuristics/algorithms I can apply? My initial idea is to run a clustering algorithm (not sure which one is best) and look for edges going across clusters. Any other ideas? And what's a good clustering algorithm that takes in G and spits out clusters?
Here is my idea: friendship is an edge, and a surprising friendship is an edge such that, if you remove it, the distance between the two nodes becomes very large.
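A brute-force sketch of this idea in Python with networkx, using a built-in graph as a stand-in for the Facebook data:

import networkx as nx

G = nx.karate_club_graph()  # stand-in for the friends graph

def surprise(G, u, v):
    # how far apart do u and v become once their friendship is removed?
    H = G.copy()
    H.remove_edge(u, v)
    try:
        return nx.shortest_path_length(H, u, v)
    except nx.NetworkXNoPath:
        return float('inf')  # removing the edge disconnects the pair

scores = {e: surprise(G, *e) for e in G.edges}
print(sorted(scores, key=scores.get, reverse=True)[:5])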
The answer by Wu Yongzheng can be tied to an existing network concept that is robust and perhaps more sensitive, i.e. a quantitative take on "the distance between the nodes becomes very large". This concept is edge betweenness. In this context one would compute an estimated version. See e.g. https://en.wikipedia.org/wiki/Betweenness_centrality and http://igraph.sourceforge.net/doc/R/betweenness.html.
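The links above are for igraph, but the same measure is available in networkx; reusing G from the sketch above:

eb = nx.edge_betweenness_centrality(G)
print(sorted(eb, key=eb.get, reverse=True)[:5])  # edges carrying the most shortest paths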