Finding "bubbles" in a graph - networkx

In a game, we have a universe described as a strongly connected graph of sectors (nodes) and edges. Occasionally there are small pockets, which players call 'bubbles', where a small group of nodes all access the rest of the network through a single node. In the graph below, sectors 769, 195, 733, 918, and 451 are all reachable only via node 855. If you can guard 855 effectively, then those other sectors are safe. Other nodes on the chart have more edges (purple lines) and aren't 'bubbles' in the player nomenclature.
In a 1000- or 5000-node network, it's not easy to find these sorts of sub-structures by eye. But I suspect this idea is described in graph theory somehow, and so is probably something networkx can search for.
Could anyone suggest a graph-theory approach to systematically find structures like this? To be clear, the graph is directed, but almost all edges end up being bidirectional in practice. Edges are unweighted.

Graph theory has no definition for your "bubbles", but it has a similar concept: bridges. A bridge is an edge whose removal increases the number of connected components. As you can see, that is exactly what you need. networkx has an algorithm to find bridges. Curiously enough, it is called bridges :)
Example:
import networkx as nx

# Two cycles (1-2-3 and 4-5-6-7) joined by the single edge (1, 4)
G = nx.Graph()
G.add_edges_from([
    (1, 2), (1, 3), (2, 3),
    (1, 4), (4, 5), (5, 6),
    (6, 7), (7, 4),
])
nx.draw(G)
list(nx.bridges(G))
# [(1, 4)]
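Since the game graph is directed (though effectively bidirectional), one plausible workflow, sketched below on a hypothetical map that reuses the sector numbers from the question, is to convert it to undirected, find the bridges, and for each bridge take the smaller component left after removing it, which is the "bubble":

import networkx as nx

G = nx.Graph()
# Hypothetical well-connected core around node 855
G.add_edges_from([(855, 1), (855, 2), (1, 2), (2, 3), (3, 4), (4, 1),
                  (3, 855), (4, 5), (5, 855), (5, 6), (6, 1)])
# Hypothetical "bubble" hanging off 855 via a single edge
G.add_edges_from([(855, 769), (769, 195), (195, 733),
                  (733, 918), (918, 451), (451, 769)])

# If starting from a DiGraph D, use G = D.to_undirected() first.
for u, v in nx.bridges(G):
    H = G.copy()
    H.remove_edge(u, v)
    # the smaller of the two resulting components is the "bubble"
    bubble = min(nx.connected_components(H), key=len)
    print((u, v), sorted(bubble))
# e.g. (855, 769) [195, 451, 733, 769, 918]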

Related

Why am I getting different community detection results for NetworkX and Gephi?

I've got a network with undirected, weighted edges. I'm interested in whether nodes A and B are in the same community. When I run "modularity" in Gephi, the modularity class for node A is usually, though not always, distinct from that of B. However, when I switch over to Python and run, on the exact same underlying data, either louvain_communities() (from the networkx.algorithms.community module) or community_louvain.best_partition() (from the community module), A is always in the same community as B. I've tried this at various resolutions, but keep getting similar results: the Python modules are grouping A and B significantly more often than Gephi.
My question is: What is the Gephi method doing differently? My understanding is that Gephi uses the Louvain method; the variables I can see (resolution, using weights, etc.) seem to be the same. Why the disparity?
Edit:
I was asked to provide some code for this.
Basically I have edge tuples with weights like so:
edge_tuples = [(A,B,5),(A,C,11),(B,C,41),…]
I have nodes as a list:
nodes = [A,B,C,…]
I use networkx to make a graph:
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_weighted_edges_from(edge_tuples)
If I’m using community, I get the partition like so:
partition = community.community_louvain.best_partition(G, resolution=0.7)
The resolution could be whatever, but .7 is in line with what I've tried before.
In Gephi, I'm just using ordinary node and edge tables. These are generated and exported as csv's in the process of creating the edge_tuples described above (i.e. I make them both out of the same data and just export a csv before making the networkx graph), so I don't see where the underlying data would be differing, although I'm certainly open to correction.
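For reference, here is a minimal runnable version of the workflow above, with hypothetical letter nodes and weights standing in for the real data (networkx >= 2.8 for louvain_communities, and the python-louvain package for community):

import networkx as nx
import community  # the python-louvain package

# Hypothetical stand-ins for the real node list and weighted edge tuples
nodes = ["A", "B", "C", "D"]
edge_tuples = [("A", "B", 5), ("A", "C", 11), ("B", "C", 41), ("C", "D", 2)]

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_weighted_edges_from(edge_tuples)

# python-louvain: dict mapping node -> community id
partition = community.community_louvain.best_partition(G, resolution=0.7)

# networkx's own Louvain: list of sets of nodes
communities = nx.algorithms.community.louvain_communities(G, resolution=0.7)

print(partition)
print(communities)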

Measuring the "remoteness" of a node in a graph

I mapped out all the edges of a graph in the ancient BBS game 'TradeWars 2002'. There are 1000 nodes. The graph is officially a directed graph, although most edges between nodes run in both directions. The graph is strongly connected.
I modelled the universe in networkx. I'd like to use networkx methods to identify the "most remote" nodes in the network. I don't know how to articulate "most remote" in graph theory terminology, though. But the idea I have is nodes that would be bumped into very rarely when someone is transiting between two other arbitrary nodes. I also have the idea that on the edge of the well-connected nodes there might be a string of nodes extending out along a single path that terminates.
A visualization of what I imagine is node 733: it's pretty unlikely someone accidentally stumbles onto that one, compared to other better-connected nodes.
What could I use from the networkx library to quantify some measure of 'remoteness'?
This is the entire universe:
But the idea I have is nodes that would be bumped into very rarely when someone is transiting between two other arbitrary nodes.
As @Joel mentioned, there are many centrality measures available, and there are often strong correlations between them, such that many of them will probably give you more or less what you want.
That being said, I think the class of centrality measures that most closely reflects your intuition is based on random walks. Some of these are pretty costly to compute (although see this paper for some recent improvements on that front), but luckily there is a strong correspondence between eigenvector centrality and the frequency with which nodes are visited by a random walker.
The implementation in networkx is available via networkx.algorithms.centrality.eigenvector_centrality.
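A minimal sketch (on a random stand-in graph, since the real map isn't included here) that ranks nodes by eigenvector centrality, treating the lowest-scoring nodes as the most "remote":

import networkx as nx

# Random stand-in for the real 1000-node map
G = nx.gnp_random_graph(50, 0.1, seed=1)

ec = nx.eigenvector_centrality(G, max_iter=1000)
# Low eigenvector centrality ~ rarely visited by a random walker
print(sorted(ec, key=ec.get)[:5])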
networkx has a collection of algorithms for this kind of problem: centrality. For example, you can use the simplest of them, closeness_centrality:
import networkx as nx

# Create a random graph
G = nx.gnp_random_graph(50, 0.1)
nx.closeness_centrality(G)
{0: 0.3888888888888889,
1: 0.45794392523364486,
2: 0.35507246376811596,
3: 0.4375,
4: 0.4083333333333333,
5: 0.3684210526315789,
...
# Draw the graph
labels = {n: n for n in G.nodes}
nx.draw(G, with_labels=True, labels=labels)
And the most remote (least central) nodes can be listed by sorting on the smallest closeness_centrality values (note the node IDs and the nodes in blue circles in the picture above):
c = nx.closeness_centrality(G)
sorted(c.items(), key=lambda x: x[1])[:5]
[(48, 0.28823529411764703),
(7, 0.33793103448275863),
(11, 0.35251798561151076),
(2, 0.35507246376811596),
(46, 0.362962962962963)]

Personalized Page Rank

I have been trying to wrap my head around the personalized PageRank algorithm and how it works. I came across this paper, which gives a graph (see link to image below) with weights calculated by PPR. I am having trouble reproducing the calculations with the models they give.
Can anyone break it down for me to help me wrap my head around the concept?
Thanks!
The paper is a good reference for personalized PageRank. My basic understanding is that PPR scores tell you the probability of moving from the source node to the target node; it is a score describing the relationship between a specific source node and target node in the graph.
If you have trouble reproducing the results, you can use networkx in Python: load the graph and compute PPR using
networkx.pagerank(graph, personalization={'a':0, 's':1, 'b':0....})
networkx uses a power-iteration approach to compute PPR, so you can get exactly the results shown in the example.
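As a concrete sketch, with a hypothetical toy graph and placeholder node names (not the graph from the paper), restarting the walk at source node 's':

import networkx as nx

# Hypothetical toy graph; the node names are placeholders
G = nx.DiGraph([("s", "a"), ("a", "b"), ("b", "s"), ("a", "s"), ("b", "a")])

# Put all restart probability on the source node 's'
ppr = nx.pagerank(G, alpha=0.85, personalization={"s": 1, "a": 0, "b": 0})
print(ppr)  # a dict of node -> score; the scores sum to 1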
The author of the thesis has C++ code here: https://github.com/snap-stanford/snap/blob/master/snap-core/randwalk.h. Since that method is a random-walk-based approach, you will not get exactly the same results as shown in the example, but the ranking is correct.
Hope that helps.

Understanding Titan Traversals

I am trying to write a highly scalable system with titandb. I have a situation where some nodes are highly connected.
Imagine the following example at much larger scale.
Now I have the following situations:
I want to find all the friends of node X.
I want to find a specific friend of node X, for example 5.
For scenario 1 I do: g.V(X).out('friend').toList(). For scenario 2 I do: g.V(X).out('friend').hasId(5).next(). Both of these traversals will work but scale poorly as X gets more friends. Can I optimise this situation by putting more information on the edge label? For example, if on the edge between X and 5 I change the label to friend_with_5, will the following be faster:
`g.V(X).out('friend_with_5').next()`
From my understanding this will be faster, as only one edge will be traversed. However, if I make such a change to my edge labels, how would I find all the friends of X?
You could encode data into your edge label, but doing so comes at the cost of complicating your graph schema considerably and, as you note, makes it hard to do simple things like "find all my friends". I don't think you should take that approach.
The preferred method for dealing with this is with vertex-centric indices. If you denormalize any data to your edges, you should do it with those indices in mind (and not by encoding that data into the edge label). Put some unique identifier for the friend on the "friend" edge and index that.
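A sketch of what that might look like from Python with the gremlinpython client (hypothetical names throughout: 'friendId' is the denormalized edge property suggested above, assumed to be covered by a Titan vertex-centric index; a Titan-era stack may need the Groovy equivalent instead):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Hypothetical server endpoint and vertex id
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)
X = 1

# Scenario 1: all friends of X -- the edge label stays a plain 'friend'
friends = g.V(X).out("friend").toList()

# Scenario 2: one specific friend, filtered on the indexed edge property
# instead of encoding the friend's id into the label itself
friend5 = g.V(X).outE("friend").has("friendId", 5).inV().next()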
If your supernodes are especially large (millions+ edges) you should also consider Titan's vertex partitioning feature.

Heuristics to find "surprising" mutual friends

I have the undirected friendship graph of my Facebook friends, G, i.e. G[i][j] = G[j][i] = true iff my friends i and j are friends with each other. I want to find "surprising" mutual friends, i.e. pairs of my friends I normally would not expect to know each other. What are some good heuristics/algorithms I can apply? My initial idea is to run a clustering algorithm (not sure which one is best) and see if I can find edges going across clusters. Any other ideas? What's a good clustering algorithm I can use that takes in G and spits out clusters?
Here is my idea. Friendship is an edge. A surprising friendship is an edge such that, if you remove it, the distance between the two endpoints becomes very large.
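A sketch of that idea in networkx (the function name is mine, and the graph is a hypothetical stand-in): for each friendship edge, temporarily remove it and measure how far apart its endpoints become.

import networkx as nx

def surprise_scores(G):
    """Score each edge by the endpoint distance after removing it."""
    scores = {}
    for u, v in list(G.edges()):
        G.remove_edge(u, v)
        try:
            scores[(u, v)] = nx.shortest_path_length(G, u, v)
        except nx.NetworkXNoPath:
            scores[(u, v)] = float("inf")  # the edge was the only link
        G.add_edge(u, v)
    return scores

# Highest scores = most "surprising" friendships
G = nx.karate_club_graph()  # stand-in friendship graph
top = sorted(surprise_scores(G).items(), key=lambda kv: kv[1], reverse=True)[:5]
print(top)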
The answer by Wu Yongzheng can be tied to an existing network concept that is robust and perhaps more sensitive: a quantitative take on "the distance between the two nodes becomes very large". This concept is edge betweenness. In this context one would compute an estimated version. See e.g. https://en.wikipedia.org/wiki/Betweenness_centrality and http://igraph.sourceforge.net/doc/R/betweenness.html.
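In networkx, an estimated (sampled) edge betweenness looks like the sketch below; the k parameter controls how many pivot nodes are sampled for the approximation:

import networkx as nx

G = nx.karate_club_graph()  # stand-in friendship graph

# k < n samples pivot nodes, giving an estimate rather than exact values
eb = nx.edge_betweenness_centrality(G, k=10, seed=42)
print(sorted(eb.items(), key=lambda kv: kv[1], reverse=True)[:5])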