Functions for pruning a NetworkX graph? - networkx

I am using NetworkX to generate graphs of some noisy data. I'd like to "clean up" the graph by removing branches that are spurious, and hope to avoid re-inventing the wheel.
For example, the linked picture shows a sample set of graphs, as colored nodes connected by gray lines. I'd like to prune the nodes/edges indicated by the white boxes: http://www.broadinstitute.org/~mbray/example_tree.png
Essentially, the nodes/edges to be removed are branches typically only a few nodes (< 3) in length. By removing them, I hope to have a tree with a minimum of branching but the branches that do remain are "suitably" long.
Before I start crafting code to examine subtrees for removal, are there NetworkX functions that can be used for this purpose?

You can use the betweenness_centrality score of the nodes. If a node with a low centrality score is connected to a node with a remarkably higher centrality score, and has 3 edges, then you can remove the low-centrality node. (The rest of the short branch, the fewer than 3 connected nodes, is then no longer connected to the main graph.)
You'll need to experiment with the phrase "remarkably higher".
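As a complement to the centrality heuristic, here is a minimal sketch of a different, more direct technique: walk inward from every leaf and drop the branch if it reaches a junction after fewer than a given number of nodes. The function name prune_short_branches and the parameter min_length are hypothetical, and the sketch assumes an undirected, tree-like graph without self-loops; it is not a built-in NetworkX function.

```python
import networkx as nx

def prune_short_branches(G, min_length=3):
    """Return a copy of G without leaf branches shorter than min_length nodes.

    Hypothetical helper: assumes an undirected, tree-like graph.
    """
    H = G.copy()
    to_remove = set()
    leaves = [n for n, deg in H.degree() if deg == 1]
    for leaf in leaves:
        branch = [leaf]
        prev, cur = leaf, next(iter(H[leaf]))
        # Follow the chain of degree-2 nodes away from the leaf.
        while H.degree(cur) == 2:
            branch.append(cur)
            a, b = H[cur]
            prev, cur = cur, (b if a == prev else a)
        # Only prune if the chain ended at a junction (degree >= 3).
        if H.degree(cur) >= 3 and len(branch) < min_length:
            to_remove.update(branch)
    H.remove_nodes_from(to_remove)
    return H

# Example: a 10-node backbone with one short and one long side branch.
G = nx.path_graph(10)
nx.add_path(G, [5, 100, 101])            # short branch (2 nodes)
nx.add_path(G, [3, 200, 201, 202, 203])  # long branch (4 nodes)
pruned = prune_short_branches(G, min_length=3)
```

This is a single pass: if pruning turns a former junction into a new short stub, you would need to call the function again. As with the centrality threshold, min_length needs some experimentation on real data.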

Related

Creation of two networks with the same node coordinates

I create a network and add nodes and edges. I view it (it creates a dot and pdf file automatically). Later, I want to create a second network with the same nodes but different edges. I want to place the nodes at the same coordinates, so that I can easily compare both graphs. I tried to get the coordinates of the first graph and to set the coordinates of the nodes, but I couldn't find proper functions to do that. I also checked the networkx package. I also tried to get a copy of the first network and delete the edges, with no success. Can someone please show me how to create a second network with the same node coordinates?
This is the simple network creation code
import graphviz as G
network1 = G.Digraph(
    graph_attr={...},
    node_attr={...},
    edge_attr={...} )
network1.node("xxx")
network1.node("yyy")
network1.node("zzz")
network1.edge("xxx", "yyy")
network1.edge("yyy", "zzz")
network1.view(file_name)
First, calculate the node positions for the first graph using the layout of your choice (say, the spring layout):
node_positions = nx.layout.spring_layout(G1)
Now, you can draw this graph and any other graph with the same nodes in the same positions:
nx.draw(G1, with_labels=True, pos=node_positions)
nx.draw(G2, with_labels=True, pos=node_positions)
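Put together, the networkx approach might look like the sketch below. It assumes the graphs are rebuilt as networkx DiGraph objects (the question uses the graphviz package instead); the second graph's edge and the helper name draw_both are made up for illustration.

```python
import networkx as nx

# First network, mirroring the nodes and edges from the question.
G1 = nx.DiGraph([("xxx", "yyy"), ("yyy", "zzz")])

# Second network: same nodes, different (hypothetical) edges.
G2 = nx.DiGraph()
G2.add_nodes_from(G1.nodes)
G2.add_edge("xxx", "zzz")

# Compute the positions once; a fixed seed makes the layout reproducible.
node_positions = nx.spring_layout(G1, seed=42)

def draw_both(G1, G2, pos):
    """Draw both graphs side by side with identical node coordinates."""
    import matplotlib.pyplot as plt
    fig, (ax1, ax2) = plt.subplots(1, 2)
    nx.draw(G1, pos=pos, with_labels=True, ax=ax1)
    nx.draw(G2, pos=pos, with_labels=True, ax=ax2)
    plt.show()
```

Calling draw_both(G1, G2, node_positions) shows the two graphs next to each other; because both use the same pos dictionary, only the edges differ.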
Graphviz's layers feature might also be interesting:
https://graphviz.org/faq/#FaqOverlays
Here is a working example of using layers - ignore the last two lines that create a video.
https://forum.graphviz.org/t/stupid-dot-tricks-2-making-a-video/109
And here is some more background:
https://forum.graphviz.org/t/getting-layers-to-work-with-svg/107

Choosing a networkx layout that takes edge labels into account

I'm plotting networkx weighted graphs using the draw_networkx_edge_labels function. My problem is that, since edges sometimes cross each other, it is not always clear from the plot which weight belongs to which edge. For instance, in the following plot it is not immediately clear whether 2 is the weight of (1,2) or (3,7).
I'm currently using the neato layout, which does not take edge labels into account. In particular, this is how I'm drawing a weighted graph g:
layout = nx.nx_pydot.graphviz_layout(g, prog='neato')
nx.draw(g, pos=layout)
edge_labels = nx.get_edge_attributes(g, 'weight')
nx.draw_networkx_edge_labels(g, pos=layout, edge_labels=edge_labels)
I know I can manually control the position of the label along an edge using the label_pos parameter, but my question is whether there exists a way to automatically plot the graph such that edge labels do not usually collide (either using a layout that takes labels into account or a method that "neatly" selects label positions along edges).
I'm not expecting something that always works, but since my graphs are relatively sparsely connected, I hope there's a method that at least has a tendency to work well.
I have been meaning to implement this in netgraph, my fork of the networkx drawing utilities, for a while now. Unfortunately, I have a job interview on Thursday, so I won't have time to write this anytime soon. The basic idea, however, is pretty simple, and is also already implemented in some R packages such as ggrepel and also ggnetwork.
The basic idea is that you use a force directed layout to position your labels, given a predetermined and fixed layout for your nodes and edges. So:
1. Compute a node layout using the layout of your choice.
2. Partition each edge into a chain of many, many nodes, and compute the positions of the "edge nodes" using the already known positions of the source and target nodes of the edge. This partitioning is to give each edge a "mass" in the following force directed layout.
3. For each edge, add a "label" node and connect it to the most central "edge node".
4. Compute a force-directed layout keeping all nodes but the label nodes fixed (e.g. using spring_layout in networkx).
You should now have sensible edge label coordinates that do not overlap any of the edges. Use plt.annotate to plot a connection between the edge and the edge label.
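The steps above can be sketched roughly as follows. This is an illustration of the idea, not the netgraph implementation: the function name edge_label_positions and the parameters n_segments, offset and k are assumptions, and the values chosen here are untuned.

```python
import networkx as nx
import numpy as np

def edge_label_positions(G, node_pos, n_segments=10, offset=0.05, k=0.1, seed=0):
    """Place one label per edge with a force-directed layout (hypothetical helper)."""
    H = nx.Graph()   # auxiliary graph of "edge nodes" and "label nodes"
    pos = {}
    fixed = []
    labels = {}
    for u, v in G.edges():
        p_u = np.asarray(node_pos[u], dtype=float)
        p_v = np.asarray(node_pos[v], dtype=float)
        # Partition the edge into a chain of fixed "edge nodes".
        chain = [("edge", u, v, i) for i in range(n_segments + 1)]
        for i, name in enumerate(chain):
            pos[name] = p_u + (p_v - p_u) * i / n_segments
        nx.add_path(H, chain)
        fixed.extend(chain)
        # Attach a movable "label node" to the most central edge node.
        mid = chain[n_segments // 2]
        label = ("label", u, v)
        H.add_edge(label, mid)
        # Start the label slightly off the edge, along its normal.
        d = p_v - p_u
        normal = np.array([-d[1], d[0]])
        length = np.linalg.norm(normal)
        normal = normal / length if length > 0 else np.array([0.0, 1.0])
        pos[label] = pos[mid] + offset * normal
        labels[(u, v)] = label
    # Only the label nodes move; everything else stays where it is.
    result = nx.spring_layout(H, pos=pos, fixed=fixed, k=k, seed=seed)
    return {edge: result[label] for edge, label in labels.items()}

# Example on a small graph.
G = nx.path_graph(4)
node_pos = nx.spring_layout(G, seed=1)
label_pos = edge_label_positions(G, node_pos)
```

The returned coordinates can be passed to plt.annotate to draw each weight with a connector back to its edge. The auxiliary graph grows with the number of edges times n_segments, so a smaller n_segments may be needed for large graphs.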

GKGraph GKGraphNode GKGridGraphNode, what's relationship for them?

I've read the documentation but am still confused by them. Could anyone give a clear explanation, e.g. with an image comparison? Thanks.
The Wikipedia article on Pathfinding might help, as might the related topics on graphs and graph search algorithms linked from there. Beyond that, here's an attempt at a quick explainer.
Nodes are places that someone can be, and their connections to other nodes define how someone can travel between places. Together, a collection of (connected) nodes forms a graph.
GKGraphNode is the most general form of node — these nodes don't know anything about where they are in space, just about their connections to other nodes. (That's enough for basic pathfinding, though... if you have a graph where A is connected to B and B is connected to C, the path from A to C goes through B regardless of where those nodes are located, like below.)
GKGraph is a collection of nodes, and provides functions that work on the graph as a whole, like the important one for finding paths.
GKGridGraphNode and GKGraphNode2D are specialized versions of GKGraphNode that add knowledge of the node's position in space — either integer grid space (like a chessboard) or open 2D space. Once you've added that kind of information, a GKGraph containing these kinds of nodes can take distance into account when pathfinding.
For example, look at this image:
If we're just using GKGraphNode, all we're talking about is which nodes are connected to which. So if we ask for the shortest path from A to D, we can get either ACD or ABD, because it's an equal number of connections either way. But if we use GKGridGraphNode or GKGraphNode2D, we're looking at the lengths of the lines between nodes, in which case ACD is the shortest path.
Once you start locating your nodes in (some sort of coordinate) space, it helps to be able to operate on the graph as a whole in that space. That's where GKGridGraph and GKObstacleGraph come in.
GKGridGraph works with GKGridGraphNodes and lets you do things like create a graph to fill a set of dimensions (say, a 10x10 grid, with diagonal movement allowed) instead of making you create and connect a bunch of nodes yourself.
GKObstacleGraph adds more to free-2D-space graphs by letting you mark areas as impassable obstacles and automatically managing the nodes and connections to route around obstacles.
Hopefully this helps a bit. For more, besides the reference docs and guide, Apple also has a WWDC video that shows how this stuff works.

Additional forces to networkx spring_layout

I would like to add additional forces to networkx spring_layout.
I have a directed graph and I would like nodes to move to different sides according to the edges that they have. Nodes that have more outgoing edges should drift left, and nodes that have more ingoing edges should drift right. Alternatively, these groups of nodes could drift towards each other: nodes with outgoing edges would get closer together, while nodes with ingoing edges would also get closer to each other.
I managed to look into the source code of spring_layout of networkx http://networkx.lanl.gov/archive/networkx-0.37/networkx.drawing.layout-pysrc.html#spring_layout
but everything there is just beyond my comprehension
import networkx as nx
G = nx.DiGraph()
G.add_edges_from([(1,5),(2,5),(3,5),(5,6),(5,7)])
The layout should show nodes 1, 2 and 3 closer to each other, and the same for nodes 6 and 7.
I imagine, that I could solve this by adding invisible edges via using MultiDiGraph. I could count ingoing and outgoing edges of each node and add invisible edges that connect the two groups. However, I am very sure that there are better ways of solving the problem.
Adding weights into the mix would be a good way to group things (with those invisible nodes). But the layouts have no way of knowing left from right. To get the exact layout you want you could specify each point's x,y coordinates.
import networkx as nx
G=nx.Graph()
G.add_node(1,pos=(1,1))
G.add_node(2,pos=(2,3))
G.add_node(3,pos=(3,4))
G.add_node(4,pos=(4,5))
G.add_node(5,pos=(5,6))
G.add_node(6,pos=(6,7))
G.add_node(7,pos=(7,9))
G.add_edges_from([(1,5),(2,5),(3,5),(5,6),(5,7)])
pos=nx.get_node_attributes(G,'pos')
nx.draw(G,pos)
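If hand-picking every coordinate is too rigid, the invisible-edge idea from the question can be combined with spring_layout's pos and fixed arguments. The sketch below is one possible reading of that idea, not a built-in feature: the anchor names, their coordinates and the starting positions are all assumptions.

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([(1, 5), (2, 5), (3, 5), (5, 6), (5, 7)])

# Work on an undirected copy so the helper edges never touch G itself.
H = G.to_undirected()
LEFT, RIGHT = "_left_anchor", "_right_anchor"   # hypothetical invisible nodes
init = {LEFT: (-1.0, 0.0), RIGHT: (1.0, 0.0)}

for i, n in enumerate(G.nodes):
    out_deg, in_deg = G.out_degree(n), G.in_degree(n)
    if out_deg > in_deg:
        H.add_edge(n, LEFT)     # invisible edge pulling the node left
        x = -0.5
    elif in_deg > out_deg:
        H.add_edge(n, RIGHT)    # invisible edge pulling the node right
        x = 0.5
    else:
        x = 0.0
    init[n] = (x, 0.1 * i)      # distinct y values avoid coincident starts

# The anchors stay put; all real nodes are free to move.
layout = nx.spring_layout(H, pos=init, fixed=[LEFT, RIGHT], seed=0)

# Keep only the positions of the real nodes for drawing.
pos = {n: p for n, p in layout.items() if n in G}
```

Drawing with nx.draw(G, pos) then shows only the original nodes and edges. Giving the helper edges a larger weight attribute should strengthen the left/right pull.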

How do I visualise clusters of users?

I have an application in which users interact with each other. I want to visualize these interactions so that I can determine whether clusters of users exist (within which interactions are more frequent).
I've assigned a 2D point to each user (where each coordinate is between 0 and 1). My idea is that two users' points move closer together when they interact, an "attractive force", and I just repeatedly go through my interaction logs over and over again.
Of course, I need a "repulsive force" that will push users apart too, otherwise they will all just collapse into a single point.
First I tried monitoring the lowest and highest of each of the XY coordinates and normalizing the positions, but this didn't work: a few users with a small number of interactions stayed at the edges, and the rest all collapsed into the middle.
Does anyone know what equations I should use to move the points, both for the "attractive" force between users when they interact, and a "repulsive" force to stop them all collapsing into a single point?
Edit: In response to a question, I should point out that I'm dealing with about 1 million users, and about 10 million interactions between users. If anyone can recommend a tool that could do this for me, I'm all ears :-)
In the past, when I've tried this kind of thing, I've used a spring model to pull linked nodes together, something like: dx = -k*(x-l). Here dx is the change in position, x is the current separation between the two linked nodes, l is the desired separation, and k is the spring coefficient, which you tweak until you get a nice balance between spring strength and stability (it'll be less than 0.1). Having l > 0 ensures that everything doesn't end up in the middle.
In addition to that, a general "repulsive" force between all nodes will spread them out, something like: dx = k / x^2, where x is again the distance between the two nodes. This will be larger the closer two nodes are; tweak k to get a reasonable effect.
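As a rough illustration of these two update rules, here is a minimal NumPy sketch. The function name layout_step, the constants and the max_push cap are assumptions chosen for a toy example, and the all-pairs repulsion is O(n^2), so it would not scale to a million users as written.

```python
import numpy as np

def layout_step(pos, edges, k_spring=0.05, rest_len=0.1,
                k_repel=1e-4, max_push=0.05):
    """Apply one spring + repulsion update and return the new positions."""
    new_pos = pos.copy()

    # Attractive spring along each interaction: dx = -k * (x - l).
    for i, j in edges:
        delta = pos[j] - pos[i]
        dist = np.linalg.norm(delta)
        if dist == 0:
            continue
        unit = delta / dist
        dx = -k_spring * (dist - rest_len)   # desired change in separation
        new_pos[i] -= 0.5 * dx * unit        # each endpoint takes half
        new_pos[j] += 0.5 * dx * unit

    # Repulsion between every pair of nodes: dx = k / x^2.
    diff = pos[:, None, :] - pos[None, :, :]   # diff[i, j] points from j to i
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)             # no self-repulsion
    dist = np.maximum(dist, 1e-9)
    push = np.minimum(k_repel / dist**2, max_push)  # cap to keep steps stable
    new_pos += (diff / dist[..., None] * push[..., None]).sum(axis=1)
    return new_pos

# Toy example: two groups of three users who only interact within their group.
rng = np.random.default_rng(0)
pos = rng.random((6, 2))
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]
for _ in range(1000):
    pos = layout_step(pos, edges)
```

After enough steps the linked users settle a short distance apart while the two groups drift away from each other. For large data sets the repulsion term would need an approximation (for example a spatial grid) or a dedicated tool.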
I can recommend some possibilities: first, try log-scaling the interactions or running them through a sigmoidal function to squash the range. This will give you a smoother visual distribution of spacing.
Independent of this scaling issue: look at some of the rendering strategies in graphviz, particularly the programs "neato" and "fdp". From the man page:
neato draws undirected graphs using ``spring'' models (see Kamada and
Kawai, Information Processing Letters 31:1, April 1989). Input files
must be formatted in the dot attributed graph language. By default,
the output of neato is the input graph with layout coordinates
appended.
fdp draws undirected graphs using a ``spring'' model. It relies on a
force-directed approach in the spirit of Fruchterman and Reingold (cf.
Software-Practice & Experience 21(11), 1991, pp. 1129-1164).
Finally, consider one of the scaling strategies, an attractive force, and some sort of drag coefficient instead of a repulsive force. Actually moving things closer and then possibly farther later on may just get you cyclic behavior.
Consider a model in which everything will collapse eventually, but slowly. Then just run until some condition is met (a node crosses the center of the layout region or some such).
Drag or momentum can just be encoded as a basic resistance to motion and amount to throttling the movements; it can be applied differentially (things can move slower based on how far they've gone, where they are in space, how many other nodes are close, etc.).
Hope this helps.
The spring model is the traditional way to do this: make an attractive force between each node based on the interaction, and a repulsive force between all nodes based on the inverse square of their distance. Then solve, minimizing the energy. You may need some fairly high powered programming to get an efficient solution to this if you have more than a few nodes. Make sure the start positions are random, and run the program several times: a case like this almost always has several local energy minima in it, and you want to make sure you've got a good one.
Also, unless you have only a few nodes, I would do this in 3D. An extra dimension of freedom allows for better solutions, and you should be able to visualize clusters in 3D as well as, if not better than, in 2D.