checking if a directed graph has only single one topological sort - directed-graph

I'm trying to write pseudo-code for an algorithm that suppose to check whether a directed graph has only single one topological ordering. I've already come up with the pseudo-code for a topological sort (using DFS), but it does not seem to help me much. I wonder if there is no sinks in that graph -then there's not a single one topological ordering (might it help?).

This is an improvement of this answer, as the best possible runtime is improved when it starts at a vertex with no outgoing edges.
If your directed graph has N vertices, and you have exactly one starting vertex with indegree 0,
Do DFS (but only on the starting vertex) to get a topological sort L.
If L doesn't have N vertices, either some vertices are unreachable (those are part of a cycle) or you need another starting vertex (so there are multiple toposorts).
For i in [0,N-2],
If there is no edge from L[i] to L[i+1],
Return false.
Return true.
Alternatively, modify Kahn’s algorithm: if more than one vertex is in the set (of vertices of indegree 0) while the algorithm is running, return false. Otherwise, return true.

Related

Measuring the "remoteness" of a node in a graph

I mapped out all the edges of a graph in the ancient BBS game 'TradeWars 2002'. There are 1000 nodes. The graph is officially a directed graph, although most edges between nodes are undirected. The graph is strongly connected.
I modelled the universe in networkx. I'd like to use networkx methods to identify the "most remote" nodes in the network. I don't know how to articulate "most-remote" in graph theory terminology though. But the idea I have is nodes that would be bumped into very rarely when someone is transitting between two other arbitrary nodes. And the idea that on the edge of the well-connected nodes, there might be a string of nodes that extend out along a single path that terminates.
I visualization of what I imagine is node 733. Pretty unlikely someone accidentally stumbles onto that one, compared to other better-connected nodes.
What could I use from networkx library to quantify some measure of 'remoteness'?
This is the entire universe:
But the idea I have is nodes that would be bumped into very rarely when someone is transitting between two other arbitrary nodes.
As #Joel mentioned, there are many centrality measures available, and there are often strong correlations between them such that many of them will probably give you more or less what you want.
That being said, I think the class of centrality measures that most closely reflect your intuition are based on random walks. Some of these are pretty costly to compute (although see this paper for some recent improvements on that front) but luckily there is a strong correspondence between the Eigenvector centrality and the frequency with which nodes are visited by a random walker.
The implementation in networkx is available via networkx.algorithms.centrality.eigenvector_centrality.
networkx has a collection of algorithms for this kind of problems: centrality. For example, you can use the simpliest function: closeness_centrality:
# Create a random graph
G = nx.gnp_random_graph(50, 0.1)
nx.closeness_centrality(G)
{0: 0.3888888888888889,
1: 0.45794392523364486,
2: 0.35507246376811596,
3: 0.4375,
4: 0.4083333333333333,
5: 0.3684210526315789,
...
# Draw the graph
labels = {n: n for n in G.nodes}
nx.draw(G, with_labels=True, labels=labels)
And the most remote (the less central) nodes can be listed by returning nodes with the least closeness_centrality (note nodes IDs and nodes in blue circles in the upper picture:
c = nx.closeness_centrality(G)
sorted(c.items(), key=lambda x: x[1])[:5]
[(48, 0.28823529411764703),
(7, 0.33793103448275863),
(11, 0.35251798561151076),
(2, 0.35507246376811596),
(46, 0.362962962962963)]

Interpretation of 'ufactor' on a toy graph clustering

I am trying to do a imbalanced partition by METIS. I do not need equal number of vertices in each cluster(which is done by default in METIS). My graph has no constraints, it's a undirected unweighted graph. Here is a example toy graph clustered by METIS without no ufactor parameter.
Then, i tried with different ufactor and at value 143, METIS starts to
do the expected cluster like the following-
Can anybody interpret this. Eventually, I want to find a way to guess an ufactor from any unbalanced and undirected graph that will minimize the normalized cut without doing any balance necessarily.
Imbalance=1+(ufactor/1000). By default imbalance=1. Number of vertex in largest cluster-
imbalance*(number of vertex/number of cluster)
For first picture(default clustering)- number of vertex in larges cluster-
1*(14/2)=7, so the second cluster is also 14-7=7
In the second picture(ufactor 143)-
imbalance=1+143/1000=1.143
so, 1.143*(14/2)=8.001
That allows the largest cluster to have 8 vertex.

Graph projections with Gremlin and Titan

I would like to extract graph projections from a graph so that I can build smaller graphs from large ones for specific use cases (in most of the cases I can think of these graph projections will have a smaller size than source graph). Consider the following simple example in my original graph (all nodes have different types aka labels):
A -> B -> C -> D
Since the path from A to D exists, in my new graph I would create the nodes A and D and the edge between them. These paths could be easily discovered with a traversal (or even with the subgraph step I think):
g.V().as('a').out('aTob').out('bToC').out('cToD').as('d').select('a', 'd');
The thing is that these kind of traversals where one does not start from a specific set of nodes (where an index would be used) are not very efficient since they require a full graph scan.
16:46:27 WARN com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx - Query requires iterating over all vertices. For better performance, use indexes
What kind of actions can be done here so that graph projections can be accomplished in an efficient way?

Keeping track of moves when pre-processing graphs

I am writing an algorithm in MATLAB to pre-process a large graph for use with a path-finding algorithm, and I am curious as to the best way that I can keep track of my moves in order to be able to reconstruct the solution and project it onto the original graph.
The pre-processing methods I am using so far are relatively simple; 3 techniques I am using are:
1) Remove long edges:
Any edge (a,b) that can be reached by sequence (a,c,b) where (a,b) > (a,c)+(c,b), is removed
2) Remove vertices with degree 1
If a vertex with one edge coming out of it is not either the start or end-point of the path, then that vertex will never be part of the path, and it can be removed
3) Remove vertices with degree 2
If a vertex b has two edges coming out of it, then b can be removed and edges (a,b) and (b,c) can be replaced by a single edge (a,c) with length (a,b) + (b,c).
The algorithm iterates through these 3 techniques until no further changes are possible in the graph, at which point it removes all the empty rows and columns in the graph adjacency matrix and returns the reduced graph for use with the path-finding algorithm.
The pre-processing algorithm works great, in some cases I am able to achieve a reduction of around 70% in the graph size, and my path-finding algorithm is able to find a path of the same quality as the un-processed graph but an order of magnitude faster.
My problem now is in reconstructing the solution on the original graph, so-called "post-processing".
I feel like I should be keeping track of all the moves my pre-processing algorithm makes and then applying them in reverse order after it has finished, I am just not quite sure how I should go about that..
Here is what I had in mind:
First, keep track of all the empty rows and columns I removed from the matrix after pre-processing and re-insert them.
Then have a simple vector with indices representing the move number and the value representing what type of move.
then have one cell array for each of the 3 move "types" containing the data from each move in the order they were performed, with their own iteration counter.
then if i iterate backwards over the move list, it will tell me which cell array to access, and then i can apply the reverse operation that is next on that list (kind of like a stack data structure)
this seems a bit unwieldy to me, so I was wondering if anyone else had any ideas as to a good method of keeping track of my moves that is easily reversible?
EDIT: I thought about posting this on the computer science stack exchange; but my question isn't really about the pre-processing methods themselves, but about data storage and retrieval and the implementation itself. But feel free to migrate it if you think it would be better suited elsewhere

Kademlia XOR metric properties purposes

In the Kademlia paper by Petar Maymounkov and David Mazières, it is said that the XOR distance is a valid non-Euclidian metric with limited explanations as to why each of the properties of a valid metric are necessary or interesting, namely:
d(x,x) = 0
d(x,y) > 0, if x != y
forall x,y : d(x,y) = d(y,x) -- symmetry
d(x,z) <= d(x,y) + d(y,z) -- triangle inequality
Why is it important for a metric to have these properties in general? Why is each of these properties necessary in the context of routing queries in the Kademlia Distributed Hash Table implementation?
In addition, the paper mentions that unidirectionality (for a given x, and a distance l, there exist only a single y for which d(x,y) = l) guarantees that all queries will converge along the same path. Why is that so?
I can only speak for Kademlia, maybe someone else can provide a more general answer. In the meantime...
d(x,x) = 0
d(x,y) > 0, if x != y
These two points together effectively mean that the closest point to x is x itself; every other point is further away. (This may seem intuitive, but other aspects of the XOR metric aren't.)
In the context of Kademlia, this is important since a lookup for node with ID x will yield that node as the closest. It would be awkward if that were not the case, since a search converging towards x might not find node x.
forall x,y : d(x,y) = d(y,x)
The structure of the Kademlia routing table is such that nodes maintain detailed knowledge of the address space closest to them, and exponentially decreasing knowledge of more distant address space. In short, a node tries to keep all the k closest contacts it hears about.
The symmetry is useful since it means that each of these closest contacts will be maintaining detailed knowledge of a similar part of the address space, rather than a remote part.
If we didn't have this property, it might be helpful to think of the search as more like the hands of a clock moving in one direction round a clockface. The node at 1 o'clock (Node1) is close to Node2 at 2 o'clock (30°), but Node2 is far from Node1 (330°). So imagine we're looking for the two closest to 3 o'clock (i.e. Node1 and Node2). If the search reaches Node2, it won't know about Node1 since it's far away. The whole lookup and topology would have to change.
d(x,z) <= d(x,y) + d(y,z)
If this weren't the case, it would be impossible for a node to know which contacts from its routing table to return during a lookup. It would know the k closest to the target, but there would be no guarantee that one of the other more distant contacts wouldn't yield a shorter overall path.
Because of this property and unidirectionality, different searches starting from vastly separated points will tend to converge down the same path.
The unidirectionality means that no two nodes can have the same distance from a given point. If that weren't the case, then the target point could be encircled by a bunch of nodes all the same distance from it. Then various different searches would be free pick any of those to pass through. However, unidirectionality guarantees that exactly one of this bunch will be the closest, and any search which chooses between this group will always select the same one.
I've been bashing my head on this for quite some time: how can the XOR - as in the number of differing bits, a proper Hamming distance - be the basis of a total order?
Well it can't, such a metric on its own is not enough for a comparable relationship, all it can do is dump nodes in circles around a point.
Then I read the paper more closely and noticed that it says "the XOR as an integer value" and it dawned on me: the crux is not the "XOR metric", but the length of the common prefix of the ID (of which XOR is a derivation mechanism.)
Take two nodes with the same Hamming distance from "self" and the length of their prefix common to "self": the one with shortest common prefix is the furthest node.
The paper uses "XOR distance metric" but it really should read "ID prefix length total ordering"
I think this may explain it a wee bit, let me know http://metaquestions.me/2014/08/01/shortest-distance-between-two-points-is-not-always-a-straight-line/
Basically each hop if it were only one bit at a time in a fully populated network (extreme) then would have twice the knowledge of the previous hop. As you converge the knowledge is greater until you get to the closest nodes whose knowledge is ultimate in the network.