Max number of edges in a directed graph?

I am a bit confused about the maximum number of edges in a directed graph with N nodes.
Various sources say it's N*(N-1), the argument being that from every node we can connect to each of the (N-1) remaining nodes, and hence the total maximum number of edges is
N * (N-1)
But in a directed graph we are allowed to move in only one direction between a pair of nodes. So if the first node has N-1 options to move to, then the second one would have one fewer, and so on.
What am I missing here?

There is an implied but unstated assumption about which sets of edges are allowed.
Two nodes, A and B, might be connected by no edge, an edge from A to B, or an edge from B to A. If those are the only possibilities, then the greatest possible number of edges is N(N-1)/2.
If it is also permitted to have both an edge from A to B and an edge from B to A, then the greatest possible number of edges is N(N-1).
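As a quick sanity check (my own example, not part of the answer), enumerate both cases for N = 3: ordered pairs correspond to the both-directions-allowed case, unordered pairs to the at-most-one-direction case.
from itertools import combinations, permutations

nodes = ["A", "B", "C"]  # N = 3
# Both directions allowed: every ordered pair is a possible edge.
print(len(list(permutations(nodes, 2))))  # 6 == N*(N-1)
# At most one direction per pair: one edge per unordered pair.
print(len(list(combinations(nodes, 2))))  # 3 == N*(N-1)/2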

Related

Measuring the "remoteness" of a node in a graph

I mapped out all the edges of a graph in the ancient BBS game 'TradeWars 2002'. There are 1000 nodes. The graph is officially a directed graph, although most edges between nodes are undirected. The graph is strongly connected.
I modelled the universe in networkx. I'd like to use networkx methods to identify the "most remote" nodes in the network. I don't know how to articulate "most remote" in graph theory terminology, though. But the idea I have is nodes that would be bumped into very rarely when someone is transiting between two other arbitrary nodes, and that at the edge of the well-connected nodes there might be a string of nodes extending out along a single path that terminates.
A visualization of what I imagine is node 733. It's pretty unlikely someone accidentally stumbles onto that one, compared to other better-connected nodes.
What could I use from networkx library to quantify some measure of 'remoteness'?
This is the entire universe: [graph visualization]
But the idea I have is nodes that would be bumped into very rarely when someone is transiting between two other arbitrary nodes.
As @Joel mentioned, there are many centrality measures available, and there are often strong correlations between them, such that many of them will probably give you more or less what you want.
That being said, I think the class of centrality measures that most closely reflects your intuition is based on random walks. Some of these are pretty costly to compute (although see this paper for some recent improvements on that front), but luckily there is a strong correspondence between eigenvector centrality and the frequency with which nodes are visited by a random walker.
The implementation in networkx is available via networkx.algorithms.centrality.eigenvector_centrality.
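For example (a minimal sketch on a random stand-in graph, since the TradeWars edge list isn't shown here; the size, seed and max_iter values are arbitrary):
import networkx as nx

G = nx.gnp_random_graph(1000, 0.01, seed=0)  # stand-in for the game universe
centrality = nx.eigenvector_centrality(G, max_iter=1000)
# The "most remote" nodes are those a random walker visits least often,
# i.e. the ones with the lowest eigenvector centrality:
print(sorted(centrality, key=centrality.get)[:5])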
networkx has a collection of algorithms for this kind of problem: centrality. For example, you can use one of the simplest, closeness_centrality:
import networkx as nx

# Create a random graph
G = nx.gnp_random_graph(50, 0.1)
nx.closeness_centrality(G)
{0: 0.3888888888888889,
1: 0.45794392523364486,
2: 0.35507246376811596,
3: 0.4375,
4: 0.4083333333333333,
5: 0.3684210526315789,
...
# Draw the graph
labels = {n: n for n in G.nodes}
nx.draw(G, with_labels=True, labels=labels)
And the most remote (the least central) nodes can be listed by returning the nodes with the lowest closeness_centrality (note the node IDs and the nodes in blue circles in the picture above):
c = nx.closeness_centrality(G)
sorted(c.items(), key=lambda x: x[1])[:5]
[(48, 0.28823529411764703),
(7, 0.33793103448275863),
(11, 0.35251798561151076),
(2, 0.35507246376811596),
(46, 0.362962962962963)]

Checking if a directed graph has only a single topological sort

I'm trying to write pseudo-code for an algorithm that is supposed to check whether a directed graph has only a single topological ordering. I've already come up with the pseudo-code for a topological sort (using DFS), but it does not seem to help me much. I wonder: if there are no sinks in the graph, then there isn't a single topological ordering (might that help?).
This is an improvement of this answer, as the best possible runtime is improved by starting at a vertex with no incoming edges.
If your directed graph has N vertices and exactly one starting vertex with indegree 0:
Do a DFS (from that starting vertex only) to get a topological sort L.
If L doesn't have N vertices, either some vertices are unreachable (the unreachable part contains a cycle) or you need another starting vertex (in which case there are multiple toposorts).
For i in [0, N-2]:
    If there is no edge from L[i] to L[i+1]:
        Return false.
Return true.
Alternatively, modify Kahn’s algorithm: if more than one vertex is in the set (of vertices of indegree 0) while the algorithm is running, return false. Otherwise, return true.
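A minimal sketch of that Kahn-based check (my own illustration, not code from the original answer; vertices are 0..N-1 and edges are given as (u, v) pairs):
from collections import deque

def has_unique_toposort(n, edges):
    # Build adjacency lists and indegrees for vertices 0..n-1.
    adj = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in range(n) if indeg[v] == 0)
    seen = 0
    while queue:
        if len(queue) > 1:
            return False  # two valid choices at this step -> multiple toposorts
        u = queue.popleft()
        seen += 1
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return seen == n  # False when a cycle prevents any topological order

print(has_unique_toposort(3, [(0, 1), (1, 2)]))  # True: 0 -> 1 -> 2 is forced
print(has_unique_toposort(3, [(0, 1)]))          # False: vertex 2 can go anywhere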

How to guarantee that all nodes get infected in gossip-based protocols?

In gossip-based protocols, how do we guarantee that all nodes get infected by the message?
If we select a random set of nodes and send a message to them, and those nodes do the same, there is a probability that some node will not receive the message.
Although I couldn't calculate it, it seems small. However, if the system runs for a long time, at some point one node will be unlucky and be left out.
It's a bit hard to answer, due to two reasons:
There isn't really a single gossip-based protocol. At most, there are families of gossip-based algorithms.
The algorithms actually guarantee infection only under specific assumptions. E.g., if, as you put it, "the system is running for a long time" and any given link fails permanently under some exponential process (a very likely scenario), then with probability 1 some node will eventually be completely isolated, and no protocol can overcome that.
However, IIUC, you're asking about a protocol with the following assumptions:
For any nonempty proper subset V' ⊂ V of nodes, there is an active link from some u ∈ V' to some v ∈ V ∖ V'.
Each node chooses d of its neighbors uniformly at random at each step, irrespective of their state, choices made by other nodes, total update state, etc.
Under these conditions, the problem you raised will have probability 0.
You can think about the infection as a Markov chain where the system is at state i if i nodes are infected. Suppose some change originated at some s ∈ V, so the system starts at state 1; now consider any state i < n.
By property 1, there is a link from the i infected nodes to one of the n - i others.
By property 2, the probability of selecting this link at a given step is at least 1/n. This is because the node whose link happens to cross the cut has at most n neighbors, but at least one neighbor across the cut. Even if its selection is entirely stateless and uninformed, that is the chance that it will choose this neighbor.
Therefore, the probability that this does not happen for j consecutive steps is at most (1 - 1/n)^j. Using the union bound over all states, the probability that some state i stalls for j steps is at most n(1 - 1/n)^j. Take j = n^2, and this becomes roughly n·e^(-n); take j = n^3, and it becomes roughly n·e^(-n^2). Etc.
(Of course, gossip algorithm infection happens much sooner; this is an upper bound for the worst-possible conditions.)
So, if the system runs long enough, the probability that some node does not become infected decreases to 0 (very quickly). For Anti-Entropy Gossip Protocols, this is enough. For some other protocols, as you suspected, there is a chance that some node will be missed for some update.
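To get a feel for how quickly that bound shrinks, here is a quick computation (my own aside; n = 100 is arbitrary, and the bound is evaluated in log space to avoid underflow):
import math

n = 100
for j in (n**2, n**3):
    # log of the union bound n * (1 - 1/n)^j
    log_bound = math.log(n) + j * math.log(1 - 1 / n)
    print(f"j = {j}: bound = exp({log_bound:.0f})")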
We can't provide a definitive answer because the problem is underspecified (hence the question is ambiguous):
The topology of the network is unknown, but the answer depends on it.
What's the stop condition of the algorithm? Does it stop or not?
Suppose that a given node is connected to all the other nodes (that's the topology) and that each node performs the same action when it receives a message.
You could simplify your problem into smaller sub-problems (that's the divide-et-impera approach): imagine that each node performs just one attempt (i.e. i = 1).
Since any node picks the receiver completely at random, and since this operation is repeated indefinitely, eventually all the nodes will receive the message. How many iterations are required to reach a given confidence (the ratio of nodes which received the message to the total number of nodes) is up to you.
Once you have that, including the repeated attempts i is straightforward.
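For the i = 1 sub-problem, the expected number of uniform random sends before every node has received the message is the classic coupon-collector quantity, roughly n·ln(n); a quick check (my own aside, with n = 2^16 to match the simulation below):
import math

n = 65536  # 2^16 nodes, as in the simulation below
harmonic = sum(1.0 / k for k in range(1, n + 1))
print(n * harmonic)     # exact coupon-collector expectation, n * H_n
print(n * math.log(n))  # its ~ n ln n approximation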
I made a little simulation of what you're trying to do. http://jsfiddle.net/ut78sega/
function gossip(nodes, tries, startNode, reached) {
    // Explicit stack of (node, ttl) pairs instead of recursion.
    var stack = [startNode, tries];
    while (stack.length > 0) {
        var ttl = stack.pop();
        var n = stack.pop();
        reached[n] = 1;              // mark this node as infected
        if (ttl <= 0) { continue; }
        for (var i = 0; i < ttl; i++) {
            // Forward to a random node with a decremented ttl.
            stack.push(Math.floor(Math.random() * nodes), ttl - 1);
        }
    }
    return reached;
}
nodes - total number of nodes
tries - the starting number of random selections
startNode - the node that gets the first message
reached - a hash set of nodes that were reached by the current simulation
At each level of the recursion the number of tries is decreased by one. It takes ~9 tries to get 100% coverage of 65536 (2^16) nodes.

Kademlia XOR metric properties purposes

In the Kademlia paper by Petar Maymounkov and David Mazières, it is said that the XOR distance is a valid non-Euclidean metric, with limited explanation as to why each of the properties of a valid metric is necessary or interesting, namely:
d(x,x) = 0
d(x,y) > 0, if x != y
forall x,y : d(x,y) = d(y,x) -- symmetry
d(x,z) <= d(x,y) + d(y,z) -- triangle inequality
Why is it important for a metric to have these properties in general? Why is each of these properties necessary in the context of routing queries in the Kademlia Distributed Hash Table implementation?
In addition, the paper mentions that unidirectionality (for a given x and a distance l, there exists only a single y for which d(x,y) = l) guarantees that all queries will converge along the same path. Why is that so?
I can only speak for Kademlia, maybe someone else can provide a more general answer. In the meantime...
d(x,x) = 0
d(x,y) > 0, if x != y
These two points together effectively mean that the closest point to x is x itself; every other point is further away. (This may seem intuitive, but other aspects of the XOR metric aren't.)
In the context of Kademlia, this is important since a lookup for node with ID x will yield that node as the closest. It would be awkward if that were not the case, since a search converging towards x might not find node x.
forall x,y : d(x,y) = d(y,x)
The structure of the Kademlia routing table is such that nodes maintain detailed knowledge of the address space closest to them, and exponentially decreasing knowledge of more distant address space. In short, a node tries to keep all the k closest contacts it hears about.
The symmetry is useful since it means that each of these closest contacts will be maintaining detailed knowledge of a similar part of the address space, rather than a remote part.
If we didn't have this property, it might be helpful to think of the search as more like the hands of a clock moving in one direction round a clockface. The node at 1 o'clock (Node1) is close to Node2 at 2 o'clock (30°), but Node2 is far from Node1 (330°). So imagine we're looking for the two nodes closest to 3 o'clock (i.e. Node1 and Node2). If the search reaches Node2, it won't know about Node1, since from Node2's perspective Node1 is far away. The whole lookup and topology would have to change.
d(x,z) <= d(x,y) + d(y,z)
If this weren't the case, it would be impossible for a node to know which contacts from its routing table to return during a lookup. It would know the k closest to the target, but there would be no guarantee that one of the other more distant contacts wouldn't yield a shorter overall path.
Because of this property and unidirectionality, different searches starting from vastly separated points will tend to converge down the same path.
The unidirectionality means that no two nodes can have the same distance from a given point. If that weren't the case, then the target point could be encircled by a bunch of nodes all at the same distance from it, and various different searches would be free to pick any of those to pass through. However, unidirectionality guarantees that exactly one of this bunch will be the closest, and any search which chooses between this group will always select the same one.
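As a quick sanity check of all four properties plus unidirectionality (my own sketch over a toy 4-bit ID space, not code from the paper):
from itertools import permutations, product

def d(x, y):
    return x ^ y  # the XOR of two IDs, taken as an integer

ids = range(16)  # toy 4-bit ID space
assert all(d(x, x) == 0 for x in ids)
assert all(d(x, y) > 0 for x, y in permutations(ids, 2))
assert all(d(x, y) == d(y, x) for x, y in product(ids, repeat=2))
assert all(d(x, z) <= d(x, y) + d(y, z) for x, y, z in product(ids, repeat=3))
# Unidirectionality: for a fixed x, every distance is achieved by exactly one y
# (namely y = x ^ l), so y -> d(x, y) is a bijection.
assert all(len({d(x, y) for y in ids}) == len(ids) for x in ids)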
I've been bashing my head on this for quite some time: how can the XOR - as in the number of differing bits, a proper Hamming distance - be the basis of a total order?
Well, it can't: such a metric on its own is not enough for a comparison relationship; all it can do is dump nodes into circles around a point.
Then I read the paper more closely and noticed that it says "the XOR as an integer value", and it dawned on me: the crux is not the "XOR metric" but the length of the common prefix of the IDs (of which XOR is a derivation mechanism).
Take two nodes with the same Hamming distance from "self" and compare the lengths of their prefixes common to "self": the one with the shorter common prefix is the further node.
The paper uses "XOR distance metric", but it really should read "ID prefix length total ordering".
I think this may explain it a wee bit, let me know: http://metaquestions.me/2014/08/01/shortest-distance-between-two-points-is-not-always-a-straight-line/
Basically, if each hop fixed only one bit at a time in a fully populated network (the extreme case), then each hop would have twice the knowledge of the previous hop. As you converge, the knowledge grows until you reach the closest nodes, whose knowledge of that region is the most complete in the network.

Facebook Programming Challenge - ByteLand

ByteLand 
Byteland consists of N cities numbered 1..N. There are M roads connecting some pairs of cities. There are two army divisions, A and B, which protect the kingdom. Each city is either protected by army division A or by army division B.
 
You are the ruler of an enemy kingdom and have devised a plan to destroy Byteland. Your plan is to destroy all the roads in Byteland, disrupting all communication. If you attack any road, the armies from both of the cities that the road connects come to its defense. You realize that your attack will fail if there are soldiers from both armies A and B defending any road.
 
So you decide that before carrying out this plan, you will attack some of the cities and defeat the armies located in them to make your plan possible. However, this is considerably more difficult. You have estimated that defeating the army located in city i will take c_i amount of resources. Your aim now is to decide which cities to attack so that your cost is minimal and no road is protected by both armies A and B.
 
----Please tell me if this approach is correct----
We need to sort the cities by the resources required to destroy them. For each city we need to ask the following questions:
1) Did the deletion of the previous city NOT result in a state which can destroy Byteland?
2) Does it lie on any road?
3) Does it lie on any road whose other city is protected by the other army?
If all of these conditions are true, we'll proceed towards destroying the city, record the total cost incurred so far, and also determine whether destruction of this city will lead to the overall destruction of Byteland.
Since the cities are arranged in increasing order of cost, we can stop wherever we find the desired set of deletions.
You only need to care about roads that link two cities with different armies - links between A and B or between B and A - so let's delete all links from A to A or B to B.
You want to find a set of vertices such that each remaining link has at least one endpoint in the set, i.e. a minimum-weight vertex cover. On an arbitrary graph this would be NP-hard. However, your graph only ever has nodes of type A linked to nodes of type B, or the reverse - it is a bipartite graph with these two types of nodes as the two parts. So you can find a minimum-weight vertex cover by using an algorithm for finding minimum-weight vertex covers on bipartite graphs. Searching for this, I find e.g. http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-854j-advanced-algorithms-fall-2008/assignments/sol5.pdf
mcdowella,
But the vertices have costs, and a plain (unweighted) minimum vertex cover would not produce the right vertices to remove. Imagine 2 vertices (army A) pointing to a third one (army B). The first two vertices cost 1 each, whereas the third one costs 5. A minimum vertex cover would return the third one - but removing the third one costs more than removing both nodes with cost 1 + 1.
We would probably need some modified version of minimum vertex cover here.
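That weighted case is exactly what the minimum-weight variant handles, and on a bipartite graph it reduces to a minimum s-t cut. Here is a minimal sketch using networkx (my own illustration; the function name and graph encoding are made up for this example):
import networkx as nx

def min_weight_cover(a_costs, b_costs, roads):
    # Reduce min-weight bipartite vertex cover to a min s-t cut:
    # s -> each A-city with capacity = its cost,
    # each B-city -> t with capacity = its cost,
    # A-B roads get infinite capacity so the cut must pay for cities, not roads.
    G = nx.DiGraph()
    for a, cost in a_costs.items():
        G.add_edge("s", ("A", a), capacity=cost)
    for b, cost in b_costs.items():
        G.add_edge(("B", b), "t", capacity=cost)
    for a, b in roads:
        G.add_edge(("A", a), ("B", b), capacity=float("inf"))
    cut_value, (s_side, t_side) = nx.minimum_cut(G, "s", "t")
    # An A-city is in the cover iff its s-edge is cut (it lands on the t side);
    # a B-city is in the cover iff its t-edge is cut (it lands on the s side).
    cover = [v for v in G if v not in ("s", "t")
             and ((v[0] == "A" and v in t_side) or (v[0] == "B" and v in s_side))]
    return cut_value, cover

# The 1 + 1 vs 5 example above: attacking both cheap A-cities wins.
print(min_weight_cover({1: 1, 2: 1}, {3: 5}, [(1, 3), (2, 3)]))
# -> (2, [('A', 1), ('A', 2)])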