How to guarantee that all nodes get infected in gossip-based protocols? - distributed-computing

In gossip-based protocols, how do we guarantee that all nodes get infected by the message?
If we selected a random number of nodes and send a message to these nodes, and these nodes did the same, there is a probability that some node will not receive the message.
Although I couldn't calculate this probability, it seems small. However, if the system runs for a long time, at some point one node will be unlucky and will be left out.

It's a bit hard to answer, for two reasons:
There isn't really a single gossip-based protocol; at most, there are families of gossip-based algorithms.
These algorithms guarantee infection only under specific assumptions. E.g., if, as you put it, while "the system is running for a long time" any given link eventually fails permanently under some exponential process (a very likely scenario), then with probability 1 some node will become completely isolated, and no protocol can overcome that.
However, IIUC, you're asking about a protocol with the following assumptions:
For any group V' ⊂ V of nodes, there is an active link u ∈ V' → v ∈ V ∖ V'.
Each node chooses uniformly d of its neighbors at each step, irrespective of their state, choices made by other nodes, total update state, etc.
Under these conditions, the problem you raised will have probability 0.
You can think about the infection as a Markov chain where the system is at state i if i nodes are infected. Suppose some change originated at some s ∈ V, so the system starts at state 1; now consider any state i < n.
By property 1., there is a link from the i infected nodes to one of the n - i others.
By property 2., the probability of selecting such a link in a single step is at least 1/n: the node whose link crosses the cut has at most n neighbors, at least one of which lies across the cut, so even an entirely stateless and uninformed selection picks that neighbor with probability at least 1/n (choosing d neighbors per step only improves this).
Therefore, the probability that no crossing link is selected for j consecutive steps is at most (1 - 1/n)^j. Using the union bound, the probability that the infection stalls this way at any state i is at most n(1 - 1/n)^j. Take j = n^2, and this becomes roughly n e^(-n); take j = n^3, and it becomes roughly n e^(-n^2); etc.
(Of course, gossip algorithm infection happens much sooner; this is an upper bound for the worst-possible conditions.)
So, if the system runs long enough, the probability that some node does not become infected decreases to 0 (very quickly). For Anti-Entropy Gossip Protocols, this is enough. For some other protocols, as you suspected, there is a chance that some node will be missed for some particular update.
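As a quick numeric sanity check of that bound (a sketch added here, not part of the original answer; missProbabilityBound is just an illustrative name):

// Sketch: evaluate the union bound n * (1 - 1/n)^j derived above.
function missProbabilityBound(n, j) {
  return n * Math.pow(1 - 1 / n, j);
}
console.log(missProbabilityBound(100, 100 * 100)); // ~2e-42, roughly n * e^(-n)
console.log(missProbabilityBound(100, 100));       // j = n is far too few steps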

We can't provide an answer because the problem is not fully specified (hence the question is ambiguous):
The topology of the network is unknown, but the answer depends on it.
What's the stop condition of the algorithm? Does it stop or not?
Suppose that a given node is connected to all the other nodes (that's the topology) and each node performs the same action when it receives a message.
You could simplify your problem into smaller sub-problems (the divide et impera approach): imagine that any node performs just one attempt (i.e. i = 1).
Since any node picks the receiver completely at random, and since this operation is repeated indefinitely, eventually all the nodes will receive the message. How many iterations are required to reach a given confidence (the ratio of nodes that received the message to the total number of nodes) is up to you; a sketch of this process is given below.
Once you get this, extending it to repeated attempts i is straightforward.
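Here is a rough sketch of that one-attempt-per-round process (my illustration, not from the answer; roundsToFullCoverage is a made-up name):

// Sketch: every informed node sends to one uniformly random node per round;
// count the rounds until all nodes have received the message.
function roundsToFullCoverage(nodes) {
  var reached = {};
  reached[0] = 1;                    // node 0 starts with the message
  var count = 1, rounds = 0;
  while (count < nodes) {
    var senders = Object.keys(reached).length; // nodes informed before this round
    for (var i = 0; i < senders; i++) {
      var target = Math.floor(Math.random() * nodes);
      if (!reached[target]) { reached[target] = 1; count++; }
    }
    rounds++;
  }
  return rounds;
}
console.log(roundsToFullCoverage(1024)); // typically on the order of log2(n) + ln(n)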

I made a little simulation of what you're trying to do. http://jsfiddle.net/ut78sega/
function gossip(nodes, tries, startNode, reached) {
  // Depth-first simulation: each stack entry is a (node, ttl) pair, where
  // ttl is how many random sends that node still performs.
  var stack = [startNode, tries];
  while (stack.length > 0) {
    var ttl = stack.pop();
    var n = stack.pop();
    reached[n] = 1;                 // mark this node as infected
    if (ttl <= 0) { continue; }
    for (var i = 0; i < ttl; i++) {
      // each recipient gets one fewer try than its sender
      stack.push(Math.floor(Math.random() * nodes), ttl - 1);
    }
  }
  return reached;
}
nodes - the total number of nodes
tries - the starting amount of random selections
startNode - the node that gets the first message
reached - a hash set of nodes that were reached by the current simulation
At each level of the recursion the number of tries is decreased by one. It takes ~9 starting tries to get 100% coverage of 65536 (2^16) nodes.
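For example, one run over 2^16 nodes (my usage sketch for the function above):

// run the simulation once and count how many nodes were reached
var reached = gossip(65536, 9, 0, {});
console.log(Object.keys(reached).length); // close to 65536 when starting with 9 tries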

Related

Capacity Provisioning for Server Farms Markov Chain Queues

Suppose we are using an M/M/N model: how many servers would we need to keep the probability that an arriving job has to wait below 0.2, given that jobs arrive at a rate of 400/second and the processing times are exponentially distributed with a mean of 1 second?
So I used Erlang's C-formula:
P[queueing] = (1/c!) * (lambda/mu)^c * (1/(1 - rho)) * pi_0
and got an answer of 4 servers. However, when I used the stability condition they showed in the textbook:
rho = lambda/(k * mu) < 1 => k > lambda/mu
I get k = 400 servers, and I'm not sure which is right.
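For reference, here is a sketch (mine, not from the question) that evaluates the Erlang C waiting probability numerically; erlangC is an illustrative name, and note the formula only applies when c > lambda/mu = 400, so that rho < 1:

// Sketch: Erlang C probability that an arriving job waits, for c servers
// and offered load a = lambda/mu. Valid only when rho = a / c < 1.
function erlangC(c, a) {
  var term = 1, sum = 1;            // k = 0 term of the sum is 1
  for (var k = 1; k <= c; k++) {
    term *= a / k;                  // term = a^k / k!
    if (k < c) sum += term;         // sum over k = 0 .. c-1
  }
  var last = term / (1 - a / c);    // (a^c / c!) * 1 / (1 - rho)
  return last / (sum + last);
}
console.log(erlangC(410, 400)); // waiting probability with 410 servers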

How can a Neural Network learn from testing outputs against external conditions which it cannot directly control

In order to simplify the question and hopefully the answer I will provide a somewhat simplified version of what I am trying to do.
Setting up fixed conditions:
Max Oxygen volume permitted in room = 100,000 units
Target Oxygen volume to maintain in room = 100,000 units
Maximum Air processing cycles per second = 3.0 (min is 0.3)
Energy (watts) used per second is given by this formula: (100 W * cycles_per_second)^2
Maximum Oxygen Added to Air per "cycle" = 100 units (minimum 0 units)
1 person consumes 10 units of O2 per second
Max occupancy of room is 100 person (1 person is min)
inputs are processed every cycle and outputs can be changed each cycle - however if an output is fed back in as an input it could only affect the next cycle.
Let's say I have these inputs:
A. current oxygen in room (range: 0 to 1000 units for simplicity - could be normalized)
B. current occupancy in room (0 to 100 people at max capacity) OR/AND could be changed to total O2 used by all people in room per second (0 to 1000 units per second)
C. current cycles per second of air processing (0.3 to 3.0 cycles per second)
D. Current energy used (which is the above current cycles per second * 100 and then squared)
E. Current Oxygen added to air per cycle (0 to 100 units)
(possible outputs fed back in as inputs?):
F. previous change to cycles per second (+ or - 0.0 to 0.1 cycles per second)
G. previous cycles O2 units added per cycle (from 0 to 100 units per cycle)
H. previous change to current occupancy maximum (0 to 100 persons)
Here are the actions (outputs) my program can take:
Change cycles per second by increment/decrement of (0.0 to 0.1 cycles per second)
Change O2 units added per cycle (from 0 to 100 units per cycle)
Change current occupancy maximum (0 to 100 persons) - (basically allowing for forced occupancy reduction and then allowing it to normalize back to maximum)
The GOALS of the program are to maintain homeostasis:
keep the O2 in the room as close to 100,000 units as possible
never allow the room to drop to 0 units of O2
allow for a current occupancy of up to 100 people per room for as long as possible without forcibly removing people (as O2 in the room is depleted over time and nears 0 units, people should be removed from the room down to the minimum, and then the maximum should be allowed to recover back up to 100 as more and more O2 is added back to the room)
and ideally use the minimum energy (watts) needed to maintain the above conditions. For instance, if the room was down to 90,000 units of O2 and there are currently 10 people in the room (using 100 units of O2 per second), then instead of running at 3.0 cycles per second (90 kW) and replenishing 300 units per second (a surplus of 200 units over the 100 being consumed) for 50 seconds to cover the deficit of 10,000 units, at a total cost of 4,500 kJ, it would be more ideal to run at, say, 2.0 cycles per second (40 kW), which produces 200 units per second (a surplus of 100 units over those consumed) for 100 seconds to cover the same 10,000-unit deficit at a total cost of only 4,000 kJ. (A quick numeric check of this trade-off is sketched after the note below.)
NOTE: occupancy may fluctuate from second to second based on external factors that cannot be controlled (let's say people come and go from the room at liberty). The only control the system has is to forcibly remove people from the room and/or prevent new people from coming in by changing the max capacity permitted at the next cycle (let's just say the system can do this). We don't want the system to impose a permanent reduction in capacity just because, running at full power, it can only output enough O2 per second for 30 people. We have a large volume of available O2, and it would take a while before that was depleted to dangerous levels, requiring the system to forcibly reduce capacity.
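(The sketch referenced in the goals above; energyToReplenish is a made-up helper using only the question's numbers.)

// Sketch: total energy (in joules, i.e. watt-seconds) to refill a given O2
// deficit at a fixed cycle rate, with consumption of 100 units/s (10 people).
// Power model from the question: watts = (100 * cyclesPerSecond)^2.
function energyToReplenish(cyclesPerSecond, deficit, consumptionPerSecond) {
  var addedPerSecond = cyclesPerSecond * 100;          // 100 units per cycle
  var surplus = addedPerSecond - consumptionPerSecond; // net refill rate
  var seconds = deficit / surplus;
  var watts = Math.pow(100 * cyclesPerSecond, 2);
  return watts * seconds;
}
console.log(energyToReplenish(3.0, 10000, 100)); // 4,500,000 J = 4,500 kJ
console.log(energyToReplenish(2.0, 10000, 100)); // 4,000,000 J = 4,000 kJ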
My question:
Can someone explain to me how I might configure this neural network so it can learn from each action (cycle) it takes by monitoring for the desired results? My challenge here is that most articles I find on the topic assume that you know the correct output answer (i.e., if inputs A, B, C, D, E all have specific values, then Output 1 should be to increase by 0.1 cycles per second).
But what I want is to meet the conditions I laid out in the GOALS above. So each time the program does a cycle, let's say it decides to try increasing the cycles per second, and the result is that available O2 is either declining by a smaller amount than in the previous cycle or is now increasing back towards 100,000; then that output could be considered more correct than reducing or maintaining the current cycles per second. I am simplifying here, since multiple variables combine to create the "ideal" outcome, but I think I've made the point of what I am after.
Code:
For this test exercise I am using a Swift library called Swift-AI (specifically its NeuralNet module: https://github.com/Swift-AI/NeuralNet).
So if you want to tailor your response to that library, that would be helpful but not required. I am more just looking for the logic of how to set up the network and then configure it to do initial and iterative re-training of itself based on the conditions I listed above. I would assume that at some point, after enough cycles and different conditions, it would have the appropriate weightings set up to handle any future condition, and re-training would become less and less impactful.
This is a control problem, not a prediction problem, so you cannot just use a supervised learning algorithm. (As you noticed, you have no target values for learning directly via backpropagation.) You can still use a neural network (if you really insist); have a look at reinforcement learning. But if you already know what happens to the oxygen level when you take an action like forcing people out, why would you learn such simple facts through millions of trial-and-error evaluations instead of encoding them into a model?
I suggest looking at model predictive control. If nothing else, you should study how the problem is framed there. Or maybe even just plain old PID control. It seems really easy to build a good dynamical model of this process with a few state variables.
You may have a few unknown parameters in that model that you need to learn "online", but a simple PID controller can already tolerate and compensate for some amount of uncertainty. And it is much easier to fine-tune a few parameters than to learn the general cause-effect structure from scratch. That can be done, but it involves trying all possible actions: for all your algorithm knows, the best action might be to reduce the number of oxygen consumers to zero permanently by killing them, and then collect a huge reward for maintaining the oxygen level with little energy. When the algorithm knows nothing about the problem, it has to try everything out to discover the effects.
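To make the PID suggestion concrete, here is a minimal sketch under the question's simplified dynamics (makePidController and the gains are illustrative placeholders, not tuned values):

// Minimal PID sketch: nudge cycles-per-second toward the 100,000-unit O2
// target, respecting the question's limits (rate in [0.3, 3.0], step <= 0.1).
function makePidController(kp, ki, kd, target) {
  var integral = 0, prevError = 0;
  return function (currentO2, dt) {
    var error = target - currentO2;
    integral += error * dt;
    var derivative = (error - prevError) / dt;
    prevError = error;
    var change = kp * error + ki * integral + kd * derivative;
    return Math.max(-0.1, Math.min(0.1, change)); // actuator step limit
  };
}
// usage: once per second, adjust and clamp the cycle rate
var pid = makePidController(0.0005, 0.00001, 0.001, 100000);
var cycles = 1.0;
cycles = Math.min(3.0, Math.max(0.3, cycles + pid(90000, 1)));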

Why does a heartbeat take O(log N) time to propagate

I was reading about gossip style failure detection.
In the notes I was reading, it's stated that a single heartbeat takes O(log N) time to propagate, but this statement is not explained.
Any idea why this is?
Because the most effective way of propagating in such a case is to use a binary tree structure (or any k-ary tree): the first node sends the message to its children, they send it to their children, and so on. A binary tree has height log n, and every level in the tree represents one stage of propagating the message, so the overall time is O(log n).
You start by sending the message to k nodes. Each of them sends the message to k more nodes and collects back their responses. Each hop multiplies the number of nodes that have received the message by k. All the nodes have received the message when k^t >= N. The clock time it takes for this to happen is proportional to t, the number of hops.
k^t = N => t = log_k(N)
We know that the clock time is proportional to t, so it must be proportional to log_k(N).
I'm not familiar with gossip in particular but this answer applies to most broadcast messages on most cluster fabrics.
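A tiny numeric illustration of the k^t >= N argument (my sketch; hopsToCover is a made-up name):

// Sketch: hops until ideal k-ary fan-out covers N nodes (no duplicates),
// i.e. the smallest t with k^t >= N, which is ceil(log_k(N)).
function hopsToCover(N, k) {
  var covered = 1, hops = 0;
  while (covered < N) {
    covered *= k; // each hop multiplies the reached set by k
    hops++;
  }
  return hops;
}
console.log(hopsToCover(65536, 2)); // 16 = log2(65536)
console.log(hopsToCover(65536, 4)); // 8 = log4(65536)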

Kademlia XOR metric properties purposes

In the Kademlia paper by Petar Maymounkov and David Mazières, it is said that the XOR distance is a valid non-Euclidean metric, with limited explanation as to why each of the properties of a valid metric is necessary or interesting, namely:
d(x,x) = 0
d(x,y) > 0, if x != y
forall x,y : d(x,y) = d(y,x) -- symmetry
d(x,z) <= d(x,y) + d(y,z) -- triangle inequality
Why is it important for a metric to have these properties in general? Why is each of these properties necessary in the context of routing queries in the Kademlia Distributed Hash Table implementation?
In addition, the paper mentions that unidirectionality (for a given x and a distance l, there exists only a single y for which d(x,y) = l) guarantees that all queries will converge along the same path. Why is that so?
I can only speak for Kademlia, maybe someone else can provide a more general answer. In the meantime...
d(x,x) = 0
d(x,y) > 0, if x != y
These two points together effectively mean that the closest point to x is x itself; every other point is further away. (This may seem intuitive, but other aspects of the XOR metric aren't.)
In the context of Kademlia, this is important since a lookup for node with ID x will yield that node as the closest. It would be awkward if that were not the case, since a search converging towards x might not find node x.
forall x,y : d(x,y) = d(y,x)
The structure of the Kademlia routing table is such that nodes maintain detailed knowledge of the address space closest to them, and exponentially decreasing knowledge of more distant address space. In short, a node tries to keep all the k closest contacts it hears about.
The symmetry is useful since it means that each of these closest contacts will be maintaining detailed knowledge of a similar part of the address space, rather than a remote part.
If we didn't have this property, distance might behave like the hands of a clock moving in one direction round a clockface: the node at 1 o'clock (Node1) would be close to Node2 at 2 o'clock (30°), but Node2 would be far from Node1 (330°). Now imagine we're looking for the two nodes closest to 3 o'clock (i.e. Node1 and Node2). If the search reaches Node2, it won't know about Node1, since from Node2's perspective it is far away. The whole lookup and topology would have to change.
d(x,z) <= d(x,y) + d(y,z)
If this weren't the case, it would be impossible for a node to know which contacts from its routing table to return during a lookup. It would know the k closest to the target, but there would be no guarantee that one of the other more distant contacts wouldn't yield a shorter overall path.
Because of this property and unidirectionality, different searches starting from vastly separated points will tend to converge down the same path.
The unidirectionality means that no two nodes can have the same distance from a given point. If that weren't the case, the target point could be encircled by a bunch of nodes all at the same distance from it, and different searches would be free to pick any of those to pass through. Unidirectionality, however, guarantees that exactly one of this bunch is the closest, so any search that chooses among this group will always select the same one.
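A small sketch of this behavior (mine, with made-up 8-bit IDs instead of full-size node IDs): sorting candidates by XOR distance to a target always produces the same strict ranking, which is why different lookups converge.

// Sketch: XOR distance read as an integer. For a fixed target, each distance
// l corresponds to exactly one ID (y = target ^ l), so there are never ties.
function xorDistance(a, b) { return a ^ b; }
var ids = [0x13, 0x5a, 0x9c, 0xe1];
var target = 0x57;
ids.sort(function (a, b) { return xorDistance(a, target) - xorDistance(b, target); });
console.log(ids.map(function (n) { return n.toString(16); })); // ["5a","13","e1","9c"]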
I've been bashing my head on this for quite some time: how can the XOR - as in the number of differing bits, a proper Hamming distance - be the basis of a total order?
Well, it can't: such a metric on its own is not enough for a comparison relationship; all it can do is dump nodes in circles around a point.
Then I read the paper more closely and noticed that it says "the XOR as an integer value", and it dawned on me: the crux is not the "XOR metric" itself, but the length of the common prefix of the IDs (of which XOR is a derivation mechanism).
Take two nodes with the same Hamming distance from "self" and compare the lengths of their prefixes in common with "self": the one with the shorter common prefix is the further node.
The paper uses "XOR distance metric", but it really should read "ID prefix length total ordering".
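To illustrate the distinction (my sketch, with made-up 8-bit IDs): Hamming distance can tie, while XOR read as an integer cannot.

// Sketch: two IDs at the same Hamming distance from self, but strictly
// ordered by XOR-as-integer (the one sharing a longer prefix is closer).
function hamming(a, b) {
  var x = a ^ b, bits = 0;
  while (x) { bits += x & 1; x >>= 1; }
  return bits;
}
var self = 0x00, a = 0x81, b = 0x03; // 10000001 vs 00000011
console.log(hamming(self, a), hamming(self, b)); // 2 2 -> a tie
console.log((self ^ a) > (self ^ b)); // true: a is further as an integer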
I think this may explain it a wee bit, let me know http://metaquestions.me/2014/08/01/shortest-distance-between-two-points-is-not-always-a-straight-line/
Basically, each hop (in the extreme case of a fully populated network where each hop fixes only one bit at a time) has twice the knowledge of the previous hop. As you converge, the knowledge grows, until you reach the closest nodes, whose knowledge of that region is the best in the network.

matlab minimum spanning tree stays busy

I use the grMinSpanTree function from a MATLAB toolbox, but when the number of nodes is high the execution never finishes; it stays busy forever.
I tried a lot of samples, and they all work well when the number of nodes is below 4000. But when I try one with 8000 nodes, it runs for several hours with no result.
I am only a beginner in graph theory and MATLAB. Is there any reason that may cause an infinite loop?
If E is the number of edges and V the number of vertices, this greedy algorithm runs in O(E * V).
Therefore, the running time grows rapidly as E and V increase (doubling both E and V quadruples E * V). There is no infinite loop; the computation is simply slow.
In addition, the memory needed also increases, which may force your computer to swap, dramatically increasing the overall time.
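A back-of-envelope sketch of that growth (my assumption: a dense graph where E grows like V^2, plugged into the O(E * V) estimate above):

// Sketch: relative cost under the O(E * V) model with E ~ V^2 / 2.
function relativeCost(v) {
  var e = v * v / 2; // dense-graph assumption
  return e * v;
}
console.log(relativeCost(8000) / relativeCost(4000)); // 8x the work of the 4000-node case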