Byzantine's General - distributed-computing

So I was reading Lamport's paper on Byzantine Generals in which he proves that for T malicious generals you need 2T+1 generals in a group to read a consensus. However I dont understand how. If there are T malicious nodes making up stuff, you just need T+1 votes to outvote them. Why is that not the case?

There is a section on Wikipedia about this:
One solution considers scenarios in which messages may be forged, but which will be Byzantine-fault-tolerant as long as the number of traitorous generals does not equal or exceed one third. The impossibility of dealing with one-third or more traitors ultimately reduces to proving that the 1 Commander + 2 Lieutenants problem cannot be solved, if the Commander is traitorous. The reason is, if we have three commanders, A, B, and C, and A is the traitor: when A tells B to attack and C to retreat, and B and C send messages to each other, forwarding A's message, neither B nor C can figure out who is the traitor, since it isn't necessarily A – the other commander could have forged the message purportedly from A. It can be shown that if n is the number of generals in total, and t is the number of traitors in that n, then there are solutions to the problem only when n is greater than or equal to 3t + 1

you just need T+1 votes to outvote them. Why is that not the case?
This makes sense if all loyal generals produce the same answer, but that's not the case for BGP systems, where each honest element can give you a different answer.
BGP is for systems where each element sees a different information. Example: redundant radars. It is not for systems where the elements are mirrored (ex. redundant HDs).
Example:
Generals: A, B, C;
Traitor: C;
A says "attack";
B says "retreat";
C says "attack" to A, and "retreat" to "B";
Result: A thinks it has reached agreement and it will attack alone;

Related

Legal Hierarchical Quorums in Zookeeper

I am trying to understand hierarchical quorums in Zookeeper. I may not understand the example shown in the documentation (here). Are votes [from at least two servers from each of two different groups] enough to form a legal quorum?
In my opinion, the example here does not gain the majority of all the weight; it only gains more than 4 ballots. A legal quorum should earn more than 5 ballots (9/2+1).
I also read the source code. The algorithm implementation is shown from line 352 to line 371. Zookeeper only checks if all groups have a majority and if the number of selected groups is larger than half of the group number.
Maybe I find the answer.
A different construction that uses weights and is useful in wide-area deployments (co-locations) is a hierarchical one. With this construction, we split the servers into disjoint groups and assign weights to processes. To form a quorum, we have to get a hold of enough servers from a majority of groups G, such that for each group g in G, the sum of votes from g is larger than half of the sum of weights in g. Interestingly, this construction enables smaller quorums. If we have, for example, 9 servers, we split them into 3 groups, and assign a weight of 1 to each server, then we are able to form quorums of size 4.
Note that two subsets of processes composed each of a majority of servers from each of a majority of groups necessarily have a non-empty intersection. It is reasonable to expect that a majority of co-locations will have a majority of servers available with high probability.

In paxos, what happens if a proposer is down after its proposal is rejected?

In this figure, the proposal of X is rejected.
At the end of the timeline, S1 and S2 accept X while S3, S4 and S5 accept Y. Proposer X is now supposed to re-send the proposal with value Y.
But what happens if proposer X gets down at that time? How does S1 and S2 eventually learn the value Y?
Thanks in advance!
It is a little hard to answer this from the fragment of a diagram that you've shared since it is not clear what exactly it means. It would be helpful if you could link to the source of that diagram so we can see more of the context of your question. The rest of this answer is based on a guess as to its meaning.
There are three distinct roles in Paxos, commonly known as proposer, acceptor and learner, and I think it aids understanding to divide things into these three roles. The diagram you've shared looks like it is illustrating a set of five acceptors and the messages that they have sent as part of the basic Synod algorithm (a.k.a. single-instance Paxos). In general there's no relationship between the sets of learners and acceptors in a system: there might be a single learner, or there might be thousands, and I think it helps to separate these concepts out. Since S1 and S2 are acceptors, not learners, it doesn't make sense to ask about them learning a value. It is, however, valid to ask about how to deal with a learner that didn't learn a value.
In practical systems there is usually also another role of leader which takes responsibility for pushing the system forward using timeouts and retries and fault detectors and so on, to ensure that all learners eventually learn the chosen value or die trying, but this is outside the scope of the basic algorithm that seems to be illustrated here. In other words, this algorithm guarantees safety ("nothing bad happens") but does not guarantee liveness ("something good happens"). It is acceptable here if some of the learners never learn the chosen value.
The leader can do various things to ensure that all learners eventually learn the chosen value. One of the simplest strategies is to get the learned value from any learner and broadcast it to the other learners, which is efficient and works as long as there is at least one running learner that's successfully learned the chosen value. If there is no such learner, the leader can trigger another round of the algorithm, which will normally result in the chosen value being learned. If it doesn't then its only option is to retry, and keep retrying until eventually one of these rounds succeeds.
In this figure, the proposal of X is rejected.
My reading of the diagram is that it is an ”accept request” that is rejected. Page 5 paragraph 1 of Paxos Made Simple describes this message type.
Proposer X is now supposed to re-send the proposal with value Y.
The diagram does not indicate that. Only if Y was seen in response to the blue initial propose messages would the blue proposer have to choose Y. Yet the blue proposer chose X as the value in its ”accept request”. If it is properly following Paxos it could not have ”seen Y” in response to it's initial proposal message. If it had seen it then it must have chosen it and so it wouldn’t have sent X.
In order to really know what is happening you would need to know what responses were seen by each proposer. We cannot see from the diagram what values, if any, were returned in response to the first three blue propose messages. We don’t see in the diagram whether X was previously accepted at any node or whether it was not. We don't know if the blue proposer was ”free to choose” it's own X or had to use an X that was already accepted at one or more nodes.
But what happens if proposer X gets down at that time?
If the blue proposer dies then this is not a problem. The green proposer has successfully fixed the value Y at a majority of the nodes.
How does S1 and S2 eventually learn the value Y?
The more interesting scenario is what happens if the green proposer dies. The green proposer may have sent it's accept request messages containing Y and immediately died. As three of the messages are successful the value Y has been fixed yet the original proposer may not be alive to see the accept response messages. For any further progress to be made a new proposer needs to send a new propose message. As three of the nodes will reply with Y the new proposer will chose Y as the value of it's accept request message. This will be sent to all nodes and if all messages get through, and no other proposer interrupts, then S1 and S2 will become consistent.
The essence of the algorithm is collaboration. If a proposer dies the next proposer will collaborate and chose the highest value previously proposed if any exists.

Kademlia XOR metric properties purposes

In the Kademlia paper by Petar Maymounkov and David Mazières, it is said that the XOR distance is a valid non-Euclidian metric with limited explanations as to why each of the properties of a valid metric are necessary or interesting, namely:
d(x,x) = 0
d(x,y) > 0, if x != y
forall x,y : d(x,y) = d(y,x) -- symmetry
d(x,z) <= d(x,y) + d(y,z) -- triangle inequality
Why is it important for a metric to have these properties in general? Why is each of these properties necessary in the context of routing queries in the Kademlia Distributed Hash Table implementation?
In addition, the paper mentions that unidirectionality (for a given x, and a distance l, there exist only a single y for which d(x,y) = l) guarantees that all queries will converge along the same path. Why is that so?
I can only speak for Kademlia, maybe someone else can provide a more general answer. In the meantime...
d(x,x) = 0
d(x,y) > 0, if x != y
These two points together effectively mean that the closest point to x is x itself; every other point is further away. (This may seem intuitive, but other aspects of the XOR metric aren't.)
In the context of Kademlia, this is important since a lookup for node with ID x will yield that node as the closest. It would be awkward if that were not the case, since a search converging towards x might not find node x.
forall x,y : d(x,y) = d(y,x)
The structure of the Kademlia routing table is such that nodes maintain detailed knowledge of the address space closest to them, and exponentially decreasing knowledge of more distant address space. In short, a node tries to keep all the k closest contacts it hears about.
The symmetry is useful since it means that each of these closest contacts will be maintaining detailed knowledge of a similar part of the address space, rather than a remote part.
If we didn't have this property, it might be helpful to think of the search as more like the hands of a clock moving in one direction round a clockface. The node at 1 o'clock (Node1) is close to Node2 at 2 o'clock (30°), but Node2 is far from Node1 (330°). So imagine we're looking for the two closest to 3 o'clock (i.e. Node1 and Node2). If the search reaches Node2, it won't know about Node1 since it's far away. The whole lookup and topology would have to change.
d(x,z) <= d(x,y) + d(y,z)
If this weren't the case, it would be impossible for a node to know which contacts from its routing table to return during a lookup. It would know the k closest to the target, but there would be no guarantee that one of the other more distant contacts wouldn't yield a shorter overall path.
Because of this property and unidirectionality, different searches starting from vastly separated points will tend to converge down the same path.
The unidirectionality means that no two nodes can have the same distance from a given point. If that weren't the case, then the target point could be encircled by a bunch of nodes all the same distance from it. Then various different searches would be free pick any of those to pass through. However, unidirectionality guarantees that exactly one of this bunch will be the closest, and any search which chooses between this group will always select the same one.
I've been bashing my head on this for quite some time: how can the XOR - as in the number of differing bits, a proper Hamming distance - be the basis of a total order?
Well it can't, such a metric on its own is not enough for a comparable relationship, all it can do is dump nodes in circles around a point.
Then I read the paper more closely and noticed that it says "the XOR as an integer value" and it dawned on me: the crux is not the "XOR metric", but the length of the common prefix of the ID (of which XOR is a derivation mechanism.)
Take two nodes with the same Hamming distance from "self" and the length of their prefix common to "self": the one with shortest common prefix is the furthest node.
The paper uses "XOR distance metric" but it really should read "ID prefix length total ordering"
I think this may explain it a wee bit, let me know http://metaquestions.me/2014/08/01/shortest-distance-between-two-points-is-not-always-a-straight-line/
Basically each hop if it were only one bit at a time in a fully populated network (extreme) then would have twice the knowledge of the previous hop. As you converge the knowledge is greater until you get to the closest nodes whose knowledge is ultimate in the network.

Existence of a 0- and 1-valent configurations in the proof of FLP impossibility result

In the known paper Impossibility of Distributed Consensus with one Faulty Process (JACM85), FLP (Fisher, Lynch and Paterson) proved the surprising result that no completely asynchronous consensus protocol can tolerate even a single unannounced process death.
In Lemma 3, after showing that D contains both 0-valent and 1-valent configurations, it says:
Call two configurations neighbors if one results from the other in a single step. By an easy induction, there exist neighbors C₀, C₁ ∈ C such that Dᵢ = e(Cᵢ) is i-valent, i = 0, 1.
I can follow the whole proof except when they claim the existence of such C₀ and C₁. Could you please give me some hints?
D (the set of possible configurations after applying e to elements of C) contains both 0-valent and 1-valent configurations (and is assumed to contain no bivalent configurations).
That is — e maps every element in C to either a 0-valent or a 1-valent configuration. By definition of C, there must be a root element that is connected to all other elements by a series of "neighbour" relationships, so there must be a boundary point where an element in C that leads to a 0-valent configuration after e is neighbours with an element in C that leads to a 1-valent configuration after e.
I once went down the path of reading all these papers only to discover its a complete waste of time.
The result is not surprising at all.
The paper you mention "[Impossibility of Distributed Consensus with One Faulty
Process]" 1
is a long list of complex mathematical proofs that simply equate to:
1) Consensus is a deterministic state
2) one (or more) faulty systems within an environment is a non deterministic environment
3) No deterministic state, action or outcome can ever be reached within a non deterministic environment.
The end. No further thought is required.
This is how it works in the real world outside of acadamia.
If you wish for agents to reach consensus then Synchronous (Timing model) approximation constructs have to be added to make the environment deterministic within a given set of constraints. For example simple constructs like Timeouts, Ack/Nack, Handshake, Witness, or way more complex constructs.
The closer you wish to get to a Synchronous deterministic model the more complex the constructs become. A hypothetical Synchronous model would have infinitely complex constructs. Also bearing in mind that a fully deterministic Synchronous model can never be achieved in a non trivial distributed system. This is because in any non trivial dynamic multi variate system with a variable initial state there exists an infinite number of possible states, actions and outcomes at any point in time. Chaos Theory
Consider the complexity of a construct for detecting a dropped TCP packets because of buffer overflow errors in a router at hop number 21. And the complexity of detecting the same buffer overflow error dropping the detection signal from the construct itself.
Define a mapping f such that f(C) = 0, if e(C) is 0-valent, otherwise, f(C) = 1, if e(C) is 1-valent.
Because e(C) could not be bivalent, if we assume that D has no bivalent configuration, f(C) could only be either 0 or 1.
Arrange accessible configurations from the initial bivalent configuration in a tree, there must be two neighbors C0, C1 in the tree that f(C0) != f(C1). Because, if not, all f(C) are the same, which means that D has only either all 0-valent configurations or all 1-valent configurations.

redundant encoding?

This is more of a computer science / information theory question than a straightforward programming one, so if anyone knows of a better site to post this, please let me know.
Let's say I have an N-bit piece of data that will be sent redundantly in M messages, where at least M-1 of those messages will be received successfully. I am interested in different ways of encoding the N-bit piece of data in fewer bits per message. (this is similar to RAID but at a much smaller level, where N = 8 or 16 or 32)
Example: suppose N = 16 and M = 4. Then I could use the following algorithm:
1st and 3rd message: send "0" + bits 0-7
2nd and 4th message: send "1" + bits 8-15
If I can guarantee that 3 messages of the 4 will get through, then at least one message from each group will get through. Thus I can make this work with 9 bits or less, there's probably a way to do this with fewer total bits but I'm not sure how.
Are there some simple encoding/decoding algorithms to do this kind of thing? Does this problem have a name? (if I know what it's called, I can google it!)
note: in my particular case, the messages either arrive correctly or do not arrive at all (no messages arrive with errors).
(edit: moved 2nd part to a separate question)
(Incomplete answer follows. I may add more later.)
The term you may be interested in is channel coding: adding redundancy to a source in order to make it robust during transmission over a noisy channel. In information theory, the complementary problem to channel coding is source coding: reducing the redundancy in a source to represent it using fewer bits. (The combination of these two problems is called joint source-channel coding.)
Your first question asks to find a channel code. The simple example you give is similar to a repetition code, i.e., you send the same message more than twice (usually an odd number of times), and then the message which is received most often is accepted as the original message.
This code is inefficient. To use standard notation, let k = number of bits in original message, and n = number of bits in the transmitted message. For your example, k = 16 and n = 36. A measure of coding efficiency is k/n, where higher means more efficient. In your case, k/n = 0.44. This is low.
The repetition code is a simple kind of block code, i.e., redundancy is added to each block of k bits to create a codeword of n bits. So are the Hamming and Reed-Solomon codes as others mentioned. Hamming codes are relatively easy to understand with some basic linear algebra.
These should be enough terms for you to search on your own. Good luck.
I'm not sure if I understood all the details of your question correctly, but your problem is definitely aboud designing some kind of error correcting code. This is a vast area of computer science and thick tomes have been written about it. Start with wikipedia and see if you can get any simple schemes (like Hamming or Reed-Solomon codes) to work in your case.
If you want to deal not only with symbol corruption, but also deletion of symbols, you should look at erasure codes, this is definitely a more difficult task but good methods exist in many cases.
EDIT: This material from hackersdelight.org seems a nice introduction.
See erasure codes.
You're looking for a packet erasure code. There are only two useful packet erasure codes that are not totally encumbered by patents, and there's only one open-source library to implement those. Find it here: http://planete-bcast.inrialpes.fr/rubrique.php3?id_rubrique=5
Here's a trivially simple scheme that's almost twice as efficient as your example.
You chopped the message into blocks of (N/M)*2 bits. Instead, chop it into N/(M-1)-bit blocks. (Round it up if necessary.) The first block, src[0], encodes as itself: enc[0]=src[0]. The same for the last block: enc[M-1]=src[M-1]. Each of the other blocks gets XORed with its left neighbor: enc[i]=src[i-1]^src[i].
Prefix each encoded block with a log(M)-bit sequence number, essentially as you did, so the receiver can tell which was dropped. (If you can be sure that whichever blocks arrive will arrive in order, then a 1-bit sequence number will do. Just alternate 0 and 1.)
To decode, successively XOR from the left and the right until you hit the dropped block. E.g. src[1] == enc[0]^enc[1]. (Dropping one of the endpoint blocks isn't a special case -- e.g. if the first block is dropped, the scan from the right recovers it, and the scan from the left is of length 0.)