In Paxos, what happens if a proposer is down after its proposal is rejected? - distributed-computing

In this figure, the proposal of X is rejected.
At the end of the timeline, S1 and S2 accept X while S3, S4 and S5 accept Y. Proposer X is now supposed to re-send the proposal with value Y.
But what happens if proposer X goes down at that time? How do S1 and S2 eventually learn the value Y?
Thanks in advance!

It is a little hard to answer this from the fragment of a diagram that you've shared since it is not clear what exactly it means. It would be helpful if you could link to the source of that diagram so we can see more of the context of your question. The rest of this answer is based on a guess as to its meaning.
There are three distinct roles in Paxos, commonly known as proposer, acceptor and learner, and I think it aids understanding to divide things into these three roles. The diagram you've shared looks like it is illustrating a set of five acceptors and the messages that they have sent as part of the basic Synod algorithm (a.k.a. single-instance Paxos). In general there's no relationship between the sets of learners and acceptors in a system: there might be a single learner, or there might be thousands, and I think it helps to separate these concepts out. Since S1 and S2 are acceptors, not learners, it doesn't make sense to ask about them learning a value. It is, however, valid to ask about how to deal with a learner that didn't learn a value.
In practical systems there is usually also another role of leader which takes responsibility for pushing the system forward using timeouts and retries and fault detectors and so on, to ensure that all learners eventually learn the chosen value or die trying, but this is outside the scope of the basic algorithm that seems to be illustrated here. In other words, this algorithm guarantees safety ("nothing bad happens") but does not guarantee liveness ("something good happens"). It is acceptable here if some of the learners never learn the chosen value.
The leader can do various things to ensure that all learners eventually learn the chosen value. One of the simplest strategies is to get the learned value from any learner and broadcast it to the other learners, which is efficient and works as long as there is at least one running learner that's successfully learned the chosen value. If there is no such learner, the leader can trigger another round of the algorithm, which will normally result in the chosen value being learned. If it doesn't then its only option is to retry, and keep retrying until eventually one of these rounds succeeds.

In this figure, the proposal of X is rejected.
My reading of the diagram is that it is an "accept request" that is rejected. Page 5, paragraph 1 of Paxos Made Simple describes this message type.
Proposer X is now supposed to re-send the proposal with value Y.
The diagram does not indicate that. The blue proposer would have to choose Y only if Y had been seen in a response to the blue initial propose messages. Yet the blue proposer chose X as the value in its "accept request". If it is properly following Paxos, it could not have "seen Y" in response to its initial propose message. Had it seen Y, it would have been obliged to choose Y, and so it wouldn't have sent X.
In order to really know what is happening you would need to know what responses were seen by each proposer. We cannot see from the diagram what values, if any, were returned in response to the first three blue propose messages. We don't see in the diagram whether or not X was previously accepted at any node. We don't know if the blue proposer was "free to choose" its own X or had to use an X that was already accepted at one or more nodes.
But what happens if proposer X goes down at that time?
If the blue proposer dies then this is not a problem. The green proposer has successfully fixed the value Y at a majority of the nodes.
How do S1 and S2 eventually learn the value Y?
The more interesting scenario is what happens if the green proposer dies. The green proposer may have sent its accept request messages containing Y and immediately died. As three of the messages were successful, the value Y has been fixed, yet the original proposer may not be alive to see the accept response messages. For any further progress to be made, a new proposer needs to send a new propose message. Any majority of the five nodes includes at least one of the three that accepted Y, so the new proposer will see Y in the responses and will choose Y as the value of its accept request message. This will be sent to all nodes and, if all messages get through and no other proposer interrupts, S1 and S2 will become consistent.
The essence of the algorithm is collaboration. If a proposer dies, the next proposer will collaborate and choose the value of the highest-numbered proposal previously accepted, if any exists.
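To make the "collaborate" rule concrete, here is a minimal Python sketch of how a new proposer could pick the value for its accept request from the responses to its propose messages. The `Promise` type, the function name and the proposal numbers 1 and 2 are my own assumptions for illustration; the diagram does not show the actual numbers.

```python
from typing import NamedTuple, Optional, Sequence


class Promise(NamedTuple):
    """Reply to a propose message: the highest-numbered proposal
    this acceptor has already accepted, or (None, None) if none."""
    accepted_number: Optional[int]
    accepted_value: Optional[str]


def choose_value(promises: Sequence[Promise], own_value: str) -> str:
    """Pick the value for the accept request from a majority of promises."""
    accepted = [p for p in promises if p.accepted_number is not None]
    if not accepted:
        # No acceptor in this majority has accepted anything yet,
        # so the proposer is free to choose its own value.
        return own_value
    # Otherwise it must adopt the value of the highest-numbered accepted proposal.
    return max(accepted, key=lambda p: p.accepted_number).accepted_value


# Assumed numbering: X was accepted under proposal 1, Y under proposal 2.
# A majority of replies from, say, S1, S3 and S4:
replies = [Promise(1, "X"), Promise(2, "Y"), Promise(2, "Y")]
print(choose_value(replies, own_value="Z"))
```

With these assumed numbers the new proposer ends up sending Y rather than its own Z, which is how S1 and S2 eventually converge on Y.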

Related

In Paxos, why can't we use random backoff to avoid collision?

I understand that the heart of Paxos consensus algorithm is that there is only one "majority" in any given set of nodes, therefore if a proposer gets accepted by a majority, there cannot be another majority that accepts a different value, given that any acceptor can only accept 1 single value.
So the simplest "happy path" of a consensus algorithm is just for any proposer to ping a majority of acceptors and see if it can get them to accept its value, and if so, we're done.
The collision comes when concurrent proposers lead to a case where no majority of nodes agrees on a value. This can be demonstrated with the simplest case of 3 nodes: every node tries to get 2 nodes to accept its value but, due to concurrency, every node ends up getting only itself to "accept" the value, and therefore no majority agrees on anything.
The Paxos algorithm goes on to introduce a 2-phase protocol to solve this problem.
But why can't we simply back off for a random amount of time and retry, until eventually one proposer succeeds in grabbing a majority? This can be demonstrated to succeed eventually, since every proposer will back off for a random amount of time if it fails to grab a majority.
I understand that this is not going to be ideal in terms of performance. But let's get performance out of the way first and only look at the correctness. Is there anything I'm missing here? Is this a correct (basic) consensus algorithm at all?
The designer of Paxos is a mathematician first, and he leaves the engineering to others.
As such, Paxos is designed for the general case to prove consensus is always safe, irrespective of any message delays or colliding back-offs.
And now the sad part. The FLP impossibility result is a proof that any asynchronous system with this guarantee may fail to terminate, i.e. it may run into an infinite loop.
Raft is also designed with this guarantee and thus has the same mathematical flaw.
But, the author of Raft also made design choices to specialize Paxos so that an engineer could read the description and make a well-functioning system.
One of these design choices is the well-used trick of exponential random backoff to get around the FLP result in a practical way. This trick does not take away the mathematical possibility of an infinite loop, but does make its likelihood extremely, ridiculously, very small.
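As a rough illustration of the trick, here is a minimal Python sketch of capped exponential backoff with random jitter. The function name and the `base` and `cap` values are assumptions of mine, not taken from any particular Paxos or Raft implementation.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 5.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a random wait in seconds, drawn from [0, min(cap, base * 2**attempt)].

    `attempt` is the number of failed rounds so far (0 for the first retry).
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


# Each colliding proposer would sleep for its own random delay before retrying.
rng = random.Random(42)
for attempt in range(5):
    print(attempt, round(backoff_delay(attempt, rng=rng), 3))
```

The window doubles after every failed round, so two proposers that keep colliding become less and less likely to wake up at the same moment. The probability of an endless duel shrinks quickly, but it never reaches zero.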
You can tack on this trick to Paxos itself, and get the same benefit (and as a professional Paxos maintainer, believe me we do), but then it is not Pure Paxos.
Just to reiterate, the Paxos protocol was designed to be in its most basic form SO THAT mathematicians could prove broad statements about consensus. Any practical implementation details are left to the engineers.
Here is a case where a liveness issue in Raft caused a 6-hour outage: https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/.
Note 1: Yes, I said that the Raft author specialized Paxos. Raft can be mapped onto the more general Vertical Paxos model, which in turn can be mapped onto the Paxos model. As can any system that implements consensus.
Note 2: I have worked with Lamport a few times. He is well aware of these engineering tricks, and he assumes everyone else is, too. Thus he focuses on the math of the problem in his papers, and not the engineering.
The logic you are describing is how leader election is implemented in Raft:
when there is no leader (or leader goes offline) every node will have a random delay
after the random delay, the node will contact every other node and propose "let me be the leader"
if the node gets the majority of votes, then the node considers itself the leader: which is equivalent of saying "the cluster got the consensus on who is the leader"
if the node did not get the majority, then after a timeout and a random delay, the node will attempt again
Raft also has a concept of term, but at a high level, the randomized wait is the feature which helps to reach consensus faster.
Answering your questions "why can't we..." - we can, it will be a different protocol.
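The steps above can be sketched as a toy simulation in Python. This is not Raft itself: there are no logs and no real messages, the `latency` parameter is my own simplification for "how long a vote request takes to arrive", and the 150-300 ms timeout range is only an assumed default.

```python
import random


def run_election(node_ids, rng, low=0.150, high=0.300, latency=0.020, max_terms=100):
    """Repeat terms until one candidate collects a majority of votes.

    Returns (leader, term), or (None, max_terms) if no term produced a leader.
    """
    majority = len(node_ids) // 2 + 1
    for term in range(1, max_terms + 1):
        # Every node picks a fresh random delay for this term.
        timeouts = {n: rng.uniform(low, high) for n in node_ids}
        earliest = min(timeouts, key=timeouts.get)
        # Nodes that time out before the earliest candidate's request reaches
        # them become candidates too and vote for themselves.
        candidates = [n for n in node_ids
                      if timeouts[n] < timeouts[earliest] + latency]
        # Everyone else grants its single vote to the earliest candidate.
        votes = len(node_ids) - len(candidates) + 1
        if votes >= majority:
            return earliest, term
    return None, max_terms


leader, term = run_election(["n1", "n2", "n3", "n4", "n5"], random.Random(1))
print(leader, term)
```

A split vote only happens when too many nodes wake up within `latency` of each other, and every new term draws new delays, so a leader normally emerges after very few terms.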

Observation Space for race strategy development - Reinforcement learning

I refrained from asking for help until now, but as my thesis' deadline creeps ever closer and I do not know anybody with experience in RL, I'm trying my luck here.
TL;DR
I have not found an academic/online resource which helps me understand the correct representation of the environment as an observation space. I would be very thankful for any links or for giving me a starting point of how to model the specifics of my environment in an observation space.
Short thematic introduction
The goal of my research is to determine the viability of RL for strategy development in motorsports. This is currently achieved by simulating (lots of!) races and calculating the resulting race time (thus end-position) of different strategic decisions (which are the timing of pit stops + amount of laps to refuel for). This demands a manual input of expected inlaps (the lap a pit stop occurs) for all participants, which implicitly limits the possible strategies by human imagination as well as the amount of possible simulations.
Use of RL
A trained RL agent could decide on its own when to perform a pit stop and how much fuel should be added, in order to minimize the race time and react to probabilistic events in the simulation.
The action space is discrete(4) and represents the options to continue, pit and refuel for 2,4,6 laps respectively.
Problem
The observation space is of POMDP nature and needs to model the agent's current race position (which I hope is enough?). How would I implement the observation space accordingly?
The training is performed using OpenAI's Gym framework, but a general explanation/link to article/publication would also be appreciated very much!
Your observation could be just an integer which represents the round (lap) or the position the agent is in. On its own this is obviously not a sufficient representation, so you need to add more information.
A better observation could be the agent's race position x1, the round the agent is in x2 and the current fuel in the tank x3. All three of these can be represented by a real number. You can then create your observation by concatenating them into a vector obs = [x1, x2, x3].
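A minimal sketch of building such a vector in plain Python follows. The function name, the parameters `n_cars`, `total_laps` and `tank_laps`, and the scaling to [0, 1] are assumptions of mine (scaling is a common habit, not something your simulation requires).

```python
from typing import List


def make_observation(position: int, lap: int, fuel_laps: float,
                     n_cars: int, total_laps: int, tank_laps: float) -> List[float]:
    """Scale position, lap and fuel to [0, 1] and concatenate them into one vector."""
    x1 = (position - 1) / max(n_cars - 1, 1)  # 0.0 = leading, 1.0 = last place
    x2 = lap / total_laps                     # race progress
    x3 = fuel_laps / tank_laps                # fuel left, as a fraction of a full tank
    return [x1, x2, x3]


obs = make_observation(position=3, lap=10, fuel_laps=4.0,
                       n_cars=20, total_laps=50, tank_laps=8.0)
print(obs)
```

In Gym, a vector like this would typically be declared as a `spaces.Box` with `low=0.0`, `high=1.0` and `shape=(3,)`, and returned from `reset()` and `step()` as a NumPy array.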

What do matrix clocks solve but vector clocks can't?

I understand the need for vector clocks in terms of scalar logical clocks failing to provide enough information to tell whether there is an update conflict in a key value store update for example.
But I am not sure what problem is still unsolved by vector clocks and then solved by the more bulky matrix clocks?
In an eventual consistency environment, every message ever created by the system needs to be kept until every peer has received it (== eventual consistency). But you don't want to keep messages forever, so you need a way to tell which messages have been received by all nodes and can be deleted. This is why you use matrix clocks.
A matrix clock is a list of vector clocks, one per node, so you know the current state of each node in the system. Based on this you can tell which peer has already received which messages. When you exchange messages with another node in the system, you compare the matrix clocks and always keep the highest value for each entry. Afterwards you can delete the messages that every node is known to have received already.
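A minimal Python sketch of that idea, assuming three nodes numbered 0 to 2 and a simplified update rule (element-wise maximum, plus absorbing the sender's own row). The class and method names are mine; this is not the TSAE implementation.

```python
from typing import List


class MatrixClock:
    def __init__(self, node: int, n: int) -> None:
        self.node = node
        # m[k][l] = what this node believes node k knows about node l's events.
        self.m: List[List[int]] = [[0] * n for _ in range(n)]

    def local_event(self) -> None:
        """Record a new message created at this node."""
        self.m[self.node][self.node] += 1

    def merge(self, sender: int, other: List[List[int]]) -> None:
        """Merge the matrix received from `sender`, keeping the highest values."""
        n = len(self.m)
        for k in range(n):
            for l in range(n):
                self.m[k][l] = max(self.m[k][l], other[k][l])
        # This node now knows everything the sender knew.
        for l in range(n):
            self.m[self.node][l] = max(self.m[self.node][l], other[sender][l])

    def stable_time(self, origin: int) -> int:
        """Highest timestamp of `origin` that every node is known to have seen."""
        return min(row[origin] for row in self.m)


c0, c1, c2 = MatrixClock(0, 3), MatrixClock(1, 3), MatrixClock(2, 3)
c0.local_event()
c0.local_event()            # node 0 has created messages 1 and 2
c1.merge(0, c0.m)           # node 1 receives them
c2.merge(0, c0.m)           # node 2 receives them
c0.merge(1, c1.m)           # node 0 hears back from node 1 only
before = c0.stable_time(0)  # node 2 is still unconfirmed
c0.merge(2, c2.m)           # node 0 hears back from node 2
after = c0.stable_time(0)
print(before, after)
```

Only after hearing from both peers can node 0 be sure that its first two messages have been received everywhere, and only then is it safe to delete them.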
This is a very brief description of TSAE (timestamped anti-entropy) protocol. You can read more about it in the dissertation project Weak-consistency group communication and membership by Richard Andrew Golding from 1992 (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.7385&rep=rep1&type=pdf) starting from chapter 5.
The distinctions among Lamport clock (scalar logical clock, in your term), vector clock, and matrix clock lie in that they represent different levels of knowledge.
For the vector clock $vt_i[1 \ldots n]$ at site $S_i$, the entry $vt_i[k]$ represents the knowledge the site $S_i$ has about site $S_k$. The knowledge has the form of "$S_i$ knows about $S_k$ that $\ldots$".
For the matrix clock $mt_i[1 \ldots n, 1 \ldots n]$ at site $S_i$, the entry $mt_i[k,l]$ represents the knowledge the site $S_i$ has about the knowledge held by $S_k$ about site $S_l$. The knowledge here has the form of "$S_i$ knows that $S_k$ knows about $S_l$ that $\ldots$".
Intuitively, we can do more things with more knowledge.
The following description is mainly quoted from [1]:
Vector clocks and matrix clocks are widely used in asynchronous distributed message-passing systems.
Some example areas using vector clocks are checkpointing, causal memory, maintaining consistency of replicated files, global snapshot, global time approximation, termination detection, bounded multiwriter construction of shared variables, mutual exclusion and debugging (predicate detection).
Some example areas that use matrix clocks are designing fault-tolerant protocols and distributed database protocols, including protocols to discard obsolete information in distributed databases, and protocols to solve the replicated log and replicated dictionary problems.
For matrix clock, we notice that
$\min_k(mt_i[k,i]) \ge t$ means that site $S_i$ knows that every other site $S_k$ knows its progress up to its local time $t$.
It is this property that allows a site to stop sending information with a local time $\le t$, or to discard such obsolete information.
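A small worked example may help. The numbers are made up for illustration. Suppose there are three sites and the matrix clock at $S_1$ is

$$
mt_1 = \begin{pmatrix} 5 & 4 & 3 \\ 3 & 4 & 2 \\ 4 & 2 & 3 \end{pmatrix}
$$

The first column holds what each site is known to know about $S_1$: $mt_1[1,1] = 5$, $mt_1[2,1] = 3$, $mt_1[3,1] = 4$. Then

$$
\min_k(mt_1[k,1]) = \min(5, 3, 4) = 3
$$

so $S_1$ knows that every site has seen its events up to local time $3$. Information from $S_1$ with local time $\le 3$ can be discarded, while the information with local times $4$ and $5$ must be kept, because as far as $S_1$ knows, $S_2$ has not seen it yet.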
[1] Concurrent Knowledge and Logical Clock Abstractions Ajay D. Kshemkalyani 2000

Byzantine's General

So I was reading Lamport's paper on Byzantine Generals in which he proves that for T malicious generals you need 2T+1 generals in a group to reach a consensus. However, I don't understand how. If there are T malicious nodes making up stuff, you just need T+1 votes to outvote them. Why is that not the case?
There is a section on Wikipedia about this:
One solution considers scenarios in which messages may be forged, but which will be Byzantine-fault-tolerant as long as the number of traitorous generals does not equal or exceed one third. The impossibility of dealing with one-third or more traitors ultimately reduces to proving that the 1 Commander + 2 Lieutenants problem cannot be solved, if the Commander is traitorous. The reason is, if we have three commanders, A, B, and C, and A is the traitor: when A tells B to attack and C to retreat, and B and C send messages to each other, forwarding A's message, neither B nor C can figure out who is the traitor, since it isn't necessarily A – the other commander could have forged the message purportedly from A. It can be shown that if n is the number of generals in total, and t is the number of traitors in that n, then there are solutions to the problem only when n is greater than or equal to 3t + 1
you just need T+1 votes to outvote them. Why is that not the case?
This makes sense if all loyal generals produce the same answer, but that's not the case for Byzantine Generals Problem (BGP) systems, where each honest element can give you a different answer.
BGP is for systems where each element sees different information. Example: redundant radars. It is not for systems where the elements are mirrored (e.g. redundant HDs).
Example:
Generals: A, B, C;
Traitor: C;
A says "attack";
B says "retreat";
C says "attack" to A, and "retreat" to B;
Result: A thinks it has reached agreement and it will attack alone;
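The example above can be replayed in a few lines of Python. This is only a sketch of naive majority voting, assuming each general simply counts its own opinion plus the two messages it receives.

```python
from collections import Counter

# says[sender][receiver]: what each general tells each other general.
# C is the traitor and tells A and B different things.
says = {
    "A": {"B": "attack", "C": "attack"},
    "B": {"A": "retreat", "C": "retreat"},
    "C": {"A": "attack", "B": "retreat"},
}
own_opinion = {"A": "attack", "B": "retreat"}


def decide(general: str) -> str:
    """Majority vote over the general's own opinion and the messages it received."""
    votes = [own_opinion[general]]
    votes += [says[sender][general] for sender in says if sender != general]
    return Counter(votes).most_common(1)[0][0]


for loyal in ("A", "B"):
    print(loyal, decide(loyal))
```

A counts two votes for "attack" and B counts two votes for "retreat", so both loyal generals believe they are following the majority and still end up doing different things.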

redundant encoding?

This is more of a computer science / information theory question than a straightforward programming one, so if anyone knows of a better site to post this, please let me know.
Let's say I have an N-bit piece of data that will be sent redundantly in M messages, where at least M-1 of those messages will be received successfully. I am interested in different ways of encoding the N-bit piece of data in fewer bits per message. (this is similar to RAID but at a much smaller level, where N = 8 or 16 or 32)
Example: suppose N = 16 and M = 4. Then I could use the following algorithm:
1st and 3rd message: send "0" + bits 0-7
2nd and 4th message: send "1" + bits 8-15
If I can guarantee that 3 messages of the 4 will get through, then at least one message from each group will get through. Thus I can make this work with 9 bits per message or less. There's probably a way to do this with fewer total bits, but I'm not sure how.
Are there some simple encoding/decoding algorithms to do this kind of thing? Does this problem have a name? (if I know what it's called, I can google it!)
note: in my particular case, the messages either arrive correctly or do not arrive at all (no messages arrive with errors).
(edit: moved 2nd part to a separate question)
(Incomplete answer follows. I may add more later.)
The term you may be interested in is channel coding: adding redundancy to a source in order to make it robust during transmission over a noisy channel. In information theory, the complementary problem to channel coding is source coding: reducing the redundancy in a source to represent it using fewer bits. (The combination of these two problems is called joint source-channel coding.)
Your first question asks to find a channel code. The simple example you give is similar to a repetition code, i.e., you send the same message more than twice (usually an odd number of times), and then the message which is received most often is accepted as the original message.
This code is inefficient. To use standard notation, let k = number of bits in original message, and n = number of bits in the transmitted message. For your example, k = 16 and n = 36. A measure of coding efficiency is k/n, where higher means more efficient. In your case, k/n = 0.44. This is low.
The repetition code is a simple kind of block code, i.e., redundancy is added to each block of k bits to create a codeword of n bits. So are the Hamming and Reed-Solomon codes as others mentioned. Hamming codes are relatively easy to understand with some basic linear algebra.
These should be enough terms for you to search on your own. Good luck.
I'm not sure if I understood all the details of your question correctly, but your problem is definitely about designing some kind of error correcting code. This is a vast area of computer science and thick tomes have been written about it. Start with Wikipedia and see if you can get any simple schemes (like Hamming or Reed-Solomon codes) to work in your case.
If you want to deal not only with symbol corruption but also with deletion of symbols, you should look at erasure codes. This is definitely a more difficult task, but good methods exist in many cases.
EDIT: This material from hackersdelight.org seems a nice introduction.
See erasure codes.
You're looking for a packet erasure code. There are only two useful packet erasure codes that are not totally encumbered by patents, and there's only one open-source library to implement those. Find it here: http://planete-bcast.inrialpes.fr/rubrique.php3?id_rubrique=5
Here's a trivially simple scheme that's almost twice as efficient as your example.
You chopped the message into blocks of (N/M)*2 bits. Instead, chop it into M-1 blocks of N/(M-1) bits each (round up if necessary), src[0] to src[M-2], and send M encoded blocks, enc[0] to enc[M-1]. The first source block encodes as itself: enc[0]=src[0]. The same goes for the last source block: enc[M-1]=src[M-2]. Each of the other encoded blocks is a source block XORed with its left neighbor: enc[i]=src[i-1]^src[i] for 0 < i < M-1.
Prefix each encoded block with a sequence number of log2(M) bits (rounded up), essentially as you did, so the receiver can tell which block was dropped. (If you can be sure that whichever blocks arrive will arrive in order, then a 1-bit sequence number will do. Just alternate 0 and 1.)
To decode, successively XOR from the left and the right until you hit the dropped block. E.g. src[1] == enc[0]^enc[1]. (Dropping one of the endpoint blocks isn't a special case -- e.g. if the first block is dropped, the scan from the right recovers it, and the scan from the left is of length 0.)
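Here is a minimal Python sketch of the scheme, assuming blocks are small integers and that the receiver gets the encoded blocks as a dict keyed by sequence number. The function names and that representation are my own choices.

```python
from typing import Dict, List, Optional


def encode(src: List[int]) -> List[int]:
    """Turn M-1 source blocks into M encoded blocks."""
    enc = [src[0]]                          # first block is sent as itself
    for i in range(1, len(src)):
        enc.append(src[i - 1] ^ src[i])     # middle blocks: XOR with left neighbor
    enc.append(src[-1])                     # last source block is sent as itself
    return enc


def decode(received: Dict[int, int], m: int) -> List[int]:
    """Recover the M-1 source blocks from at least M-1 of the M encoded blocks."""
    src: List[Optional[int]] = [None] * (m - 1)
    # Scan from the left until the dropped block: src[i] = enc[i] ^ src[i-1].
    prev = 0
    for i in range(m - 1):
        if i not in received:
            break
        prev = received[i] ^ prev
        src[i] = prev
    # Scan from the right until the dropped block: src[i-1] = enc[i] ^ src[i].
    nxt = 0
    for i in range(m - 1, 0, -1):
        if i not in received:
            break
        nxt = received[i] ^ nxt
        src[i - 1] = nxt
    if any(block is None for block in src):
        raise ValueError("more than one block was dropped")
    return src


# N = 16 and M = 4 gives three 6-bit source blocks (the last one padded).
source = [0b101101, 0b000111, 0b110010]
encoded = encode(source)
for dropped in range(len(encoded)):
    arrived = {i: b for i, b in enumerate(encoded) if i != dropped}
    assert decode(arrived, m=4) == source
print("recovered the source for every single dropped block")
```

For N = 16 and M = 4 each message carries 6 data bits plus a 2-bit sequence number, i.e. 8 bits per message instead of 9.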