In Paxos, why can't we use random backoff to avoid collision? - distributed-computing

I understand that the heart of the Paxos consensus algorithm is that any two "majorities" of a given set of nodes must overlap; therefore, if a proposer gets its value accepted by a majority, there cannot be another majority that accepts a different value, given that any acceptor can only accept a single value.
So the simplest "happy path" of a consensus algorithm is just for any proposer to ping a majority of acceptors and see if it can get them to accept its value, and if so, we're done.
The collision comes when concurrent proposers lead to a case where no majority of nodes agrees on a value. This can be demonstrated with the simplest case of 3 nodes: every node tries to get 2 nodes to accept its value, but due to concurrency every node ends up getting only itself to "accept" its own value, and therefore no majority agrees on anything.
Paxos goes on to introduce a 2-phase protocol to solve this problem.
But why can't we simply back off for a random amount of time and retry, until eventually one proposer succeeds in grabbing a majority? It seems this should succeed eventually, since every proposer backs off for a random amount of time whenever it fails to grab a majority.
I understand that this is not going to be ideal in terms of performance. But let's set performance aside for now and only look at correctness. Is there anything I'm missing here? Is this a correct (basic) consensus algorithm at all?

The designer of Paxos is a mathematician first, and he leaves the engineering to others.
As such, Paxos is designed for the general case to prove consensus is always safe, irrespective of any message delays or colliding back-offs.
And now the sad part. The FLP impossibility result is a proof that any system providing this guarantee may run into an infinite loop, i.e. it cannot also guarantee that it always terminates.
Raft is also designed with this guarantee and thus has the same mathematical flaw.
But, the author of Raft also made design choices to specialize Paxos so that an engineer could read the description and make a well-functioning system.
One of these design choices is the well-used trick of exponential random backoff to get around the FLP result in a practical way. This trick does not take away the mathematical possibility of an infinite loop, but does make its likelihood extremely, ridiculously, very small.
You can tack on this trick to Paxos itself, and get the same benefit (and as a professional Paxos maintainer, believe me we do), but then it is not Pure Paxos.
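For concreteness, here is a minimal sketch of what bolting randomized exponential backoff onto a proposer's retry loop might look like. The `run_paxos_round` helper and the delay values are made up for the illustration; this is not taken from any particular implementation.

```python
import random
import time

def propose_with_backoff(run_paxos_round, value, base_delay=0.01, max_delay=5.0):
    """Keep retrying a Paxos round, backing off a random, exponentially
    growing amount of time after each collision.

    `run_paxos_round(value)` is a hypothetical function that runs one
    prepare/accept round and returns the chosen value on success, or
    None if the round was interrupted by a competing proposer.
    """
    delay = base_delay
    while True:
        chosen = run_paxos_round(value)
        if chosen is not None:
            return chosen                     # consensus reached
        # Collision: sleep a random fraction of the current window, then
        # widen the window (capped) so dueling proposers spread apart.
        time.sleep(random.uniform(0, delay))
        delay = min(delay * 2, max_delay)
```

The infinite loop the FLP result allows is still there in the `while True`, but the randomization makes it astronomically unlikely that two proposers keep colliding forever.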
Just to reiterate, the Paxos protocol was designed to be in its most basic form SO THAT mathematicians could prove broad statements about consensus. Any practical implementation details are left to the engineers.
Here is a case where a liveness issue in Raft caused a 6-hour outage: https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/.
Note 1: Yes, I said that the Raft author specialized Paxos. Raft can be mapped onto the more general Vertical Paxos model, which in turn can be mapped onto the Paxos model. As can any system that implements consensus.
Note 2: I have worked with Lamport a few times. He is well aware of these engineering tricks, and he assumes everyone else is, too. Thus he focuses on the math of the problem in his papers, and not the engineering.

The logic you are describing is how leader election is implemented in Raft:
when there is no leader (or the leader goes offline), every node waits for a random delay
after the random delay, the node will contact every other node and propose "let me be the leader"
if the node gets the majority of votes, then the node considers itself the leader, which is equivalent to saying "the cluster got consensus on who is the leader"
if the node did not get the majority, then after a timeout and a random delay, the node will attempt again
Raft also has a concept of a term, but at a high level, the randomized waits are the feature that helps reach consensus faster.
Answering your question "why can't we...": we can, but it will be a different protocol (a rough sketch of such a randomized retry loop is below).
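Purely as an illustration of that retry loop (not Raft's actual implementation; the timeout range and the `request_votes` helper are invented for the sketch):

```python
import random
import time

def try_to_become_leader(node_id, cluster_size, request_votes):
    """Simplified sketch of randomized leader-election retries.

    `request_votes(node_id)` is a hypothetical helper that asks every other
    node for a vote and returns the number of votes received (including
    the candidate's own vote).
    """
    majority = cluster_size // 2 + 1
    while True:
        # Random delay so that two candidates rarely start at the same time.
        time.sleep(random.uniform(0.15, 0.30))   # e.g. 150-300 ms
        votes = request_votes(node_id)
        if votes >= majority:
            return True                          # this node is now the leader
        # Lost or split the vote: loop around and wait a fresh random delay.
```

In real Raft the wait is an election timeout per term rather than a plain sleep, but the effect is the same: randomization breaks the symmetry between competing candidates.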

Related

How is the bias node integrated in NEAT?

In NEAT you can add a special bias input node that is always active. Regarding the implementation of such a node, there is not much information in the original paper. Now I want to know how the bias node should behave, if there is a consensus at all.
So the question is:
Do connections from the bias node come about during evolution, and can they be split for new nodes just like regular connections, or does the bias node always have connections to all non-input nodes?
To answer my own question: on the NEAT users page, Kenneth O. Stanley explains why the bias in NEAT is used as an extra input neuron:
Why does NEAT use a bias node instead of having a bias parameter in each node?
Mainly because not all nodes need a bias. Thus, it would unnecessarily enlarge the search space to be searching for a proper bias for every node in the system. Instead, we let evolution decide which nodes need biases by connecting the bias node to those nodes. This issue is not a major concern; it could work either way. You can easily code a bias into every node and try that as well.
My best guess is therefore that the bias input is treated like any other input in NEAT, with the difference that it is always active.
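A minimal sketch of that interpretation (the genome representation below is invented for illustration and is not taken from any NEAT library; it assumes a single layer where connections run straight from the bias/input nodes to output nodes):

```python
import math

def evaluate_network(connections, inputs):
    """Evaluate a tiny single-layer NEAT-style genome.

    `connections` is a list of (source, target, weight) tuples. Node 0 is
    the bias node, nodes 1..len(inputs) are the regular inputs, and any
    other node id is an output node. The bias node is treated exactly like
    an input whose value is always 1.0.
    """
    activations = {0: 1.0}                        # bias: always active
    for i, value in enumerate(inputs, start=1):   # regular inputs
        activations[i] = value

    totals = {}
    for source, target, weight in connections:
        # A connection from the bias node only exists if evolution added it.
        totals[target] = totals.get(target, 0.0) + activations[source] * weight

    return {node: math.tanh(total) for node, total in totals.items()}
```

Under this reading, splitting a bias connection during a structural mutation works just like splitting any other connection, because the bias node is nothing special beyond its constant value.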

In Paxos, what happens if a proposer is down after its proposal is rejected?

In this figure, the proposal of X is rejected.
At the end of the timeline, S1 and S2 accept X while S3, S4 and S5 accept Y. Proposer X is now supposed to re-send the proposal with value Y.
But what happens if proposer X goes down at that time? How do S1 and S2 eventually learn the value Y?
Thanks in advance!
It is a little hard to answer this from the fragment of a diagram that you've shared since it is not clear what exactly it means. It would be helpful if you could link to the source of that diagram so we can see more of the context of your question. The rest of this answer is based on a guess as to its meaning.
There are three distinct roles in Paxos, commonly known as proposer, acceptor and learner, and I think it aids understanding to divide things into these three roles. The diagram you've shared looks like it is illustrating a set of five acceptors and the messages that they have sent as part of the basic Synod algorithm (a.k.a. single-instance Paxos). In general there's no relationship between the sets of learners and acceptors in a system: there might be a single learner, or there might be thousands, and I think it helps to separate these concepts out. Since S1 and S2 are acceptors, not learners, it doesn't make sense to ask about them learning a value. It is, however, valid to ask about how to deal with a learner that didn't learn a value.
In practical systems there is usually also another role of leader which takes responsibility for pushing the system forward using timeouts and retries and fault detectors and so on, to ensure that all learners eventually learn the chosen value or die trying, but this is outside the scope of the basic algorithm that seems to be illustrated here. In other words, this algorithm guarantees safety ("nothing bad happens") but does not guarantee liveness ("something good happens"). It is acceptable here if some of the learners never learn the chosen value.
The leader can do various things to ensure that all learners eventually learn the chosen value. One of the simplest strategies is to get the learned value from any learner and broadcast it to the other learners, which is efficient and works as long as there is at least one running learner that's successfully learned the chosen value. If there is no such learner, the leader can trigger another round of the algorithm, which will normally result in the chosen value being learned. If it doesn't then its only option is to retry, and keep retrying until eventually one of these rounds succeeds.
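A sketch of that catch-up logic, under the assumption of hypothetical `learners` objects and a `run_new_round` helper (neither comes from the Paxos papers; they only illustrate the two fallbacks described above):

```python
def ensure_all_learners_know(learners, run_new_round):
    """Drive every learner to the chosen value, as a leader might.

    `learners` is a list of objects with a `.learned` attribute (the chosen
    value, or None) and a `.teach(value)` method. `run_new_round()` runs a
    fresh round of the Synod algorithm and returns the value it chose, or
    None if the round was interrupted by a competing proposer.
    """
    # Strategy 1: if any learner already knows the value, just rebroadcast it.
    known = next((l.learned for l in learners if l.learned is not None), None)

    # Strategy 2: otherwise keep running new rounds until one of them succeeds.
    while known is None:
        known = run_new_round()

    for learner in learners:
        if learner.learned is None:
            learner.teach(known)
    return known
```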
In this figure, the proposal of X is rejected.
My reading of the diagram is that it is an "accept request" that is rejected. Page 5, paragraph 1 of Paxos Made Simple describes this message type.
Proposer X is now supposed to re-send the proposal with value Y.
The diagram does not indicate that. Only if Y had been seen in response to the blue initial propose messages would the blue proposer have to choose Y. Yet the blue proposer chose X as the value in its "accept request". If it is properly following Paxos it could not have "seen Y" in response to its initial proposal message; if it had seen Y it would have had to choose it, and so it wouldn't have sent X.
In order to really know what is happening you would need to know what responses were seen by each proposer. We cannot see from the diagram what values, if any, were returned in response to the first three blue propose messages. We don't see in the diagram whether X was previously accepted at any node or not. We don't know if the blue proposer was "free to choose" its own X or had to use an X that was already accepted at one or more nodes.
But what happens if proposer X goes down at that time?
If the blue proposer dies then this is not a problem. The green proposer has successfully fixed the value Y at a majority of the nodes.
How do S1 and S2 eventually learn the value Y?
The more interesting scenario is what happens if the green proposer dies. The green proposer may have sent its accept request messages containing Y and immediately died. As three of the messages are successful, the value Y has been fixed, yet the original proposer may not be alive to see the accept response messages. For any further progress to be made, a new proposer needs to send a new propose message. As three of the nodes will reply with Y, the new proposer will choose Y as the value of its accept request message. This will be sent to all nodes and, if all the messages get through and no other proposer interrupts, S1 and S2 will become consistent.
The essence of the algorithm is collaboration. If a proposer dies, the next proposer will collaborate and choose the value of the highest-numbered proposal previously accepted, if any exists.
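To make that "collaborate" step concrete, here is a minimal sketch of how a new proposer picks its value from the responses it collects; the tuple format of the responses is invented for the example:

```python
def choose_value_from_promises(promises, my_value):
    """Pick the value for the accept request, per the Paxos rule.

    `promises` is the list of replies from a majority of acceptors. Each
    reply is an (accepted_proposal_number, accepted_value) pair, or
    (None, None) if that acceptor has not accepted anything yet.
    """
    accepted = [(n, v) for n, v in promises if n is not None]
    if accepted:
        # Some acceptor already accepted a value: adopt the value from the
        # highest-numbered accepted proposal (Y in the scenario above).
        _, value = max(accepted, key=lambda pair: pair[0])
        return value
    # No acceptor has accepted anything: the proposer is free to use its own value.
    return my_value
```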

How should one set up the immediate reward in an RL program?

I want my RL agent to reach the goal as quickly as possible and at the same time to minimize the number of times it uses a specific resource T (which sometimes though is necessary).
I thought of setting up the immediate rewards as -1 per step, an additional -1 if the agent uses T and 0 if it reaches the goal.
But the additional -1 is completely arbitrary; how do I decide how much punishment the agent should get for using T?
You should use a reward function which mimics your own values. If the resource is expensive (valuable to you), then the punishment for consuming it should be harsh. The same thing goes for time (which is also a resource if you think about it).
If the ratio between the two punishments (the one for time consumption and the one for resource consumption) is in accordance with how you value these resources, then the agent will act precisely in your interest. If you get it wrong (perhaps because you don't know the precise cost of the resource or the precise cost of being slow), then it will strive for a pseudo-optimal solution rather than an optimal one, which in a lot of cases is okay.
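A minimal sketch of such a reward function; the `resource_penalty` parameter is the ratio being discussed, and its default of 1.0 is only a placeholder, not a recommendation:

```python
def immediate_reward(reached_goal, used_resource, resource_penalty=1.0):
    """Per-step reward: -1 per time step, an extra penalty for using the
    resource T, and 0 on reaching the goal.

    `resource_penalty` encodes how much one use of T costs relative to one
    time step; set it according to how you value the two resources.
    """
    if reached_goal:
        return 0.0
    reward = -1.0                       # cost of one time step
    if used_resource:
        reward -= resource_penalty      # extra cost for using T
    return reward
```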

For a Single-Cycle CPU, How Much Energy Is Required for Execution of an ADD Command?

The question is just what the title says. I am wondering about this. Can any expert help?
OK, this was going to be a long answer, so long that I may write an article about it instead. Strangely enough, I've been working on experiments that are closely related to your question: determining performance per watt for a modern processor. As Paul and Sneftel indicated, it's not really possible with any real architecture today. You could probably compute this if you were looking only at the execution of that instruction, given a certain silicon technology and a certain ALU design, by calculating gate leakage, switching currents, voltages, etc. But that isn't a useful value, because there is always something going on (from a HW perspective) in any processor newer than an 8086, and instructions haven't been executed in isolation since the pipeline first came into being.
Today, we have multi-function ALUs, out-of-order execution, multiple pipelines, hyperthreading, branch prediction, memory hierarchies, etc. What does this have to do with the execution of one ADD command? The energy used to execute one ADD command is different from the execution of multiple ADD commands. And if you wrap a program around it, then it gets really complicated.
SORT-OF-AN-ANSWER:
So let's look at what you can do.
1. Statistically measure running a given add over and over again. Remember that there are many different types of adds, such as integer adds, floating-point, double precision, adds with carries, and even simultaneous adds (SIMD), to name a few. Limits: OSes and other apps are always there, though you may be able to run on bare metal if you know how; results vary with different hardware, silicon technologies, architectures, etc.; probably not useful because it is so far from reality that it means little; limits of the measurement equipment (in-processor PMUs, at-the-wall meters, interposer sockets, etc.); memory hierarchy; and more. (A crude sketch of this option follows after this answer.)
2. Statistically measure an integer/floating-point/double-based workload kernel. This is beginning to have some meaning because it means something to the community. Limits: still not real; still varies with architecture, silicon technology, hardware, etc.; measurement equipment limits; etc.
3. Statistically measure a real application. Limits: same as above, but at least it means something to the community; power states come into play during periods of idle; potentially, cluster issues come into play.
When I say "Limits", that just means you need to well define the constraints of your answer / experiment, not that it isn't useful.
SUMMARY: it is possible to come up with a value for one add, but it doesn't really mean anything anymore. A value that does mean something is way more complicated and requires a lot of work to find, but it is useful.
By the way, I do think it is a good and important question -- in part because it is so deceptively simple.
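As a crude sketch of option 1 above, under some big assumptions (a Linux box exposing Intel RAPL counters under /sys/class/powercap, the whole package counter standing in for the energy of the add, loop and interpreter overhead ignored, and counter wrap-around ignored):

```python
# Assumption: Intel RAPL package-0 energy counter, in microjoules.
# The exact sysfs path varies by machine and may require root to read.
RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj():
    with open(RAPL_ENERGY) as f:
        return int(f.read())

def energy_per_add(iterations=100_000_000):
    """Very rough estimate: run a tight loop of integer adds and divide the
    measured package energy by the iteration count. Every caveat from the
    answer above (OS noise, pipelines, everything else the package is doing)
    still applies, so treat the result as an upper bound at best.
    """
    before = read_energy_uj()
    acc = 0
    for _ in range(iterations):
        acc += 1                        # the 'add' under test (plus loop overhead)
    after = read_energy_uj()
    return (after - before) / iterations    # microjoules per iteration, roughly

if __name__ == "__main__":
    print(f"~{energy_per_add():.6f} uJ per add-ish iteration")
```

In a Python loop the interpreter overhead dwarfs the add itself, so a serious version would run a compiled kernel of adds; the point of the sketch is only the measure-and-divide structure.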

Difference between Latency and Jitter in Operating Systems

When discussing criteria for operating systems, I keep hearing the terms interrupt latency and OS jitter. And now I ask myself: what is the difference between these two?
In my opinion, the interrupt latency is the delay from the occurrence of an interrupt until the interrupt service routine (ISR) is entered.
Jitter, on the other hand, is how much the moment of entering the ISR varies over time.
Is this how you understand it as well?
Your understanding is basically correct.
Latency = Delay between an event happening in the real world and code responding to the event.
Jitter = Differences in Latencies between two or more events.
In the realm of clustered computing, especially when dealing with massive scale out solutions, there are cases where work distributed across many systems (and many many processor cores) needs to complete in fairly predictable time-frames. An operating system, and the software stack being leveraged, can introduce some variability in the run-times of these "chunks" of work. This variability is often referred to as "OS Jitter".
Interrupt latency, as you said, is the time between the interrupt signal and entry into the interrupt handler.
The two concepts are orthogonal to each other. However, in practice, more interrupts generally implies more OS jitter.
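As a toy illustration of the two definitions (the timestamps below are made up; real measurements would come from hardware timers or tracing tools):

```python
import statistics

# Hypothetical measurements: when each event occurred vs. when the handler ran.
event_times   = [0.000, 1.000, 2.000, 3.000]        # seconds
handler_times = [0.0002, 1.0005, 2.0001, 3.0009]    # seconds

# Latency: delay between an event and the code responding to it.
latencies = [h - e for e, h in zip(event_times, handler_times)]

# Jitter: how much those latencies differ from one another.
jitter = max(latencies) - min(latencies)             # spread; stdev is also common

print("latencies (s):", latencies)
print("jitter (s):", jitter, "stdev (s):", statistics.stdev(latencies))
```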