What does asynchronous communication in the FLP impossibility result consist of? - distributed-computing

Conditions:
(i) Is it just async networks, that is, messages might be arbitrarily delayed?
(ii) Does it also require that processors may run at arbitrarily different relative speeds?
(iii) Is the absence of a global clock also assumed?
I know that condition (i) is surely required, but are conditions (ii) and (iii) also a must? Would the impossibility fail to hold if, say, a synchronized clock with timeouts could be assumed?


In Paxos, why can't we use random backoff to avoid collision?

I understand that the heart of the Paxos consensus algorithm is that any two "majorities" in a given set of nodes must overlap; therefore, if a proposer gets its value accepted by a majority, there cannot be another majority that accepts a different value, given that any acceptor only ever accepts a single value.
So the simplest "happy path" of a consensus algorithm is just for any proposer to ping a majority of acceptors and see if it can get them to accept its value, and if so, we're done.
The collision comes when concurrent proposers lead to a case where no majority of nodes agrees on a value. This can be demonstrated with the simplest case of 3 nodes: every node tries to get 2 nodes to accept its value, but due to concurrency every node ends up getting only itself to "accept" its value, and therefore no majority agrees on anything.
The Paxos algorithm goes on to introduce a two-phase protocol to solve this problem.
But why can't we simply back off for a random amount of time and retry, until eventually one proposer succeeds in grabbing a majority? It seems this must eventually succeed, since every proposer backs off for a random amount of time whenever it fails to grab a majority.
I understand that this is not going to be ideal in terms of performance. But let's get performance out of the way first and only look at the correctness. Is there anything I'm missing here? Is this a correct (basic) consensus algorithm at all?
The designer of Paxos is a mathematician first, and he leaves the engineering to others.
As such, Paxos is designed for the general case to prove consensus is always safe, irrespective of any message delays or colliding back-offs.
And now the sad part. The FLP impossibility result is a proof that any system with this guarantee may run into an infinite loop.
Raft is also designed with this guarantee and thus shares the same mathematical flaw.
But, the author of Raft also made design choices to specialize Paxos so that an engineer could read the description and make a well-functioning system.
One of these design choices is the well-used trick of exponential random backoff to get around the FLP result in a practical way. This trick does not take away the mathematical possibility of an infinite loop, but does make its likelihood extremely, ridiculously, very small.
You can tack on this trick to Paxos itself, and get the same benefit (and as a professional Paxos maintainer, believe me we do), but then it is not Pure Paxos.
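For illustration only, here is a minimal Python sketch of that retry-with-backoff trick, assuming a hypothetical try_get_majority(value) callable that attempts one round of getting a majority of acceptors to accept the value. It shows only the exponential random backoff loop, says nothing about Paxos safety, and still has the theoretical possibility of retrying forever:

    import random
    import time

    def propose_with_backoff(try_get_majority, value, base_delay=0.01, max_delay=1.0):
        # try_get_majority(value) is a hypothetical stub: it returns True if a
        # majority of acceptors accepted `value` in this round, False on collision.
        delay = base_delay
        while True:
            if try_get_majority(value):
                return value                      # a majority accepted our value
            # Collided with other proposers: sleep a random fraction of the
            # current window, then widen the window (exponential backoff).
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, max_delay)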
Just to reiterate, the Paxos protocol was designed to be in its most basic form SO THAT mathematicians could prove broad statements about consensus. Any practical implementation details are left to the engineers.
Here is a case where a liveness issue in Raft caused a 6-hour outage: https://decentralizedthoughts.github.io/2020-12-12-raft-liveness-full-omission/.
Note 1: Yes, I said that the Raft author specialized Paxos. Raft can be mapped onto the more general Vertical Paxos model, which in turn can be mapped onto the Paxos model. As can any system that implements consensus.
Note 2: I have worked with Lamport a few times. He is well aware of these engineering tricks, and he assumes everyone else is, too. Thus he focuses on the math of the problem in his papers, and not the engineering.
The logic you are describing is how leader election is implemented in Raft:
when there is no leader (or leader goes offline) every node will have a random delay
after the random delay, the node will contact every other node and propose "let me be the leader"
if the node gets the majority of votes, then the node considers itself the leader: which is equivalent to saying "the cluster reached consensus on who is the leader"
if the node did not get the majority, then after a timeout and a random delay, the node will attempt again
Raft also has a concept of a term, but at a high level, the randomized wait is the feature that helps reach consensus faster.
Answering your question "why can't we..." - we can; it will just be a different protocol.
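A rough Python sketch of that election loop, for illustration: request_vote(peer, candidate_id, term) is a hypothetical RPC stub of my own, and real Raft additionally checks terms, log up-to-dateness and heartbeats; the point is only where the random delay fits:

    import random
    import time

    def run_election_round(node_id, peers, term, request_vote,
                           timeout_range=(0.150, 0.300)):
        # Wait a random delay so that concurrent candidates rarely collide.
        time.sleep(random.uniform(*timeout_range))

        votes = 1  # the candidate votes for itself
        for peer in peers:
            if request_vote(peer, node_id, term):   # hypothetical RPC stub
                votes += 1

        majority = (len(peers) + 1) // 2 + 1        # cluster size = peers + self
        if votes >= majority:
            return "leader"    # "the cluster got consensus on who is the leader"
        return "retry"         # time out, pick a new random delay, try again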

Out of Order Execution, How to Solve True Dependency?

I was reading about OoOE (out-of-order execution) and read about how we can solve false dependencies (by using register renaming).
But my question is: how can we solve a true dependency (RAW, read after write)?
For example:
R1=R2+R3 #1
R1=R4+R5 #2
R9=R1 #3
Renaming won't help here in case the CPU chooses to run #2 before #1.
There is no way to really avoid them; that's why RAW hazards are called true dependencies. Instructions have to wait for their inputs to be ready before they can execute. (With OoO exec, CPUs will normally dispatch the oldest-ready-first instructions / uops to execution units, for example on Intel CPUs.)
True dependencies aren't something you "solve" in the sense of making them go away, they're the essence of computation, the way multiple computations on the same numbers are glued together to form an algorithm. Other hazards (WAR and WAW) are just implementation details, reusing the same architectural register for something different.
Sometimes you can structure an algorithm to have shorter dependency chains, but once things are nailed down into machine code, the CPU pretty much just has to respect them, with at best out-of-order exec to overlap independent dep chains.
For loads, in theory there's value prediction, but AFAIK no real CPU is doing that. Mispredictions are expensive, just like for branches, which is why you'd only want to consider it for high-latency stuff like loads that miss in cache; otherwise the gains won't outweigh the costs. (Even then, it's not done, because the gains don't outweigh the costs even for loads, including the power / area cost of building a predictor.) As Paul Clayton points out, branch prediction is a form of value prediction (predicting the condition a branch depends on). The more instructions you can keep in flight at once with OoO exec, the more you stand to lose from mispredicts. Still, real CPUs do predict / speculate for memory disambiguation (whether a load reloads an earlier store to an unknown address or not), and (on CPUs like x86 with strongly-ordered memory models) they speculate that early loads will turn out to be allowed; and there's the well-known case of control dependencies (branch prediction + speculative execution).
The only thing that helps directly with dependency chains is keeping instruction latencies low. e.g. in the case of your example, #3 is just a register copy, which modern x86 CPUs can do with zero latency (mov-elimination), so another instruction dependent on R9 wouldn't have to wait an extra cycle beyond #2 producing a result, because it's handled during register renaming instead of by an execution unit reading an input and producing an output the normal way.
Obviously bypass forwarding from the outputs of execution units to the inputs of the same or others is essential to keep latency low, same as in an in-order classic RISC pipeline.
Or more conventionally, by improving execution units, like AMD Bulldozer family had 2-cycle latency for most SIMD integer instructions, but that improved to 1 cycle for AMD's next design, Zen. (Scalar integer stuff like add was always 1 cycle on any sane high-performance CPU.)
OoO exec with a large enough window size lets you overlap multiple dep chains in parallel (as in this experiment), and of course software should aim to have enough instruction-level parallelism (ILP) for the CPU to find and exploit. (See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an example of doing that by summing into multiple accumulators for a dot product.)
This is also useful for in-order CPUs if done statically by a compiler, where techniques like "software pipelining" are a big deal to overlap execution of multiple loop iterations because HW isn't finding that parallelism for you. Or for OoO exec CPUs with a limited window size, for loops with long but not loop-carried dependency chains within each iteration.
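As a language-agnostic illustration of the multiple-accumulator idea (Python here only to show the restructuring; the actual payoff comes in compiled code, where the CPU can overlap the chains), a dot product can be split into two independent partial sums, halving the length of the loop-carried dependency chain:

    def dot_product_two_accumulators(a, b):
        # Instead of one serial chain  acc += a[i]*b[i]  where every addition
        # depends on the previous one, keep two independent partial sums.
        acc0 = 0.0
        acc1 = 0.0
        for i in range(0, len(a) - 1, 2):
            acc0 += a[i] * b[i]          # chain 1
            acc1 += a[i + 1] * b[i + 1]  # chain 2, independent of chain 1
        if len(a) % 2:                   # odd-length tail element
            acc0 += a[-1] * b[-1]
        return acc0 + acc1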
As long as you're bottlenecked on something other than latency / dependency chains, true dependencies aren't the problem. e.g. a front-end bottleneck (ideally maxed out at the pipeline width), and/or all relevant back-end execution units being busy every cycle, means that you couldn't get more work through the pipeline even if it were independent.
Of course, in a lot of code there are enough dependencies, including through memory, to not reach that ideal situation.
Simultaneous Multithreading (SMT) can help to keep the back-end fed with work to do, increasing throughput by having the front-end of one physical core read multiple instruction streams, acting as multiple logical cores. This effectively creates ILP out of thread-level parallelism, which is useful if software can scale efficiently to more threads, exposing enough TLP to keep all the logical cores busy.
Related:
Modern Microprocessors A 90-Minute Guide!
How many CPU cycles are needed for each assembly instruction? - that's not how it works on superscalar OoO exec CPUs; latency or throughput or a front-end bottleneck might be the relevant thing.
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?

What do matrix clocks solve but vector clocks can't?

I understand the need for vector clocks: scalar logical clocks fail to provide enough information to tell whether there is an update conflict, for example in a key-value store update.
But I am not sure what problem is still unsolved by vector clocks and then solved by the bulkier matrix clocks?
In an eventually consistent environment, every message created by the system needs to be kept until every peer has received it (that is what eventual consistency means). But you don't want to keep messages forever, so you need a way to tell which messages have been received by all nodes and can be deleted; this is why you use matrix clocks.
A matrix clock is a list of vector clocks, so you know the current state of each node in the system. Based on this you know which peer has already received which messages. When you exchange messages with another node in the system, you compare the matrix clocks and always keep the highest value for each entry. Afterwards you can delete messages that were sent before that point, because the other nodes must already have received them.
This is a very brief description of the TSAE (timestamped anti-entropy) protocol. You can read more about it in the 1992 dissertation Weak-consistency group communication and membership by Richard Andrew Golding (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.7385&rep=rep1&type=pdf), starting from chapter 5.
The distinction among the Lamport clock (scalar logical clock, in your terms), the vector clock, and the matrix clock lies in the different levels of knowledge they represent.
For a vector clock $vt_i[1 \ldots n]$ at site $S_i$, the entry $vt_i[k]$ represents the knowledge that site $S_i$ has about site $S_k$. The knowledge has the form "$i$ knows that $k$ $\ldots$".
For a matrix clock $mt_i[1 \ldots n, 1 \ldots n]$ at site $S_i$, the entry $mt_i[k,l]$ represents the knowledge that site $S_i$ has about the knowledge that $S_k$ has about site $S_l$. The knowledge here has the form "$i$ knows that $k$ knows that $l$ $\ldots$".
Intuitively, we can do more things with more knowledge.
The following description is mainly quoted from [1]:
Vector clocks and matrix clocks are widely used in asynchronous distributed message-passing systems.
Some example areas using vector clocks are checkpointing, causal memory, maintaining consistency of replicated files, global snapshot, global time approximation, termination detection, bounded multiwriter construction of shared variables, mutual exclusion and debugging (predicate detection).
Some example areas that use matrix clocks are designing fault-tolerant protocols and distributed database protocols, including protocols to discard obsolete information in distributed databases, and protocols to solve the replicated log and replicated dictionary problems.
For the matrix clock, we notice that
$\min_k(mt_i[k,i]) \ge t$ means that site $S_i$ knows that every other site $S_k$ knows $S_i$'s progress up to its local time $t$.
It is this property that allows a site to stop sending information with a local time $\le t$, or to discard such obsolete information.
[1] Ajay D. Kshemkalyani. Concurrent Knowledge and Logical Clock Abstractions. 2000.
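To make the matrix-clock mechanics above concrete, here is a small Python sketch (the names and merge details are my own simplification, not taken from [1] or the TSAE paper): it shows the element-wise merge on message exchange and the $\min_k(mt_i[k,i]) \ge t$ test used to discard obsolete information.

    class MatrixClock:
        # Minimal illustrative matrix clock for n sites.
        # mt[k][l] = what this site believes site k knows about site l's progress.

        def __init__(self, n, my_id):
            self.n, self.me = n, my_id
            self.mt = [[0] * n for _ in range(n)]

        def local_event(self):
            # Row `me` is this site's own vector clock; advance our own progress.
            self.mt[self.me][self.me] += 1

        def merge(self, other_mt, sender_id):
            # On message exchange, keep the element-wise maximum...
            for k in range(self.n):
                for l in range(self.n):
                    self.mt[k][l] = max(self.mt[k][l], other_mt[k][l])
            # ...and fold the sender's own row into ours: we now know what it knew.
            for l in range(self.n):
                self.mt[self.me][l] = max(self.mt[self.me][l], other_mt[sender_id][l])

        def safe_to_discard(self, t):
            # min_k mt[k][me] >= t: every site is known to have seen our progress
            # up to local time t, so information stamped <= t is obsolete.
            return min(self.mt[k][self.me] for k in range(self.n)) >= t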

Existence of a 0- and 1-valent configurations in the proof of FLP impossibility result

In the well-known paper Impossibility of Distributed Consensus with One Faulty Process (JACM 1985), FLP (Fischer, Lynch and Paterson) proved the surprising result that no completely asynchronous consensus protocol can tolerate even a single unannounced process death.
In Lemma 3, after showing that D contains both 0-valent and 1-valent configurations, it says:
Call two configurations neighbors if one results from the other in a single step. By an easy induction, there exist neighbors C₀, C₁ ∈ C such that Dᵢ = e(Cᵢ) is i-valent, i = 0, 1.
I can follow the whole proof except when they claim the existence of such C₀ and C₁. Could you please give me some hints?
D (the set of possible configurations after applying e to elements of C) contains both 0-valent and 1-valent configurations (and is assumed to contain no bivalent configurations).
That is, e maps every element in C to either a 0-valent or a 1-valent configuration. By definition of C, there must be a root element that is connected to all other elements by a series of "neighbour" relationships, so there must be a boundary point where an element of C that leads to a 0-valent configuration after e is a neighbour of an element of C that leads to a 1-valent configuration after e.
I once went down the path of reading all these papers only to discover it's a complete waste of time.
The result is not surprising at all.
The paper you mention, "Impossibility of Distributed Consensus with One Faulty Process", is a long list of complex mathematical proofs that simply equate to:
1) Consensus is a deterministic state
2) one (or more) faulty systems within an environment makes it a non-deterministic environment
3) No deterministic state, action or outcome can ever be reached within a non-deterministic environment.
The end. No further thought is required.
This is how it works in the real world outside of academia.
If you wish for agents to reach consensus, then synchronous (timing-model) approximation constructs have to be added to make the environment deterministic within a given set of constraints - for example, simple constructs like timeouts, Ack/Nack, handshakes and witnesses, or far more complex constructs.
The closer you wish to get to a synchronous deterministic model, the more complex the constructs become. A hypothetical fully synchronous model would have infinitely complex constructs. Bear in mind also that a fully deterministic synchronous model can never be achieved in a non-trivial distributed system, because in any non-trivial, dynamic, multivariate system with a variable initial state there exists an infinite number of possible states, actions and outcomes at any point in time (see chaos theory).
Consider the complexity of a construct for detecting TCP packets dropped because of buffer overflow errors in a router at hop number 21 - and then the complexity of detecting the same buffer overflow error dropping the detection signal from the construct itself.
Define a mapping f such that f(C) = 0 if e(C) is 0-valent, and f(C) = 1 if e(C) is 1-valent.
Because e(C) cannot be bivalent (we assume that D has no bivalent configuration), f(C) can only be either 0 or 1.
Arrange the configurations reachable from the initial bivalent configuration in a tree; there must be two neighbors C0, C1 in the tree such that f(C0) != f(C1). If not, all f(C) would be the same, which would mean that D contains only 0-valent configurations or only 1-valent configurations, contradicting the fact that D contains both.
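The neighbor-boundary step can also be seen mechanically: walk any path of neighboring configurations whose two endpoints get different labels under f, and the label must flip somewhere along the way. A tiny Python sketch of that argument (f stands for the labeling function defined above):

    def find_boundary_pair(path, f):
        # path: a list of configurations in which consecutive elements are
        # neighbors, with f(path[0]) != f(path[-1]) (0-valent vs 1-valent after e).
        assert f(path[0]) != f(path[-1])
        for a, b in zip(path, path[1:]):
            if f(a) != f(b):
                # Return (C0, C1) so that f(C0) == 0 and f(C1) == 1.
                return (a, b) if f(a) == 0 else (b, a)
        # Unreachable: if every adjacent pair had equal labels, the endpoints
        # would have equal labels too, contradicting the assertion above.
        raise AssertionError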

Difference between Latency and Jitter in Operating-Systems

When discussing criteria for operating systems, I keep hearing about interrupt latency and OS jitter, and now I ask myself: what is the difference between the two?
In my opinion, the interrupt latency is the delay from the occurrence of an interrupt until the interrupt service routine (ISR) is entered.
In contrast, jitter is how much the moment of entering the ISR varies over time.
Is that how you understand it as well?
Your understanding is basically correct.
Latency = Delay between an event happening in the real world and code responding to the event.
Jitter = Differences in Latencies between two or more events.
In the realm of clustered computing, especially when dealing with massive scale out solutions, there are cases where work distributed across many systems (and many many processor cores) needs to complete in fairly predictable time-frames. An operating system, and the software stack being leveraged, can introduce some variability in the run-times of these "chunks" of work. This variability is often referred to as "OS Jitter". link
Interrupt latency, as you said, is the time between the interrupt signal and entry into the interrupt handler.
The two concepts are orthogonal to each other. In practice, however, more interrupts generally implies more OS jitter.
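A small numerical illustration of the two definitions (the numbers are invented, not measurements): latency is a per-event quantity, while jitter is the spread across events.

    # Hypothetical timestamps in microseconds: when each interrupt fired and
    # when its service routine (ISR) actually started running.
    event_times   = [100, 300, 500, 700]
    handler_times = [112, 309, 520, 711]

    # Latency: delay between each real-world event and the code responding to it.
    latencies = [h - e for e, h in zip(event_times, handler_times)]
    print(latencies)   # [12, 9, 20, 11]

    # Jitter: the variation in those latencies across events, here expressed as
    # the peak-to-peak spread (other definitions use the standard deviation).
    print(max(latencies) - min(latencies))   # 11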