Are users punished for bidding for wrong fork in Ouroboros algorithm? - cardano

I am confused by the description of PoS algorithm here https://hackernoon.com/a-hitchhikers-guide-to-consensus-algorithms-d81aae3eb0e3
In PoS, the blocks aren’t created by miners doing work, but by minters
staking their tokens to “bet” on which blocks are valid. In the case
of a fork, minters spend their tokens voting on which fork to support.
Assuming most people vote on the correct fork, validators who voted on
the wrong fork would “lose their stake” in the correct one.
Is this how Ouroboros algorithm works?

No.
A user's stake is not directly affected by the staking process in any variant of the Ouroboros protocol. In practice, if a user extends the "wrong fork", they simply end up not getting any rewards for that block down the line.
Slashing algorithms are not necessary for Ouroboros, as it employs cryptography and probabilistic analysis to rule out the attacks it is designed to prevent.
Even if it were necessary, slashing typically comes in the form of punishing provably bad behaviour, not honest "mistakes" (of which extending a shorter chain is one). Specifically, the variants I've seen punish users who create two blocks at the same point, i.e. who actively fork the chain.

In the Ouroboros Proof of Stake protocol there is no penalty for betting on the wrong arm of a fork, and this applies not only to deliberate, malicious fork attempts.
In Ouroboros it happens constantly, and "by design", that two block producers each produce a block for the same slot. This happens because slots are allocated based on mathematical probabilities and the weight of the stake delegations. You can think of it as rolling dice: a small pool must roll a 1 to become a slot leader, while a larger pool is entitled to fill a slot if it rolls a 1 or a 2 (the real probabilities are of course much lower than this six-sided-die example suggests).
Each pool constantly evaluates its own secret VRF key and thus determines independently when it is slot leader (on the Cardano mainnet, which uses Ouroboros PoS, this is a 2-second cycle). When a new block arrives, all other pools can check with the public VRF key whether it was produced by a legitimate slot leader.
Now, in some cases (about 5% of all blocks), chance makes two or even more stake pools legitimate slot leaders for the same slot. So several legitimate blocks are created without any bad intention at all.
However, Cardano's Ouroboros implementation has a relatively fair way of resolving such a fork quickly: the winner is the block whose creator rolled the lowest number. This is a slight advantage for small pools, because they essentially have to roll low to become slot leader at all. But on average a pool loses about half of these random collisions, i.e. roughly 2.5% of all the blocks it generates, in these so-called slot battles.
However, it is important that every node in the network can immediately determine the correct valid block, without further data or time loss, and that there is consensus on this. So these by-design forks can be resolved without much impact, and there is no need at all to punish anyone for them, least of all by slashing the participating tokens.
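To make the dice/VRF picture above concrete, here is a toy sketch in Python. It is not the real Praos code: the secret VRF evaluation is replaced by a seeded random number, keys and proofs are omitted, and the pool names and stakes are invented (and exaggerated so that slot battles actually show up). The threshold phi(sigma) = 1 - (1 - f)^sigma is the stake-weighted leader check used in Ouroboros Praos, and the tie-break simply picks the lowest "roll", as described above.

import random

# Toy model of stake-weighted slot leadership and "slot battles".
# Not the real Praos implementation: the VRF evaluation is replaced by a
# seeded random number, and keys/proofs are omitted entirely.
ACTIVE_SLOT_COEFF = 0.05     # f: target fraction of slots that get a leader

def leader_threshold(relative_stake):
    # Praos-style check: phi(sigma) = 1 - (1 - f)^sigma
    return 1.0 - (1.0 - ACTIVE_SLOT_COEFF) ** relative_stake

def vrf_roll(pool, slot):
    # Stand-in for evaluating the pool's secret VRF key on the slot number.
    return random.Random(hash((pool, slot))).random()

pools = {"small-pool": 0.2, "big-pool": 0.4}   # exaggerated relative stakes

blocks = {pool: 0 for pool in pools}
battles = 0

for slot in range(200_000):
    leaders = []
    for pool, stake in pools.items():
        roll = vrf_roll(pool, slot)
        if roll < leader_threshold(stake):
            leaders.append((roll, pool))
    if leaders:
        roll, winner = min(leaders)    # slot battle: the lowest "roll" wins
        blocks[winner] += 1
        if len(leaders) > 1:
            battles += 1

print(blocks, "slot battles:", battles)

With these made-up stakes the larger pool ends up with roughly twice as many blocks, and the occasional slot battle is resolved instantly by comparing the two rolls; nobody's stake is touched.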

Related

How to distribute tasks between servers where each task must be done by only one server?

Goal: There are X number backend servers. There are Y number of tasks. Each task must be done only by one server. The same task ran by two different servers should not happen.
There are tasks which include continuous work for an indefinite amount of time, such as polling for data. The same server can keep doing such a task as long as the server stays alive.
Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?
Well, the way you define your problem makes it hard to reason about. What you are actually looking for is called a "distributed lock".
Let's start with a simpler problem: assume you have only two concurrent servers S1, S2 and a single task T. The safety property you stated remains as is: at no point in time both S1 and S2 may process task T. How could that be achieved? The following strategies come to mind:
Implement an algorithm that deterministically maps each task to a responsible server. For example, it could be as simple as if task.name.contains('foo') then server1.process(task) else server2.process(task). That works, and indeed might fit some real-world requirements out there, yet such an approach is a dead end: a) you have to know up front, statically, how many servers you will have, and b) - the most dangerous part - you cannot tolerate either server being down: if, say, S1 is taken offline, then there is nothing you can do with T right now except wait for S1 to come back online. These drawbacks can be softened and optimized around, yet there is no way to get rid of them; escaping these deficiencies requires a more dynamic approach.
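As a concrete illustration of this first, static strategy, here is a minimal sketch (the server names and the hashing choice are just made up for the example). It shows both the appeal (no coordination at all) and the drawback (the server list is baked in):

import hashlib

# Static task -> server mapping: fully deterministic, no coordination,
# but the server list must be known up front and a dead server's tasks
# cannot be re-routed.
SERVERS = ["server-1", "server-2"]          # fixed, known statically

def responsible_server(task_name):
    digest = hashlib.sha256(task_name.encode()).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]

print(responsible_server("poll-orders-feed"))   # always the same answer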
Implement an algorithm that allows S1 and S2 to agree upon who is responsible for T. Basically, you want both S1 and S2 to come to a consensus about the value of an (assumed, not necessarily actually needed) T.is_processed_by = "S1" or T.is_processed_by = "S2" property. Then your requirement translates to "at any point in time is_processed_by is seen by both servers in the same way". Hence "consensus": an agreement (between the servers) about the is_processed_by value. Having that eliminates all the "too static" issues of the previous strategy: you are no longer bound to 2 servers, you could have n, n > 1 servers (provided that your distributed consensus works for the chosen n). However, this is not yet prepared for accidents like an unexpected power outage: it could be that S1 won the competition, is_processed_by became equal to "S1", S2 agreed with that and... S1 went down and did nothing useful...
...so you're missing the last bit: the "liveness" property. In simple words, you'd like your system to continuously make progress whenever possible. To achieve that property - among many other things I am not mentioning - you have to make sure that a server's spontaneous death is monitored and, once it happens, no task T stays stuck indefinitely. How do you achieve that? That's another story; a typical practical solution is to copy the good old TCP way of doing essentially the same thing: meet the keepalive approach.
OK, let's conclude what we have by now:
Take any implementation of "distributed locking", which is equivalent to "distributed consensus". It could be ZooKeeper done correctly, a PostgreSQL instance running serializable transactions, or anything alike.
For each unprocessed or stuck task T in your system, have all the free servers S race for that lock. Exactly one of them is guaranteed to win, and all the rest will surely lose.
Frequently enough, push some sort of TCP-style keepalive notification per processing task or, at least, per alive server. Missing, let's say, three notifications in a row should be treated as the server's death, and all of its tasks should be re-marked as "stuck" and (eventually) reprocessed in the previous step.
And that's it.
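Here is a minimal, single-process sketch of that recipe, using sqlite3 from the standard library as a stand-in for the shared lock store. A real deployment would run the same kind of statements against a shared PostgreSQL/ZooKeeper-like service; the table and column names are invented for the example.

import sqlite3, time

LEASE_SECONDS = 30

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tasks (name TEXT PRIMARY KEY, owner TEXT, lease_until REAL)")
db.execute("INSERT INTO tasks VALUES ('poll-feed', NULL, 0)")

def try_acquire(server, task):
    # Atomically claim the task only if it is free or its lease has expired.
    now = time.time()
    cur = db.execute(
        "UPDATE tasks SET owner = ?, lease_until = ? "
        "WHERE name = ? AND (owner IS NULL OR lease_until < ?)",
        (server, now + LEASE_SECONDS, task, now))
    db.commit()
    return cur.rowcount == 1          # exactly one racer wins

def keepalive(server, task):
    # The TCP-style heartbeat: extend the lease while still alive.
    db.execute("UPDATE tasks SET lease_until = ? WHERE name = ? AND owner = ?",
               (time.time() + LEASE_SECONDS, task, server))
    db.commit()

print(try_acquire("S1", "poll-feed"))   # True: S1 wins the race
print(try_acquire("S2", "poll-feed"))   # False: the lease is held by S1
# If S1 dies and stops calling keepalive(), the lease expires and a later
# try_acquire("S2", "poll-feed") succeeds, which re-runs the stuck task.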
P.S. Safety & liveness properties are something you'd definitely want to be aware of once it comes to distributed computing.
Try RabbitMQ worker queues:
https://www.rabbitmq.com/tutorials/tutorial-two-python.html
It has an acknowledgement feature, so if a task fails or the server crashes, the broker will automatically redeliver your task. Based on your specific use case you can set up retries, etc.
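A minimal worker along the lines of that tutorial, assuming the pika client and a broker on localhost (the queue name and the do_work body are placeholders): the message is only acknowledged after the work is done, so a crash before the ack makes the broker redeliver the task to another worker.

import pika

def do_work(body):
    print("processing", body)          # placeholder for the real task

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="task_queue", durable=True)
channel.basic_qos(prefetch_count=1)    # at most one unacked task per worker

def handle(ch, method, properties, body):
    do_work(body)
    # No ack is sent if do_work() raises or the process dies, so the
    # broker re-queues the message for another worker.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="task_queue", on_message_callback=handle)
channel.start_consuming()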
"Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?"
You are getting into a well-known problem in distributed systems: how does a system make decisions when it is partitioned? Let me elaborate on this.
A simple statement like "the server dies" requires quite a deep dive into what it actually means. Did the server lose power? Is the network between your control plane and the server down (while the task keeps running)? Or maybe the task completed successfully, but the failure happened just before the task server was about to report it? If you want to be 100% correct in deciding the current state of the system, that is the same as saying the system has to be 100% consistent.
This is where the CAP theorem (https://en.wikipedia.org/wiki/CAP_theorem) comes into play. Since your system may be partitioned at any time (a worker server may get disconnected or die - which looks the same from the outside) and you want to be 100% correct/consistent, the system cannot be 100% available.
To reiterate the previous paragraph: if the system suspects a task server is down, the system as a whole has to come to a stop until it can determine what happened to that particular task server.
The trade-off between consistency and availability is at the core of distributed systems. Since you want to be 100% correct, you won't have 100% availability.
While availability is not 100%, you can still improve the system to make it as available as possible. Several approaches may help with that.
The simplest one is to alert a human when the system suspects a server is down. The human gets a notification (24/7), wakes up, logs in and manually checks what is going on. Whether this approach works for your case depends on how much availability you need, but it is completely legitimate and widely used in the industry (those are the engineers carrying pagers).
A more complicated approach is to let the system fail over to another task server automatically, if that is possible. A few options are available here, depending on the type of task.
The first type of task is re-runnable, but must exist as a single instance. In this case, the system uses the "STONITH" (shoot the other node in the head) technique to make sure the previous node is dead for good. For example, in a cloud the system would actually kill the whole container of the task server and then start a new container as a failover.
The second type of task is not re-runnable. For example, a task that transfers money from account A to account B is not (automatically) re-runnable: the system does not know whether the task failed before or after the money was moved. Hence, the failover needs to do extra steps to work out the outcome, which may also be impossible if the network is not working correctly. In such cases the system usually halts until it can make a 100% correct decision.
None of these options will give 100% availability, but they can do as well as possible given the nature of distributed systems.

What happens with nested branches and speculative execution?

Alright, so I know that if a particular conditional branch has a condition that takes time to compute (memory access, for instance), the CPU assumes a condition result and speculatively executes along that path. However, what would happen if, along that path, yet another slow conditional branch pops up (assuming, of course, that the first condition hasn't been resolved yet and the CPU can't just commit the changes)? Does the CPU just speculate inside the speculation? What happens if the last condition is mispredicted but the first wasn't? Does it just rollback all the way?
I'm talking about something like this:
if (value_in_memory == y) {
    // computations
    if (another_val_memory == x) {
        // computations
    }
}
Speculative execution is the regular state of execution, not a special mode that an out of order CPU enters when it sees a branch and then leaves when the branch is no longer in flight.
This is easier to see if you consider that it's not just branches that can be mispredicted: many instructions can fault, including those that access memory, those with restrictions on their input values, etc. So any substantial out-of-order execution implies constant speculation, and CPUs are built around that idea.
So "nested branches" doesn't end up being special in that sense.
Now, modern CPUs have a variety of methods for quick branch misprediction recovery, faster than recovery from other types of faults1. For example they may snapshot the state of the register mapping at some branches, to allow recovery to start before the branch is at the head of the reorder buffer. Since it is not always feasible to snapshot at all branches, there might be complicated heuristics involved to decide where to take snapshots.
I mention this last part because it is one way in which nested branches might matter: when there are lots of branches in flight, you might hit some microarchitectural limits related to the tracking of these branches for recovery purposes. For more details, you can look through patents for "branch order buffer" (for Intel techniques, but there are no doubt others).
1 The basic recovery method is keep executing until the faulting instruction is the next to retire, and then throw away all younger instructions. In the context of branch mispredictions, this means you could actually suffer two or more mispredictions only the oldest of which actually takes effect: e.g., a younger branch mispredicts, and while executing up to that branch (at which point recovery can occur), another mispredict occurs, so the younger one ends up getting discarded.
(Maybe not a complete answer, but I had some of this written when @BeeOnRope posted an answer. Posting this anyway for some more links and technical details in case anyone's curious.)
Everything is always speculative until it reaches retirement and becomes non-speculative, definitely happened, part of the architectural state.
e.g. any load might fault with a bad address, any div might trap on divide by zero. See also Out-of-order execution vs. speculative execution. That and What exactly happens when a skylake CPU mispredicts a branch? mention that branch mispredicts are handled specially, because they're expected to be frequent. Fast recovery can start before a mis-predicted branch reaches retirement, unlike the behaviour for a faulting load, for example. (That's part of why Meltdown is exploitable.)
So even "regular" instructions are executed speculatively before being committed, and the only distinction between them is a human-made one, not something the hardware distinguishes? I presume, then, that the CPU stores multiple possible rollback points? For instance, if I have load instructions inside a conditional branch that may lead to page faults or simply use stale values, does the CPU identify such instructions and scenarios and save a state for each of them? I feel like I've misunderstood, because this seems to require storing a lot of register state and tracking complicated dependencies.
The retirement state is always consistent, so you can always roll back to there and discard all in-flight work, e.g. if an external interrupt arrives you want to handle it without waiting for a chain of a dozen cache-miss loads to all execute. See also When an interrupt occurs, what happens to instructions in the pipeline?
This tracking basically happens for free or is something you need to do anyway to be able to detect which instruction faulted, not just that there was a problem somewhere. (This is called "precise exceptions")
The real distinction humans can usefully make is speculation that has a real chance of being wrong during execution of non-error cases. If your code gets a bad pointer, it doesn't really matter how it performs; it's going to page-fault and that's going to be very slow compared to local OoO exec details.
You're talking about a modern out-of-order (OoO) execution (not just fetch) CPU, like modern Intel or AMD x86, high-end ARM, MIPS r10000, etc.
The front-end is in-order (with speculation down predicted paths), and so is commit (aka retirement) from the out-of-order back-end into non-speculative retirement state (aka known-good architectural state).
The CPU uses two major structures to track instructions (or on x86, uops = parts of instructions) in the back-end. The last stage of the front-end (after fetch / decode) allocates/renames instructions and adds them into both of these structures at once.
RS = Reservation Station = scheduler: not-yet-executed instructions, waiting for an execution unit. The RS tracks dependencies and sends the oldest-ready uops to execution units that are ready.
ROB = ReOrder Buffer: not-yet-retired instructions. Instructions enter and leave in-order so it can just be a circular buffer.
Includes a flag to mark each entry as executed or not, set once the RS has sent it to an execution unit which reports success. The oldest instructions in the ROB that all have their done-executing bit set can "retire".
Also includes a flag which indicates "fault if this reaches retirement". This avoids spending time handling page faults from a load instruction on the wrong path of execution (which might well have a pointer into an unmapped page), for example, either in the shadow of a branch mispredict or just after another instruction (in program order) that should have faulted first but that OoO exec got to later.
(I'm also leaving out register renaming onto a large physical register file; that's the "rename" part. Allocate includes choosing which execution port an instruction will use, and reserving a load or store buffer entry for memory instructions.)
(There's also a store-buffer; stores don't write directly to L1d cache, they write to the store buffer. This makes it possible to speculatively execute stores and still roll back without them becoming visible to other cores. It also decouples cache-miss stores from execution. Once a store instruction retires, the store-buffer entry "graduates" and is eligible to commit to L1d cache, once MESI gets exclusive access to the cache line, and once memory-ordering rules are satisfied.)
Execution units detect whether an instruction should fault, or was mis-speculated and should roll back, but don't necessarily act on that until the instruction reaches retirement.
In-order retirement is the step that recovers program-order after OoO exec, including the case of exceptions of mis-speculation.
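A deliberately simplified toy model of that in-order retirement, just to make the flag-per-entry idea concrete (real hardware does this with parallel logic and the fast-recovery paths described in the other answer, not a loop):

from collections import deque

class RobEntry:
    def __init__(self, name):
        self.name = name
        self.executed = False          # set when an execution unit finishes it
        self.fault_on_retire = False   # bad load address, mispredicted branch, ...

rob = deque(RobEntry(n) for n in ["load A", "branch B", "add C", "store D"])

# Out-of-order back-end: instructions finish in any order...
rob[2].executed = True                 # "add C" finishes first
rob[0].executed = True
rob[1].executed = True
rob[1].fault_on_retire = True          # "branch B" turns out mispredicted

# ...but retirement is strictly in program order.
while rob and rob[0].executed:
    head = rob.popleft()
    if head.fault_on_retire:
        print(f"{head.name}: fault/mispredict at retire, discard younger:",
              [e.name for e in rob])
        rob.clear()                    # throw away all younger instructions
        break
    print(f"{head.name}: retired into architectural state")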
Terminology: Intel calls it "issue" when instructions are sent from the front-end into the ROB + RS. Other computer architecture people often call that "dispatch".
Sending uops from the RS to execution units is called "dispatch" by Intel, "issue" by other people.

Why do longer pipelines make a single delay slot insufficient?

I read the following statement in Patterson & Hennessy's Computer Organization and Design textbook:
As processors go to both longer pipelines and issuing multiple instructions per clock cycle, the branch delay becomes longer, and a single delay slot is insufficient.
I can understand why "issuing multiple instructions per clock cycle" can make a single delay slot insufficient, but I don't know why "longer pipelines" cause it.
Also, I do not understand why longer pipelines cause the branch delay to become longer. Even with longer pipelines (more steps to finish one instruction), there's no guarantee that the cycle time will increase, so why would the branch delay increase?
If you add any stages before the stage that detects branches (and evaluates taken/not-taken for conditional branches), 1 delay slot no longer hides the "latency" between the branch entering the first stage of the pipeline and the correct program-counter address after the branch being known.
The first fetch stage needs info from later in the pipeline to know what to fetch next, because it doesn't itself detect branches. For example, in superscalar CPUs with branch prediction, they need to predict which block of instructions to fetch next, separately and earlier from predicting which way a branch goes after it's already decoded.
1 delay slot is only sufficient in MIPS I because branch conditions are evaluated in the first half of a clock cycle in EX, in time to forward to the 2nd half of IF which doesn't need a fetch address until then. (Original MIPS is a classic 5-stage RISC: IF ID EX MEM WB.) See Wikipedia's article on the classic RISC pipeline for much more details, specifically the control hazards section.
That's why MIPS is limited to simple conditions like beq (find any mismatches from an XOR), or bltz (sign bit check). It cannot do anything that requires an adder for carry propagation (so a general blt between two registers is only a pseudo-instruction).
This is very restrictive: it leaves no room for a longer front-end that could absorb the latency of a larger / more associative L1 instruction cache that takes more than half a cycle to respond on a hit. (MIPS I decode is very simple, though, with the instruction format intentionally designed so machine-code bits can be wired directly as internal control signals. So you can maybe make decode the "half cycle" stage, with fetch getting 1 full cycle, but even 1 cycle is still low with shorter cycle times at higher clock speeds.)
Raising the clock speed might require adding another fetch stage. Decode does have to detect data hazards and set up bypass forwarding; original MIPS kept that simpler by not detecting load-use hazards, instead requiring software to respect a load-delay slot until MIPS II. A superscalar CPU has many more possible hazards, even with 1-cycle ALU latency, so detecting what has to forward to what requires more complex logic for matching destination registers of older instructions against the sources of younger instructions.
A superscalar pipeline might even want some buffering in instruction fetch to avoid bubbles. A multi-ported register file might be slightly slower to read, maybe requiring an extra decode pipeline stage, although probably that can still be done in 1 cycle.
So, as well as making 1 branch delay slot insufficient by the very nature of superscalar execution, a longer pipeline also increases branch latency, if the extra stages are between fetch and branch resolution. e.g. an extra fetch stage and a 2-wide pipeline could have 4 instructions in flight after a branch instead of 1.
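A quick back-of-the-envelope helper for that last claim (the numbers are just the examples from the text, not measurements of any real CPU):

# Roughly: wasted fetch slots = issue width * cycles until the branch
# resolves, minus whatever a fixed number of delay slots can absorb.
def wasted_slots(issue_width, cycles_to_resolve, delay_slots):
    in_flight = issue_width * cycles_to_resolve
    return max(0, in_flight - delay_slots)

# Classic MIPS I: 1-wide, 1 cycle of fetch after the branch, 1 delay slot.
print(wasted_slots(issue_width=1, cycles_to_resolve=1, delay_slots=1))  # 0

# One extra fetch stage and a 2-wide pipeline: 4 instructions in flight,
# so a single delay slot leaves 3 of them wasted on a taken branch.
print(wasted_slots(issue_width=2, cycles_to_resolve=2, delay_slots=1))  # 3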
But instead of introducing more branch delay slots to hide this branch delay, the actual solution is branch prediction. (However some DSPs or high performance microcontrollers do have 2 or even 3 branch delay slots.)
Branch-delay slots complicate exception handling; you need a fault-return and a next-after-that address, in case the fault was in a delay slot of a taken branch.

Akka and state among actors in cluster

I am working on my bachelor's thesis project, which should be a Minecraft server written in Scala and Akka. The server should be easily deployable in the cloud or onto a cluster (not sure whether I use the proper terminology... it should run on multiple nodes). I am, however, a newbie in Akka and I have been wondering how to implement such a thing. The problem I'm trying to figure out right now is how to share state among actors on different nodes.

My first idea was to have a Camel actor that would read the TCP stream from Minecraft clients and then send it to a load balancer, which would select a node to process the request and then send some response to the client via TCP. Let's say I have an actor implementing an AuthenticationService that checks whether the credentials provided by a user are valid. Every node would have such an actor (or perhaps more of them), and all the actors should have exactly the same database (or state) of users at all times.

My question is: what is the best approach to keep this state? I have come up with some solutions I could think of, but I haven't done anything like this, so please point out the flaws:
Solution #1: Keep state in a database. This would probably work very well for this authentication example, where state is only something like a list of usernames and passwords, but it probably wouldn't work in cases where state contains objects that can't easily be broken down into integers and strings.
Solution #2: Every time there is a request to a certain actor that would change its state, the actor will, after processing the request, broadcast information about the change to all other actors of the same type, which would then change their state according to the info sent by the original actor. This seems very inefficient and rather clumsy.
Solution #3: Have a certain node serve as a sort of state node, containing actors that represent the state of the entire server. Any actor outside that node would have no state and would ask the actors in the "state node" every time it needed some data. This also seems inefficient and not very fault-tolerant.
So there you have it. The only solution I actually like is the first one, but like I said, it probably works for only a very limited subset of problems (when state can be broken down into Redis-like structures). Any response from more experienced gurus would be very appreciated.
Regards, Tomas Herman
Solution #1 could possibly be slow. Also, it is a bottleneck and a single point of failure (meaning the application stops working if the node with the database fails). Solution #3 has similar problems.
Solution #2 is less trivial than it seems. First, it is a single point of failure. Second, there are no atomicity or other ordering guarantees (such as regularity) for reads or writes, unless you do a total order broadcast (which is more expensive than a regular broadcast). In fact, most distributed register algorithms will do broadcasts under-the-hood, so, while inefficient, it may be necessary.
From what you've described, you need atomicity for your distributed register. What do I mean by atomicity? Atomicity means that any read or write in a sequence of concurrent reads and writes appears as if it occurs at a single point in time.
Informally, in Solution #2 with a single actor holding a register, this guarantees that if 2 subsequent writes W1 and then W2 to the register occur (meaning 2 broadcasts), then no other actor reading the values from the register will read them in an order different from first W1 and then W2 (it's actually more involved than that). If you go through a couple of examples of subsequent broadcasts where the messages arrive at their destinations at different points in time, you will see that such an ordering property isn't guaranteed at all.
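Here is a tiny simulation of exactly that situation: two broadcasts W1 and then W2, each delivered to two readers with independent (made-up) network delays. Delivery order at a reader is by arrival time, so one reader observes W2 before W1.

import heapq

sends = [
    # (send_time, write, per-reader delivery delays)
    (0.0, "W1", {"R1": 5.0, "R2": 1.0}),
    (1.0, "W2", {"R1": 1.0, "R2": 1.0}),
]

deliveries = []
for send_time, write, delays in sends:
    for reader, delay in delays.items():
        heapq.heappush(deliveries, (send_time + delay, reader, write))

seen = {"R1": [], "R2": []}
while deliveries:
    _, reader, write = heapq.heappop(deliveries)
    seen[reader].append(write)

print(seen)   # {'R1': ['W2', 'W1'], 'R2': ['W1', 'W2']} -- R1 sees them reversed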
If ordering guarantees or atomicity aren't an issue, some sort of a gossip-based algorithm might do the trick to slowly propagate changes to all the nodes. This probably wouldn't be very helpful in your example.
If you want full fault-tolerance and atomicity, I recommend reading this book on reliable distributed programming by Rachid Guerraoui and Luís Rodrigues, or at least the parts related to distributed register abstractions. These algorithms are built on top of a message-passing communication layer and maintain a distributed register supporting read and write operations. You can use such an algorithm to store distributed state information. However, they aren't applicable to thousands of nodes or large clusters because they do not scale, typically having complexity polynomial in the number of nodes.
On the other hand, you may not need the state of the distributed register replicated across all of the nodes: replicating it across a subset of your nodes (instead of just one node) and accessing those nodes to read or write from it still provides a certain level of fault-tolerance (the register information is lost only if the entire subset of nodes fails). You can possibly adapt the algorithms in the book to serve this purpose.
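As a very condensed sketch of that "replicate across a subset" idea: a versioned value written to and read from a majority of replicas, loosely in the spirit of the register algorithms in that book. Failure handling, concurrency control and the write-back ("read repair") phase of the real algorithms are all omitted, and the node names are invented.

# In-memory stand-in for three replica nodes: (version, value) pairs.
REPLICAS = {"node-1": (0, None), "node-2": (0, None), "node-3": (0, None)}
MAJORITY = len(REPLICAS) // 2 + 1

def write(version, value, reachable):
    assert len(reachable) >= MAJORITY, "write needs a majority"
    for node in reachable:
        REPLICAS[node] = (version, value)

def read(reachable):
    assert len(reachable) >= MAJORITY, "read needs a majority"
    replies = [REPLICAS[node] for node in reachable]
    return max(replies)[1]              # value carrying the highest version

write(1, "alice:pw-hash", reachable=["node-1", "node-2"])   # node-3 is down
print(read(reachable=["node-2", "node-3"]))                 # still returns it

Any majority read overlaps any majority write in at least one node, which is why the read above still sees the value even though it contacts a different pair of nodes.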

Howto design a clock driven multi-agent simulation

I want to create a multi-agent simulation model of a real-world manufacturing process to evaluate some dispatching rules. The simulation needs to produce event logs so that the time effect of the dispatching rules can be evaluated against the real manufacturing event logs.
How can I incorporate the 'current simulation time' into this kind of multi-agent, message passing intensive simulation?
Background:
The classical discrete-event simulation (which handles time advancement nicely) cannot be applied here, as the agents in the system exhibit relatively complex behavior and routing requirements, plus the dispatching rules require them to communicate frequently. This and other process complexities rule out a centralized scheduling approach as well.
In manufacturing science there are thousands of papers that use a multi-agent simulation to solve some manufacturing-related problem. However, I haven't yet found a paper that describes the inner workings or implementation details of these simulations in the required detail.
Unfortunately, using the shortest process time for discrete time stepping might be infeasible, as process times range from 0.1 s to 24 hours. There is a possibility my simulation will later be used for what-if evaluations in a project, so the simulation needs to run as fast as possible - overnight simulation runs are not an option.
The problem size is about 500 resources and 1,000 - 10,000 product agents, most of which are finished and not participating in any further communication or resource occupation.
Consequently, as a result of this communication, new events can trigger an agent to do something before its original 'next time' event arrives. For example, an agent is currently blocking a resource for an hour, but another, higher-priority agent needs that resource right away and asks the first agent to release it.
In some sense, I need a way to create a hybrid of classical message passing agent-simulation and the discrete event simulation.
I considered a mediator agent that is involved in every message - a message router and time enforcer which sends around the messages and the timer tick events, and which also keeps a list of next event times for the various agents. However, I feel there should be a better way to solve my problem, as this concept puts enormous pressure on the mediator agent.
Update
It took a while, but it seems I managed to create a mini-framework that combines the DES and agent concepts into one. I'm sure it's nothing new, but at least it's unique: http://code.google.com/p/tidra-framework/ if you are interested.
This problem sounds as if it should be tackled by using parallel discrete-event simulation - the mediator agent you are planning to implement ('is involved in every message', 'sends around messages and timer tick events') seems to be doing the job of a discrete-event simulator right now. You can make this scale to the desired problem size by using more of such simulators in parallel and then use a synchronization algorithm to maintain causality etc. (see, e.g., this book for details). Of course, this requires some considerable effort, and you might be better off by really trying out the sequential algorithms first.
A nice way of augmenting the classical DES-view of logical processes (= agents) that communicate with each other via events could be to blend in some ideas from other formalisms used to describe discrete-event systems, such as DEVS. In DEVS, each entity can specify the duration it will be in a certain state (e.g., the agent blocking a resource), and will only be interrupted by incoming messages (and then change its state accordingly, e.g. the agent freeing the resource).
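A small sketch of that DEVS-flavoured combination, with invented names: a single time-ordered event queue, agents that schedule their own future events, and an incoming higher-priority message that preempts what an agent had planned (the stale "release" event is simply ignored when it comes up).

import heapq

events = []          # heap of (time, seq, agent, message)
seq = 0

def schedule(t, agent, message):
    global seq
    heapq.heappush(events, (t, seq, agent, message))
    seq += 1

class MachineAgent:
    def __init__(self, name):
        self.name = name
        self.release_planned_at = None

    def handle(self, now, message):
        if message == "start-job":
            # Block the resource for an hour of simulated time.
            self.release_planned_at = now + 3600
            schedule(self.release_planned_at, self, "release")
        elif message == "preempt":
            # A higher-priority agent asked for the resource right away.
            print(f"{now:7.1f}s {self.name}: preempted, releasing early")
            self.release_planned_at = None
        elif message == "release" and self.release_planned_at == now:
            print(f"{now:7.1f}s {self.name}: finished normally, releasing")

machine = MachineAgent("press-01")
schedule(0.0, machine, "start-job")
schedule(120.0, machine, "preempt")     # arrives well before the planned release

while events:
    now, _, agent, message = heapq.heappop(events)
    agent.handle(now, message)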
BTW, in which sense do you think the agents are too complex to be handled with discrete-event simulation? If you regard each agent as a logical process, it doesn't really matter how complex it is from a simulation point of view - or am I getting something wrong here?