Dijkstra's token termination algorithm in Promela

My teacher has assigned us to program Dijkstra's token termination algorithm in Promela. Here is the algorithm:
"Every node maintains a counter c. Sending a message increases c by one; the receipt of a message decreases c by one. The sum of all counters thus equals the number of messages pending in the network.
When node 0 initiates a detection probe, it sends a token with a value 0 to node N-1. Every node i keeps the token until it becomes passive; it then forwards the token to node i-1 increasing the token value by c.
Every node and also the token has a color (initially all white). When a node receives a message, the node turns black. When a node forwards the token, the node turns white. If a black machine forwards the token, the token turns black; otherwise the token keeps its color.
When node 0 receives the token again, it can conclude termination, if
node 0 is passive and white,
the token is white, and
the sum of the token value and c is 0.
Otherwise, node 0 may start a new probe."
I somewhat get the idea, but I am having trouble really understanding what is going on. If someone could help explain exactly what the algorithm is doing step-by-step, that would be great. Also, while I don't want anyone writing the code for me, any tips on how to get started would be helpful.
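Here is how I currently picture a single probe round, as a rough Python sketch (the names Node and run_probe are just my own for illustration, not Promela, and it assumes all nodes are already passive when the token passes them):

N = 4

class Node:
    def __init__(self, i):
        self.i = i
        self.c = 0            # sends minus receipts; summed over all nodes = messages in flight
        self.color = "white"
        self.passive = True

    def send(self):
        self.c += 1           # sending a basic message increases c

    def receive(self):
        self.c -= 1           # receiving a basic message decreases c,
        self.color = "black"  # blackens the node,
        self.passive = False  # and may reactivate it

def run_probe(nodes):
    # node 0 starts a probe: a white token with value 0 travels N-1, N-2, ..., 1, 0
    token_value, token_color = 0, "white"
    for i in range(len(nodes) - 1, 0, -1):
        node = nodes[i]
        # in the real algorithm node i keeps the token until it is passive
        token_value += node.c
        if node.color == "black":
            token_color = "black"  # a black node blackens the token it forwards
        node.color = "white"       # forwarding the token whitens the node
    n0 = nodes[0]
    return (n0.passive and n0.color == "white"
            and token_color == "white"
            and token_value + n0.c == 0)

nodes = [Node(i) for i in range(N)]
print(run_probe(nodes))              # True: nothing was ever sent, termination detected

nodes[2].send(); nodes[0].receive()  # one message from node 2 delivered to node 0
print(run_probe(nodes))              # False: node 0 turned black, so node 0 must start a new probe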


What are the limits on actor events in Service Fabric?

I am currently testing the scaling of my application and I ran into something I did not expect.
The application is running on a 5-node cluster; it has multiple services/actor types and uses a shared process model.
One component uses actor events as a best-effort pub/sub system (there are fallbacks in place, so if a notification is dropped there is no issue).
The problem arises when the number of actors (i.e. subscription topics) grows. The actor service is currently partitioned into 100 partitions.
The number of topics at that point is around 160,000, where each topic is subscribed to 1-5 times (on the nodes where it is needed), with an average of 2.5 subscriptions (roughly 400k subscriptions in total).
At that point communication in the cluster starts breaking down: new subscriptions are not created and unsubscribes time out.
It also affects other services: internal calls to a diagnostics service (asking each of the 5 replicas) are timing out. This is probably due to the resolving of partition/replica endpoints, as outside calls to the webpage are fine (those endpoints use the same technology/code stack).
The Event Viewer is full of warnings and errors like:
EventName: ReplicatorFaulted Category: Health EventInstanceId {c4b35124-4997-4de2-9e58-2359665f2fe7} PartitionId {a8b49c25-8a5f-442e-8284-9ebccc7be746} ReplicaId 132580461505725813 FaultType: Transient, Reason: Cancelling update epoch on secondary while waiting for dispatch queues to drain will result in an invalid state, ErrorCode: -2147017731
10.3.0.9:20034-10.3.0.13:62297 send failed at state Connected: 0x80072745
Error While Receiving Connect Reply : CannotConnect , Message : 4ba737e2-4733-4af9-82ab-73f2afd2793b:382722511 from Service 15a5fb45-3ed0-4aba-a54f-212587823cde-132580461224314284-8c2b070b-dbb7-4b78-9698-96e4f7fdcbfc
I've tried scaling the application without this subscribe model active, and I easily reach a workload twice as large without any issues.
So there are a couple of questions:
Are there limits known/advised for actor events?
Would increasing the partition count and/or node count help here?
Is the communication interference logical? Why are other service endpoints having issues as well?
After some time spent on the support ticket we found some info, so I will post my findings here in case it helps someone.
The actor events use a resubscription model to make sure they are still connected to the actor. By default this is done every 20 seconds. This meant a lot of resources were being used, and eventually the whole system was overloaded with loads of idle threads waiting to resubscribe.
You can decrease the load by setting the resubscriptionInterval to a higher value when subscribing. The drawback is that the client will potentially miss events in the meantime (if a partition is moved).
To counteract the delay in resubscribing, it is possible to hook into the lower-level Service Fabric events. The following pseudo code was offered to me in the support call.
1. Register for endpoint change notifications for the actor service.
// resubscriptions maps a partition id to the subscription keys on that partition;
// assumed here to be a ConcurrentDictionary<Guid, List<long>> (GetOrAdd is used further below).
fabricClient.ServiceManager.ServiceNotificationFilterMatched += (o, e) =>
{
    var notification = ((FabricClient.ServiceManagementClient.ServiceNotificationEventArgs)e).Notification;
    /*
     * Add additional logic for optimizations:
     * - check that the endpoint is not empty
     * - if multiple listeners are registered, check that the endpoint change notification is for the desired endpoint
     * Please note, all the endpoints are sent in the notification. User code should
     * cache the endpoint seen during the subscription call and compare it with the newer one.
     */
    List<long> keys;
    if (resubscriptions.TryGetValue(notification.PartitionId, out keys))
    {
        foreach (var key in keys)
        {
            // 1. Unsubscribe the previous subscription by calling ActorProxy.UnsubscribeAsync()
            // 2. Resubscribe by calling ActorProxy.SubscribeAsync()
        }
    }
};
await fabricClient.ServiceManager.RegisterServiceNotificationFilterAsync(
    new ServiceNotificationFilterDescription(new Uri("<service name>"), true, true));
2. Change the resubscription interval to a value which fits your needs.
3. Cache the partition id to actor id mapping. This cache will be used to resubscribe when the replica's primary endpoint changes (see step 1).
await actor.SubscribeAsync(handler, TimeSpan.FromHours(2) /*Tune the value according to the need*/);
ResolvedServicePartition rsp;
((ActorProxy)actor).ActorServicePartitionClientV2.TryGetLastResolvedServicePartition(out rsp);
var keys = resubscriptions.GetOrAdd(rsp.Info.Id, key => new List<long>());
keys.Add(communicationId);
The above approach ensures the following:
The subscriptions are renewed at regular intervals.
If the primary endpoint changes in between, the ActorProxy resubscribes from the service notification callback.
This ends the pseudo code from the support call.
Answering my original questions:
Are there limits known/advised for actor events?
No hard limits, only resource usage.
Would increasing the partition count and/or node count help here?
Increasing the partition count will not; increasing the node count might, but only if it means there are fewer subscribing entities on a node.
Is the communication interference logical? Why are other service endpoints having issues as well?
Yes, resource contention is the reason.

Error "Attempt to transfer unavailable funds"

I have a problem with "quick" asset transfers through several accounts. For example, I have 3 accounts: A, B (no assets), C.
I transfer some amount of an asset from account A to account B, then look for the "A->B" transaction (/transactions/info/{id}).
If the transaction is found, I transfer the same amount from account B to account C. In most cases everything is fine, but sometimes I get an error:
StateCheckFailedException: State check failed. Reason: Attempt to
transfer unavailable funds: Transaction application leads to negative
asset 'IssuedAsset(...)' balance to (at least) temporary negative
state, the current balance is 0 ...
If I wait for 10 seconds, for example, the "B->C" transfer succeeds. So it seems that I should wait for some synchronization of the accounts' balances in the node.
Is there any guaranteed way in my case to make the "B->C" transfer without waiting for an undetermined amount of time? Checking for the "A->B" transaction's presence in the blockchain doesn't always work.
I use my own node for broadcasting the transactions. The node's configuration is the default, version 1.1.7.
This is due to microblocks. The microblocks did transfer the funds, but they have not been permanently confirmed yet and are therefore not yet fully in the blockchain.
If you want to be fully sure, I would say wait one block (or 60 seconds). But even then it might not always be confirmed, since even the Waves chain has moments where it is overloaded with thousands of transactions for a few minutes.
It's the same with other actions, like creating assets etc.
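A minimal Python sketch of that waiting: it polls until the "A->B" transaction is at least one full block deep before you broadcast "B->C". It uses the /transactions/info/{id} endpoint you already call and assumes the node also exposes /blocks/height; the node URL is a placeholder for your own node.

import time
import requests

NODE_URL = "http://localhost:6869"   # placeholder: your own node's REST address

def wait_for_confirmations(tx_id, min_confirmations=1, poll_seconds=5):
    while True:
        info = requests.get(f"{NODE_URL}/transactions/info/{tx_id}")
        if info.status_code == 200:
            tx_height = info.json().get("height")
            if tx_height is not None:                        # tx has been put into a block
                tip = requests.get(f"{NODE_URL}/blocks/height").json()["height"]
                if tip - tx_height + 1 >= min_confirmations:
                    return                                   # buried deep enough
        time.sleep(poll_seconds)                             # not mined yet, or not deep enough

# after broadcasting A->B:
# wait_for_confirmations(a_to_b_tx_id, min_confirmations=1)
# ...now broadcast B->C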

What is the purpose of Chubby Sequencers

While reading the article from Google about Chubby, I didn't really understand the purpose of sequencers.
Assume we have 4 entities:
Chubby cell
Client 1
Client 2
Service we want to use and where we will send the requests (for which we need the lock)
As far as I understood the steps are:
Client 1 sends lock_request() to the Chubby cell; Chubby responds with a Sequencer (assume SequenceNumber = 1)
Client 1 sends the request modify_data() with the Sequencer (SequenceNumber = 1) to the Service
The Service asks the Chubby cell if the SequenceNumber is valid (=1)
Chubby acknowledges it and sets the LeasePeriod (the period of lock expiration) to (assume) 60 seconds
! during this time no one is able to acquire the lock
After the acknowledgement, the Service caches the data about Client 1 (SequenceNumber = 1) for (assume) 40 seconds
Now:
if Client 2 tries to acquire the lock during these 60 seconds, it will be rejected by the Chubby cell
that means it is impossible for Client 2 to acquire the lock with the next SequenceNumber = 2 and send anything to the Service
As far as I understand, the whole purpose of the SequenceNumber is for the situation when 2 requests come to the Service, so the Service can just compare the 2 SequenceNumbers and reject the lower one, without needing to ask the Chubby cell.
But how can this situation ever happen if we have caches and it is impossible for Client 2 to get the lock while Client 1 is holding it?
It would be a mistake to think about timing in distributed systems in terms of actual time (like seconds), but I'll try to answer using the same semantics.
As you said, say Client1 acquires a write lock named foo1, foo here being the lock name and 1 being the generation number.
Now say the lease period is 60 seconds. At the 58th second Client1 sends a write, say R1. And soon enough, Client1 is dead.
Now, here's the catch. You assumed in your analysis that R1 would reach the server within those 2 seconds, before another client, say Client2, becomes master. THAT'S JUST NOT CERTAIN.
In a distributed system, with network latencies of fractions of milliseconds on one hand and network partitions on the other, you just cannot be certain what reaches the master first: R1, or Client2's request to become master.
This is where sequence numbers help. The master, now knowing that there is a foo2, can reject R1, which came with foo1 in its metadata.
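A tiny Python sketch of that check on the service side (invented names, not the actual Chubby API): the service only remembers the highest generation it has seen per lock and rejects anything older, without asking the Chubby cell again.

class Service:
    def __init__(self):
        self.highest_generation = {}             # lock name -> highest generation seen

    def handle_write(self, lock_name, generation, payload):
        seen = self.highest_generation.get(lock_name, 0)
        if generation < seen:
            return "rejected: stale sequencer"   # issued under an older lock holder
        self.highest_generation[lock_name] = generation
        return f"applied {payload} under {lock_name}{generation}"

svc = Service()
print(svc.handle_write("foo", 2, "client2_write"))  # applied ... under foo2
print(svc.handle_write("foo", 1, "R1"))             # rejected: R1 still carries foo1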
Read more about generational clocks/logical clocks here.
A logical clock is a mechanism for capturing chronological and causal relationships in a distributed system. Often, distributed systems may have no physically synchronous global clock. Fortunately, in many applications (such as distributed GNU make), if two processes never interact, the lack of synchronization is unobservable. Moreover, in these applications, it suffices for the processes to agree on the event ordering (i.e., logical clock) rather than the wall-clock time.[1]

Does the Raft consensus protocol handle nodes that lost contact with the leader but not with the other nodes?

In the case of network partitions, Raft stays consistent. But what happens if a single node loses contact only with the leader, becomes a candidate, and calls for votes?
This is the setup; I adjusted the examples from http://thesecretlivesofdata.com/raft/ to fit my needs:
Node B is the current leader and sends out heartbeats (red) to the followers. The connection between B and C gets lost and after the election timeout C becomes a candidate, votes for itself and asks nodes A, D and E to vote for it (green).
What happens?
As far as I understand Raft, nodes A, D and E should vote for C, which makes C the next leader (term 2). We then have two leaders, each sending out heartbeats, and hopefully nodes A, D and E will ignore those from B because of the lower term.
Is this correct or is there some better mechanism?
After going through the Raft Paper again, it seems that my above approach was correct. From the paper:
Terms act as a logical clock in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server's current term is smaller than the other's, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
The last two sentences are the part I was missing above. So the process is:
After node C has become candidate, it increases its term-number to 2 and requests votes from the reachable nodes (A, D and E).
Those will immediately update their current_term variable to 2 and vote for C.
Thus, nodes A, D and E will ignore heartbeats from B and moreover tell B that the current term is 2.
B will revert to follower state (and won't get updated until the network connection between C and B is healed).
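A small sketch of that term rule in Python (illustrative only, not a full Raft implementation): any RPC carrying a higher term makes the receiver adopt it and step down to follower, and requests with a stale term are rejected.

FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"

class Node:
    def __init__(self):
        self.current_term = 1
        self.state = FOLLOWER
        self.voted_for = None

    def on_rpc_term(self, term):
        # common rule applied to RequestVote and AppendEntries alike
        if term > self.current_term:
            self.current_term = term
            self.state = FOLLOWER            # a stale leader or candidate steps down
            self.voted_for = None
        return term >= self.current_term     # False -> reject the stale request

# leader B (term 1) receives C's term-2 RequestVote once the partition heals
b = Node()
b.state = LEADER
print(b.on_rpc_term(2), b.state, b.current_term)   # True follower 2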
Since A, B and D keep exchanging healthy heartbeats with the leader B (term 1), they would not respond to the vote request from C (term 2); C will time out, repeat the vote, and time out again.
See Figure 4 of the Raft paper: https://raft.github.io/raft.pdf

Why is Paxos designed in two phases?

Why does Paxos require two phases (prepare/promise + accept/accepted) instead of a single one? That is, using only the prepare/promise portion, if the proposer has heard back from a majority of acceptors, that value is chosen.
What is the problem? Does it break safety or liveness?
It breaks safety not to follow the full protocol.
Typical implementations of multi-Paxos have a steady-state mode where a stable leader streams Accept messages containing fresh values. Only when a problem occurs (the leader crashes, stalls, or is partitioned by a network issue) does a new leader need to issue prepare messages to ensure safety. A full description of this is in the write-up of how TRex, an open source Paxos library, implements Paxos.
Consider the following crash scenario which TRex would handle properly:
Nodes A, B, C with A leading
Client application sends V1 to leader A
Node A is in steady state and so sends accept(n, V1) to nodes B and C. The network starts to fail, though, so only B sees the message, and it replies with accepted(n).
Node A sees the response and has a majority {A,B} so it knows the value is fixed due to the safety proof of the protocol.
Node A attempts to broadcast the outcome to everyone just as its server dies. Only the client application that issued V1 gets the message. Imagine that V1 is a customer order, and upon learning the order is fixed the client application debits the customer's credit card.
Node C times out on the dead leader and attempts to lead. It never saw the value V1. It cannot arbitrarily choose a new value, as that would roll back the order V1, and the customer has already been charged.
So Node C first issues a prepare(n+1) and node B responds with promise(n+1, V1).
Node C then issues accept(n+1, V1) and as long as the remaining messages get through nodes B and C will learn the value V1 was chosen.
Intuitively we can say that Node C has chosen to collaborate with the dead node A by choosing A's value. So intuitively we can see why there must be two rounds. The first round is needed to discover whether there is any pending work to finish. The second round is used to fix the correct value to give consistency across all processes within the system.
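A rough single-process sketch of an acceptor's two phases in Python (invented names, not TRex code), showing why phase 1 must report pending work: the new leader C learns of V1 from B's promise and has to re-propose it instead of a fresh value.

class Acceptor:
    def __init__(self):
        self.promised_n = 0
        self.accepted = None                       # (n, value) accepted so far, or None

    def prepare(self, n):                          # phase 1
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", n, self.accepted)   # reports any pending work
        return ("nack", self.promised_n, None)

    def accept(self, n, value):                    # phase 2
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return ("accepted", n)
        return ("nack", self.promised_n)

b = Acceptor()
b.accept(1, "V1")                  # B accepted V1 from the dead leader A
_, _, pending = b.prepare(2)       # new leader C runs phase 1 against B
value = pending[1] if pending else "some new value"
print(value)                       # V1 -- C must finish A's work, not pick its own value
print(b.accept(2, value))          # ('accepted', 2)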
It's not entirely accurate, but you can think of the two phases as 1) copying the data, and then 2) committing the data. If the data is just copied to the other servers, those servers would have no idea if enough other servers have the data for it to be considered safe to serve. Thus there is a second phase to let the servers know that they can commit the data.
Paxos is a little more complex than that, which allows it to continue during failures of either phase. Part of the Paxos proof is that it is the minimal protocol for doing this completely. That is, other protocols do more work, either because they add more capabilities, or because they were poorly designed.