How to determine Last write win on concurrent Vector clocks? - distributed-computing

I'd like to keep only the most recent data, and to use vector clocks to resolve conflicts so that I can discard stale data with the last-write-wins (LWW) rule.
Say we have 3 nodes:
- Node1
- Node2
- Node3
We would then use vector clocks to track causality and concurrency for each event/change. Initially the vector clock is
{Node1:0, Node2:0, Node3:0}.
For instance, if Node1 makes 5 local changes, we increment its own counter 5 times, which results in
{Node1: 5, Node2:0, Node3:0}.
This would normally be okay, right?
Now what if, at the same time, Node2 makes one local update and increments its own counter, resulting in
{Node1:0, Node2:1, Node3:0}.
At some point Node1 sends an event to Node3, passing its updates and piggybacking its vector clock. Node3, which has a VC of {Node1:0, Node2:0, Node3:0}, can simply merge the data and the clock, as it has no changes of its own yet.
The problem I'm trying to solve is what happens when Node2 then sends an update event to Node3, passing its own VC and updates.
What happens to the data and the clocks? How do I apply last-write-wins here, when the write that reached Node3 first (the one from Node1) would appear to be the later write because it has a greater value on its own clock entry?
Node3's clock before merging: {Node1: 5, Node2: 0 , Node3: 1}
Node2's message VC that Node3 received: {Node1:0, Node2:1, Node3:0}
How do I handle resolving data on concurrent VCs?

This is a good question. You're running into this issue because you are using counters in your vector clocks, and you are not synchronizing the counters across nodes. You have a couple of options:
Submit all writes through one primary server. The primary server can apply a total order to all of the writes and then send them to the individual nodes to be stored. It would be helpful to have some background on your system. For example, why are there three independent nodes? Do they exist to provide replication and availability? If so, this primary server approach would work well.
Keep your servers' clocks in sync, as described in Google's Spanner paper. Then, instead of using a monotonically increasing counter for each node in the vector clock, you can use a timestamp based on the server's time. Again, having background on your system would be helpful. If your system just consists of human users submitting writes, then you might be able to get away with keeping your servers' clocks loosely in sync using NTP without violating the LWW invariant.
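To make the concurrency problem from the question concrete, here is a minimal Python sketch. It compares two vector clocks, and when they turn out to be concurrent it falls back to a wall-clock timestamp, in the spirit of the second option. The `Write` record, the `timestamp` field, and the node-name tiebreak are assumptions made for illustration; they are not part of the original question.

```python
from dataclasses import dataclass

NODES = ("Node1", "Node2", "Node3")


def compare(a, b):
    """Return 'before', 'after', 'equal' or 'concurrent' for vector clocks a and b."""
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in NODES)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in NODES)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither clock dominates the other


def merge(a, b):
    """Entry-wise maximum, as done when a node receives a message."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in NODES}


@dataclass
class Write:
    value: str
    clock: dict
    timestamp: float  # assumption: wall-clock time from loosely synced servers
    node: str         # assumption: used only to break exact timestamp ties


def resolve(local, incoming):
    """Pick the surviving write under last-write-wins."""
    order = compare(local.clock, incoming.clock)
    if order == "before":
        return incoming  # incoming causally follows local
    if order in ("after", "equal"):
        return local
    # Concurrent: the vector clocks cannot decide, so fall back to timestamps.
    return max(local, incoming, key=lambda w: (w.timestamp, w.node))
```

With the clocks from the question, `compare({"Node1": 5, "Node2": 0, "Node3": 1}, {"Node1": 0, "Node2": 1, "Node3": 0})` reports the two writes as concurrent, which is why the counters alone cannot say which write was last.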

Related

Quarkus Scheduled Records Processing mechanism Best Practice

What is the best practice for processing records from a DB on a schedule?
Situation:
A Microservice based on Quarkus - responsible for sending a communication to customers.
DB Table Having Customers Records (100000 customers)
Microservice is running on multiple nodes (4 nodes)
Expectation:
There should be a scheduler that runs every 5 sec
Fetches the records from DB where employee status = pending
Should be Multithreaded architecture.
Send email to employee email.
Problem 1:
The same scheduler runs on multiple nodes, picks the same records, and processes them. How can we avoid this?
Problem 2:
The scheduler picks 100 records and processing them takes more than 5 seconds, so the scheduler runs again and picks some of the same records. How can we avoid that?
If you are planning to run your microservices on Kubernetes, I would suggest using an external component as the scheduler and letting that component distribute the work over your microservices using messages or HTTP invocations.
To answer your questions:
You can use a locking strategy, or "reserve" each row by adding a field that indicates the record is being processed and excluding all records with this field set from your query. That way, when the scheduler fires, it reads a set of unreserved rows and uses a multithreaded approach to process them. By using a locking strategy (pessimistic or optimistic) you prevent other workers from marking the same row as reserved for themselves. The thread that was able to commit the reservation then processes the record and either updates its state or releases the "reserve" so other workers can work on the record if needed.
You can always instruct your scheduler not to execute while a previous execution is still running:
@Scheduled(identity = "ProcessUpdateScheduler", every = "2s", concurrentExecution = Scheduled.ConcurrentExecution.SKIP)
You mainly have two approaches among other possible ones:
Pulling (distributed mining or work distribution): each instance of the microservice picks a random pending row and marks it as "processing", committing the transaction. If it is able to commit, this instance holds the right to process the record and continues with its execution; if not, it tries to retrieve a different row or just exits and waits for the next invocation. This approach scales horizontally, because adding more workers increases your processing throughput.
Pushing (central distribution, distributed processing): you have two kinds of components. First, the "Distributor", which is executed by the scheduler and is responsible for picking rows to be processed and marking them as "pending processing"; these rows are forwarded via a messaging system or HTTP call to the "Processor". The Processor component receives a record as input and is responsible for either processing it completely or releasing the hold (the "pending processing" state).
Choose the one best suited to your scenario. If you go for the second option, you can have one or more distributors if necessary, but to increase your processing throughput you only need to scale the "Processor" workers.
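The "reserve a row, then process it" idea behind the pulling approach can be sketched with a conditional UPDATE. The example below is a minimal Python illustration using an in-memory SQLite database, not Quarkus code; the `customers` table layout, the `worker` column, and the `claim_batch` helper are hypothetical names chosen for the sketch. The key point is that the UPDATE only succeeds while the row is still 'pending', so a row is claimed by at most one worker.

```python
import sqlite3


def claim_batch(conn, worker_id, limit):
    """Try to reserve up to `limit` pending rows; return the ids actually claimed."""
    candidates = [row[0] for row in conn.execute(
        "SELECT id FROM customers WHERE status = 'pending' ORDER BY id LIMIT ?",
        (limit,),
    )]
    claimed = []
    for row_id in candidates:
        # Optimistic reservation: only succeeds if nobody claimed the row first.
        cur = conn.execute(
            "UPDATE customers SET status = 'processing', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker_id, row_id),
        )
        if cur.rowcount == 1:
            claimed.append(row_id)
    conn.commit()
    return claimed


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(id INTEGER PRIMARY KEY, email TEXT, status TEXT, worker TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers (email, status) VALUES (?, 'pending')",
        [(f"user{i}@example.com",) for i in range(5)],
    )
    conn.commit()
    first = claim_batch(conn, "node-a", 3)   # first scheduler run
    second = claim_batch(conn, "node-b", 3)  # overlapping run on another node
    return first, second
```

In `demo()` the second call never receives a row that the first call already reserved. On a real multi-node database the same pattern applies, but the exact locking behaviour depends on the database and isolation level, so treat this as a sketch of the idea rather than a drop-in solution.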

What is the purpose of Chubby Sequencers

While reading Google's paper about Chubby, I didn't really understand the purpose of sequencers.
Assume we have 4 entities :
Chubby cell
Client 1
Client 2
Service we want to use and where we will send the requests (for which we need the lock)
As far as I understood the steps are:
Client 1 sends lock_request() to the Chubby cell; Chubby responds with a sequencer (assume SequenceNumber = 1)
Client 1 sends a modify_data() request with the sequencer (SequenceNumber = 1) to the Service
The Service asks the Chubby cell whether the SequenceNumber is valid (=1)
Chubby acknowledges it and sets LeasePeriod (the lock expiration period) to, say, 60 seconds
! during this time no one is able to acquire the lock
After the acknowledgement, the Service caches the data about Client 1 (SequenceNumber = 1) for, say, 40 seconds
Now:
if Client 2 tries to acquire the lock during these 60 seconds, it will be rejected by the Chubby cell
that means it is impossible for Client 2 to acquire the lock with the next SequenceNumber = 2 and send anything to the Service
As far as I understand, the whole purpose of the SequenceNumber is the situation where 2 requests come to the Service and the Service can just compare the 2 SequenceNumbers and reject the lower one, without needing to ask the Chubby cell
but how can this situation ever happen if we have caches and Client 2 cannot get the lock while Client 1 is holding it?
It is a mistake to reason about timing in distributed systems using actual times (like seconds), but I'll try to answer using the same semantics.
As you said, say Client1 acquires a write lock named foo1,
foo here being the lock name and 1 being the generation number.
Now say the lease period is 60 seconds. At the 58th second, Client1 sends a write, say R1.
And soon after, Client1 dies.
Now, here's the catch. You assumed in your analysis that R1 would reach
the server within the remaining 2 seconds, before another client, say Client2, becomes master.
THAT'S JUST NOT CERTAIN.
In a distributed system, with network latencies ranging from fractions of a millisecond on one hand to network partitions on the other,
you just cannot be certain what reaches the master first: R1 or Client2's request to become master.
This is where sequence numbers help.
The master, now knowing that foo2 exists, can reject R1, which arrived with foo1 in its metadata.
Read more about generational clocks/logical clocks here.
A logical clock is a mechanism for capturing chronological and causal relationships in a distributed system. Often, distributed systems may have no physically synchronous global clock. Fortunately, in many applications (such as distributed GNU make), if two processes never interact, the lack of synchronization is unobservable. Moreover, in these applications, it suffices for the processes to agree on the event ordering (i.e., logical clock) rather than the wall-clock time.
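To make the role of the generation number concrete, here is a minimal Python sketch of a service that remembers the highest generation it has seen and rejects requests carrying an older one. The `Service` class and its method names are hypothetical; the sketch only models the comparison step and does not talk to a real lock service.

```python
class Service:
    """Hypothetical service whose writes are guarded by lock generation numbers."""

    def __init__(self):
        self.highest_generation = 0  # highest generation observed so far
        self.data = None

    def modify_data(self, generation, value):
        """Apply the write unless it carries a stale generation number."""
        if generation < self.highest_generation:
            return False  # e.g. R1 tagged foo1 arriving after foo2 is known
        self.highest_generation = generation
        self.data = value
        return True


service = Service()
accepted_w2 = service.modify_data(2, "write from Client2")          # holder of foo2
accepted_r1 = service.modify_data(1, "delayed R1 from Client1")     # stale foo1
```

Here `accepted_w2` is `True` and `accepted_r1` is `False`: the delayed request from the dead client is rejected purely by comparing generation numbers, without any assumption about how long it spent in the network.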

Can "observer" nodes in zookeeper respond with stale results?

This question is in reference to https://zookeeper.apache.org/doc/trunk/zookeeperObservers.html
Observers are non-voting members of an ensemble which only hear the
results of votes, not the agreement protocol that leads up to them.
Other than this simple distinction, Observers function exactly the
same as Followers - clients may connect to them and send read and
write requests to them. Observers forward these requests to the Leader
like Followers do, but they then simply wait to hear the result of the
vote. Because of this, we can increase the number of Observers as much
as we like without harming the performance of votes.
Observers have other advantages. Because they do not vote, they are
not a critical part of the ZooKeeper ensemble. Therefore they can
fail, or be disconnected from the cluster, without harming the
availability of the ZooKeeper service. The benefit to the user is that
Observers may connect over less reliable network links than Followers.
In fact, Observers may be used to talk to a ZooKeeper server from
another data center. Clients of the Observer will see fast reads, as
all reads are served locally, and writes result in minimal network
traffic as the number of messages required in the absence of the vote
protocol is smaller.
1) non-voting members of an ensemble - What do the voting members vote on?
2) How does an update request work for observers - When a ZK leader gets an update request, it requires a quorum of nodes to respond. Observer nodes do not seem to count toward the quorum. Does that mean an observer node lags behind the leader node for updates? If that is true, how does it ensure that observer nodes do not respond with stale data during reads?
3) Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller - Reads from all the other nodes will also be local only because they are in-sync with the leader, no? And I did not get the part about writes.
Answers to these questions should help in understanding ZooKeeper and distributed systems in general. I'd appreciate a good, detailed answer. Thanks in advance!
1) non-voting members of an ensemble - What do the voting members vote on?
Typical members of the ensemble (not observers) vote on success/failure of proposed changes coordinated by the leader. There is some further discussion of the details in the paper ZooKeeper: Wait-free coordination for Internet-scale systems.
2) How does an update request work for observers - When a ZK leader gets an update request, it requires a quorum of nodes to respond. Observer nodes do not seem to count toward the quorum. Does that mean an observer node lags behind the leader node for updates? If that is true, how does it ensure that observer nodes do not respond with stale data during reads?
You are correct that observer nodes are not considered necessary participants in the quorum. In general, update lag will be subject to network latency between the observer and the leader. (Whether or not this is noticeable is subject to specific external factors, such as whether or not the observer and leader are in the same data center with a low-latency network link.)
Note that even without use of observers, there is no guarantee that every server in the ensemble is always completely up to date. The Apache ZooKeeper documentation on Consistency Guarantees contains this disclaimer:
Sometimes developers mistakenly assume one other guarantee that ZooKeeper does not in fact make. This is:
Simultaneously Consistent Cross-Client Views ZooKeeper does not
guarantee that at every instance in time, two different clients will
have identical views of ZooKeeper data. Due to factors like network
delays, one client may perform an update before another client gets
notified of the change. Consider the scenario of two clients, A and B.
If client A sets the value of a znode /a from 0 to 1, then tells
client B to read /a, client B may read the old value of 0, depending
on which server it is connected to. If it is important that Client A
and Client B read the same value, Client B should should call the
sync() method from the ZooKeeper API method before it performs its
read.
However, clients of ZooKeeper will never appear to "go back in time" by reading stale data from a point in time prior to the data they already read. This is accomplished by attaching a monotonically increasing transaction ID (called "zxid") to each ZooKeeper transaction. When the ZooKeeper client interacts with a server, it compares the client's last seen zxid to the current zxid of the server. If the server is behind the client, then it will not allow the client's next read to be processed by that server.
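The zxid comparison described above can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions, not the ZooKeeper client API: the `Server` and `Client` classes are hypothetical, and the check is shown per read, whereas a real client tracks the zxid as part of its session with a server.

```python
class Server:
    """Hypothetical ZooKeeper-like server holding one value at some zxid."""

    def __init__(self, name, zxid, value):
        self.name = name
        self.zxid = zxid
        self.value = value


class Client:
    """Hypothetical client that never reads from a server older than its own view."""

    def __init__(self):
        self.last_seen_zxid = 0

    def read(self, servers):
        for server in servers:
            if server.zxid < self.last_seen_zxid:
                continue  # server is behind what this client has already seen
            self.last_seen_zxid = server.zxid
            return server.name, server.value
        raise RuntimeError("no sufficiently up-to-date server available")
```

A client that has read from a server at zxid 10 will skip an observer that is still at zxid 7, so it never goes back in time. A brand-new client may still read the older value from that observer, which matches the documented lack of simultaneously consistent cross-client views.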
3) Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller - Reads from all the other nodes will also be local only because they are in-sync with the leader, no? And I did not get the part about writes.
It's important to note that this statement from the documentation is written in the context of an important use-case for observers: multiple data center deployments with higher network latency between different data centers. In this statement, "served locally" means served from a ZooKeeper server within the same data center as the client, so that it doesn't suffer from the longer latency of connecting to another data center. For full context, here is a copy of the full quote:
In fact, Observers may be used to talk to a ZooKeeper server from another data center. Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller.

Communication protocol

I'm developing a distributed system that consists of master and worker servers. There should be 2 kinds of messages:
Heartbeat
The master gets the state of a worker and responds immediately with an appropriate command. For instance:
Message from Worker to Master: "Hey there! I have data a,b,c"
Response from Master to Worker: "All ok, but throw away c - we don't need this anymore"
The participants exchange these messages at interval T.
Direct master command
Let's say a client asks the master to kill job #123. Here is the conversation:
Message from Master to Worker: "Alarm! We need to kill job #123"
Message from Worker to Master: "No problem! Done."
Obviously, we can't predict when this message will appear.
The simplest solution is for the master to initiate all communication for both message types (for the heartbeat, we would add another message from the master to start the exchange). But let's assume that it is expensive to do all the heartbeat housekeeping on the master side for N workers, and that we don't want to waste resources keeping several TCP connections to each worker server, so we have just one.
Is there any solution for these constraints?
First off, you have to do some bookkeeping somewhere. Otherwise, who's going to realize that a worker has died? The natural place to put that data is on the master, if you're building a master/worker system. Otherwise, the workers could be asked to keep track of each other in a long circle, or a randomized graph. If a worker notices that their accountabilibuddy is not responding anymore, it can alert the master.
The same thing applies to the list of jobs currently running; who keeps track of that? It also scales O(n), so presumably the master doesn't have space for that either. Sharding that data out among the workers (e.g. by keeping track of what things their accountabilibuddy is supposed to be doing) only works so far; if a and b crash, and a is the only one looking after b, you just lost the list of jobs running on b (and possibly the alert that was supposed to notify you that b crashed).
I'd recommend a distributed consensus algorithm for this kind of task. For production, use something someone else has already written; they probably know what they're doing. If it's for learning purposes, which I presume, have a look at the raft consensus algorithm. It's not too hard to understand, but still highlights a lot of the complexity in distributed systems. The simulator is gold for proper understanding.
A master/worker system will never properly work with less than O(n) resources for n workers in the face of crashing workers. By definition, the master needs to control the workers, which is an O(n) job, even if some workers manage other workers. Also, what happens if the master crashes?
As Filip Haglund said, read the Raft paper; you should also implement it yourself. In a nutshell, here is what you need to extract from it with regard to membership management.
You need to keep the membership list and the master's identity on all nodes.
Raft sends its heartbeats from the master's end. This is not very expensive network-wise, and you don't need to keep the connections open. The master needs to send a heartbeat every 200 ms to 1 second; if a member doesn't reply, the master tells the slaves to remove member x from the list.
As for what to do if the master dies: you need to preset candidate nodes. If a candidate hasn't received a heartbeat within the timeout, it requests votes from the rest of the cluster. If it gets even the slightest majority, it becomes the new leader.
If a node wants to join an existing cluster, the process is basically the same as above; a node that is not the leader responds "not leader" along with the leader's address.
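The bookkeeping described in this answer can be sketched in Python as follows. It is a toy model, not an implementation of Raft: the `Membership` class, the explicit `now` parameter (used instead of a real clock so the example is deterministic), and the `wins_election` helper are all assumptions made for illustration.

```python
class Membership:
    """Hypothetical master-side table of when each worker was last heard from."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, worker, now):
        """Record a heartbeat reply from `worker` at time `now` (seconds)."""
        self.last_seen[worker] = now

    def expire(self, now):
        """Remove and return the workers that have been silent for too long."""
        dead = [w for w, t in self.last_seen.items() if now - t > self.timeout]
        for worker in dead:
            del self.last_seen[worker]
        return sorted(dead)


def wins_election(votes, cluster_size):
    """A candidate becomes leader only with a strict majority of the cluster."""
    return votes > cluster_size // 2
```

Note that the table holds one entry per worker, which is the O(n) cost the previous answer says a master cannot avoid.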

What to do if the leader fails in Multi-Paxos for master-slave systems?

Background:
In section 3, named Implementing a State Machine, of Lamport's paper Paxos Made Simple, Multi-Paxos is described. Multi-Paxos is used in Google Paxos Made Live. (Multi-Paxos is used in Apache ZooKeeper). In Multi-Paxos, gaps can appear:
In general, suppose a leader can get α commands ahead--that is, it can propose commands i + 1 through i + α after commands 1 through i are chosen. A gap of up to α - 1 commands could then arise.
Now consider the following scenario:
The whole system uses master-slave architecture. Only the master serves client commands. Master and slaves reach consensus on the sequence of commands via Multi-Paxos. The master is the leader in Multi-Paxos instances. Assume now the master and two of its slaves have the states (commands have been chosen) shown in the following figure:
[Figure not preserved: commands chosen at the master and at its two slaves]
Note that there is more than one gap in the master's state. Due to asynchrony, the two slaves lag behind. At this point, the master fails.
Problem:
What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?
In particular, how should they handle the gaps and the commands they are missing relative to the old master?
Update about Zab:
As @sbridges has pointed out, ZooKeeper uses Zab instead of Paxos. To quote,
Zab is primarily designed for primary-backup (i.e., master-slave) systems, like ZooKeeper, rather than for state machine replication.
It seems that Zab is closely related to my problems listed above. According to the short overview paper of Zab, Zab protocol consists of two modes: recovery and broadcast. In recovery mode, two specific guarantees are made: never forgetting committed messages and letting go of messages that are skipped. My confusion about Zab is:
In recovery mode does Zab also suffer from the gaps problem? If so, what does Zab do?
The gaps should be the Paxos instances that have not reached agreement. In the paper Paxos Made Simple, a gap is filled by proposing a special “no-op” command that leaves the state unchanged.
If you care about the order of the chosen values across Paxos instances, you'd better use Zab instead, because Paxos does not preserve causal order. https://cwiki.apache.org/confluence/display/ZOOKEEPER/PaxosRun
The missing commands should be the Paxos instances that have reached agreement but have not been learned by the learner. The value is immutable because it has been accepted by a quorum of acceptors. When you run a Paxos instance with this instance id, the same value will be recovered in phase 1b and chosen again.
When the slaves/followers detect a failure of the leader, or the leader loses the support of a quorum of slaves/followers, they should elect a new leader.
In ZooKeeper, there should be no gaps, as the follower communicates with the leader over TCP, which preserves FIFO order.
In recovery mode, after the leader is elected, the follower first synchronizes with the leader, and applies the modifications to its state until NEWLEADER is received.
In broadcast mode, the follower queues each PROPOSAL in pendingTxns and waits for the COMMITs in the same order. If the zxid of a COMMIT does not match the zxid at the head of pendingTxns, the follower exits.
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Zab1.0
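The broadcast-mode behaviour described above (queue each PROPOSAL, expect COMMITs in the same order) can be modelled with a FIFO queue. The Python sketch below is an illustration only: the `Follower` class and its method names are hypothetical, only `pendingTxns` and zxid come from the answer, and raising an exception stands in for the follower exiting.

```python
from collections import deque


class Follower:
    """Toy model of a follower in broadcast mode."""

    def __init__(self):
        self.pending_txns = deque()  # proposals received but not yet committed
        self.committed = []

    def on_proposal(self, zxid, txn):
        self.pending_txns.append((zxid, txn))

    def on_commit(self, zxid):
        # A COMMIT must match the oldest pending proposal, otherwise order was broken.
        if not self.pending_txns or self.pending_txns[0][0] != zxid:
            raise RuntimeError(f"COMMIT for zxid {zxid} does not match head of queue")
        self.committed.append(self.pending_txns.popleft())
```

Because commits are only ever taken from the head of the queue, the committed list is always a gap-free prefix of the proposals, which is the property the answer attributes to FIFO delivery.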
Multi-Paxos is used in Apache ZooKeeper
ZooKeeper uses Zab, not Paxos. See this link for the difference.
In particular, each ZooKeeper node in an ensemble commits updates in the same order as every other node:
Unlike client requests, state updates must be applied in the exact
original generation order of the primary, starting from the original
initial state of the primary. If a primary fails, a new primary that
executes recovery cannot arbitrarily reorder uncommitted state
updates, or apply them starting from a different initial state.
Specifically, the Zab paper says that a newly elected leader undertakes discovery to learn the next epoch number to set and which follower has the most up-to-date commit history. Each follower sends an ACK-E message which states the max contiguous zxid it has seen. The paper then says that the leader undertakes a synchronisation phase in which it transmits to followers the state they have missed. It notes that an interesting optimisation is to only elect a leader which has the most up-to-date commit history.
With Paxos you don't have to allow gaps. If you do allow gaps then the paper Paxos Made Simple explains how to resolve them from page 9. A new leader knows the last committed value it saw and possibly some committed values above. It probes the slots from the lowest gap it knows about by running phase 1 to propose to those slots. If there are values in those slots it runs phase 2 to fix those values but if it is free to set a value it sets no-op value. Eventually it gets to the slot number where there have been no values proposed and it runs as normal.
In answer to your questions:
What should the slaves do after they have detected the failure of the master (for example, by heartbeat mechanism)?
They should attempt to lead after a randomised delay to try to reduce the risk of two candidates proposing at the same time which would waste messages and disk flushes as only one can lead. Randomised leader timeout is well covered in the Raft paper; the same approach can be used for Paxos.
In particular, how should they handle the gaps and the commands they are missing relative to the old master?
The new leader should probe the gaps and fix each one to either the value with the highest proposal number accepted for that slot or, if there is none, a no-op. Once it has filled in the gaps, it can lead as normal.
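The gap-filling rule can be illustrated with a small Python function. It only computes which command the new leader would end up proposing for each slot; it does not run the Paxos phases themselves. The `fill_gaps` name, the dictionary shapes, and the `"no-op"` marker are assumptions for the sketch: `chosen` maps slots to commands the new leader already knows are chosen, and each entry of `accepted_reports` is what one acceptor reported in phase 1 as `slot -> (proposal_number, command)`.

```python
NOOP = "no-op"


def fill_gaps(chosen, accepted_reports):
    """Return the command for every slot from 1 up to the highest slot seen."""
    highest = max([0, *chosen, *(slot for r in accepted_reports for slot in r)])
    log = []
    for slot in range(1, highest + 1):
        if slot in chosen:
            log.append(chosen[slot])  # already known to be chosen
            continue
        candidates = [r[slot] for r in accepted_reports if slot in r]
        if candidates:
            # Some acceptor accepted a value: re-propose the one with the
            # highest proposal number.
            log.append(max(candidates, key=lambda pc: pc[0])[1])
        else:
            log.append(NOOP)  # nothing was accepted here, so fill with a no-op
    return log
```

For example, with slots 1, 2 and 5 known to be chosen, slot 3 accepted by some acceptors and slot 4 untouched, the function keeps the chosen commands, recovers slot 3 and fills slot 4 with a no-op.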
@Hailin's answer explains the gap problem as follows:
In ZooKeeper, there should be no gaps, as the follower communicates with the leader over TCP, which preserves FIFO order.
To supplement:
The paper A simple totally ordered broadcast protocol mentions that ZooKeeper requires the prefix property:
If $m$ is the last message delivered for a leader $L$, any message proposed before $m$ by $L$ must also be delivered.
This property mainly relies on the TCP mechanism used in Zab. The Zab wiki mentions that an implementation of Zab must satisfy the following assumption (among others):
Servers must process packets in the order that they are received. Since TCP maintains ordering when sending packets, this means that packets will be processed in the order defined by the sender.