ZooKeeper: do watches on nodes that are modified block all the other reads? - apache-zookeeper

My understanding of ZooKeeper is that a client will always execute requests in an ordered manner from ITS point of view.
Therefore, if client 1 issues:
write node A
two reads: node A, then node B
write node B
they will be executed in that order.
But if client 1 also has a watch on a node C, and client 2 writes to that node, does that write on node C impact/block reads from client 1?
For example:
Client 1: starts watching C
Client 1: writes node A
Client 2: writes C
(Client 1: does client 1 block until the watch on C fires? What if at this point client 1 tries to write node C?)
Client 1: three reads: node A, then B, then C
Client 1: writes node B

But if client 1 also has a watch on a node C, and client 2 writes to that node, does that write on node C impact/block reads from client 1?
It doesn't block reads from client 1, because (check here):
When a ZooKeeper object is created, two threads are created as well: an IO thread and an event thread. All IO happens on the IO thread (using Java NIO). All event callbacks happen on the event thread.
but it does impact client 1 in terms of the order in which it sees events (check here):
A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.
Next question was:
Does client 1 block until the watch on C fires?
No (see the explanation above).
What if at this point client 1 tries to write node C?
It will overwrite client 2's write, because it happens later according to the sequence in your bullets (ZooKeeper is ordered; check also here to understand how ZooKeeper achieves the ordering).
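To make this concrete, here is a minimal sketch in Go, assuming the github.com/go-zookeeper/zk client and placeholder paths /A, /B and /C (none of this comes from the question itself). The watch on /C is delivered on its own event channel, so the reads and writes below are never blocked by client 2 writing /C; the only guarantee is that the watch event is seen before a later read of /C returns the new data.
package main

import (
    "fmt"
    "time"

    "github.com/go-zookeeper/zk"
)

func main() {
    // Placeholder server address; error handling is kept minimal.
    conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Read /C once and leave a watch on it. The event arrives on its own
    // channel (the event thread), not in the call path of the requests below.
    _, _, watchC, err := conn.GetW("/C")
    if err != nil {
        panic(err)
    }
    go func() {
        ev := <-watchC // delivered before a later read of /C returns the new data
        fmt.Println("watch fired:", ev.Type, ev.Path)
    }()

    // Client 1's own requests still execute in the order they were issued,
    // whether or not client 2 writes /C in the meantime.
    conn.Set("/A", []byte("a1"), -1)
    a, _, _ := conn.Get("/A")
    b, _, _ := conn.Get("/B")
    c, _, _ := conn.Get("/C")
    fmt.Println(string(a), string(b), string(c))
    conn.Set("/B", []byte("b1"), -1)
}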

Related

How to wait until the replicas have caught up with the master

There is a MongoDB cluster (1 master, 2 replicas).
I am updating a large number of records using BulkWrite, and I need to call the next BulkWrite only after the replicas have caught up with the master, i.e. I need to make sure that the replicas have already caught up with the master for this request. I am using go.mongodb.org/mongo-driver/mongo.
Write propagation can be "controlled" with a writeconcern.WriteConcern.
WriteConcern can be created using different writeconcern.Options. The writeconcern.W() function can be used to create a writeconcern.Option that controls how many instances writes must be propagated to (or, more specifically, the write operation will wait for acknowledgements from the given number of instances):
func W(w int) Option
W requests acknowledgement that write operations propagate to the specified number of mongod instances.
So first you need to create a writeconcern.WriteConcern that is properly configured to wait for acknowledgements from 2 nodes in your case (you have 1 master and 2 replicas):
wc := writeconcern.New(writeconcern.W(2))
You now have to choose the scope of this write concern: it may be applied on the entire mongo.Client, it may be applied on a mongo.Database or just applied on a mongo.Collection.
Creating a client, obtaining a database or just a collection all have options which allow specifying the write concern, so this is really up to you where you want it to be applied.
If you just want this write concern to have effect on certain updates, it's enough to apply it on mongo.Collection which you use to perform the updates:
var client *mongo.Client // Initialize / connect client
wc := writeconcern.New(writeconcern.W(2))
c := client.Database("<your db name>").
    Collection("<your collection name>",
        options.Collection().SetWriteConcern(wc))
Now if you use this c collection to perform the writes (updates), the wc write concern will be used / applied, that is, each write operation will wait for acknowledgements of the writes propagated to 2 instances.
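For completeness, here is a minimal sketch of running a BulkWrite through that collection; the filter and update below are placeholder values, not something taken from the question:
// Assumes the usual imports: context, fmt, log, go.mongodb.org/mongo-driver/bson and mongo.
models := []mongo.WriteModel{
    mongo.NewUpdateOneModel().
        SetFilter(bson.M{"_id": 1}).
        SetUpdate(bson.M{"$set": bson.M{"status": "done"}}),
}
// Because c carries the wc write concern configured above, this call only
// returns once 2 instances have acknowledged the writes.
res, err := c.BulkWrite(context.TODO(), models)
if err != nil {
    log.Fatal(err)
}
fmt.Println("modified:", res.ModifiedCount)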
If you applied the wc write concern on the mongo.Client, then that would be the default write concern for everything you do with that client; if you applied it on a mongo.Database, then it would be the default for everything you do with that database. Of course, the default can be overridden, just as we applied wc on the c collection in the above example.

What is the purpose of Chubby Sequencers

While reading the Google article about Chubby, I didn't really understand the purpose of sequencers.
Assume we have 4 entities:
Chubby cell
Client 1
Client 2
The service we want to use and to which we will send the requests (for which we need the lock)
As far as I understood, the steps are:
Client 1 sends lock_request() to the Chubby cell; Chubby responds with a Sequencer (assume SequenceNumber = 1)
Client 1 sends the request modify_data() with the Sequencer (SequenceNumber = 1) to the Service
The Service asks the Chubby cell if the SequenceNumber is valid (= 1)
Chubby acknowledges it and sets the LeasePeriod (the lock expiration period) to (assume) 60 seconds
! during this time no one else is able to acquire the lock
After the acknowledgement, the Service caches the data about Client 1 (SequenceNumber = 1) for (assume) 40 seconds
Now:
if Client 2 tries to acquire the lock during these 60 seconds, it will be rejected by the Chubby cell
that means it is impossible for Client 2 to acquire the lock with the next SequenceNumber = 2 and send anything to the Service
As far as I understand, the whole purpose of the SequenceNumber is just for the situation when 2 requests come to the Service and the Service can simply compare the 2 SequenceNumbers and reject the lower one, without needing to ask the Chubby cell
but how will this situation ever happen if we have caches and Client 2 cannot acquire the lock while Client 1 is holding it?
It would be a mistake to think about timing in distributed systems in terms of actual times (like seconds), but I'll try to answer using the same semantics.
As you said, say Client 1 acquires a write lock named foo1, foo here being the lock name and 1 being the generation number.
Now say the lease period is 60 seconds. At the 58th second, Client 1 sends a write, say R1. And soon enough, Client 1 is dead.
Now, here's the catch. You assumed in your analysis that R1 would reach the server within those 2 seconds, before another client, say Client 2, becomes master. THAT'S JUST NOT CERTAIN.
In a distributed system, with network latencies of fractions of milliseconds on one hand and network partitions on the other, you just cannot be certain what reaches the master first: R1 or Client 2's request to become master.
This is where sequence numbers help. The master, now knowing that foo2 exists, can reject R1, which arrived with foo1 in its metadata.
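As a tiny, self-contained sketch of that check (the foo names and generation numbers follow the answer above; everything else is invented purely for illustration), the service only has to remember the highest generation it has seen per lock name and reject anything older:
package main

import "fmt"

// sequencer is the (lock name, generation) pair a client presents with a request.
type sequencer struct {
    lockName   string
    generation int
}

// service tracks the highest lock generation it has observed per lock name.
type service struct {
    latest map[string]int
}

// handleWrite rejects a request whose sequencer belongs to an older lock
// generation, without having to ask the Chubby cell again.
func (s *service) handleWrite(seq sequencer, payload string) error {
    if g, ok := s.latest[seq.lockName]; ok && seq.generation < g {
        return fmt.Errorf("stale sequencer %s%d (current is %s%d)",
            seq.lockName, seq.generation, seq.lockName, g)
    }
    s.latest[seq.lockName] = seq.generation
    fmt.Println("applied:", payload)
    return nil
}

func main() {
    svc := &service{latest: map[string]int{}}
    svc.handleWrite(sequencer{"foo", 2}, "write from client 2") // foo2 arrives first
    err := svc.handleWrite(sequencer{"foo", 1}, "delayed R1 from client 1")
    fmt.Println(err) // R1, carrying foo1, is rejected
}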
Read more about generational clocks/logical clocks here.
A logical clock is a mechanism for capturing chronological and causal relationships in a distributed system. Often, distributed systems may have no physically synchronous global clock. Fortunately, in many applications (such as distributed GNU make), if two processes never interact, the lack of synchronization is unobservable. Moreover, in these applications, it suffices for the processes to agree on the event ordering (i.e., logical clock) rather than the wall-clock time.[1]

How to create a simple nodes-to-sink communication pattern (multi-hop topology) in Castalia Simulator

I am facing some teething problems in Castalia Simulator, while creating a simple nodes-to-sink communication pattern.
I want to create a unidirectional topology, as described below:
node 0 <-------> node 1<----------->node 2<-------->node 3
Source = node 0
Relay nodes = node 1, node 2
Sink node = node 3
Here messages flow from left to right, so node 0 sends only to node 1, node 1 sends only to node 2, and node 2 sends only to node 3. When node 0 wants to send a data packet to node 3, node 1 and node 2 work as intermediate nodes (relay/forwarding nodes). Neighbor nodes can also send data in a unidirectional fashion (left to right), e.g. node 0 sends to node 1, node 1 sends to node 2, etc.
I read the manual and understand the ApplicationName = "ThroughputTest" application, but according to my understanding, there all nodes send data to the sink (node 0).
I added the following lines to the omnetpp.ini file:
SN.node[0].Application.nextRecipient = "1"
SN.node[1].Application.nextRecipient = "2"
SN.node[2].Application.nextRecipient = "3"
SN.node[3].Application.nextRecipient = "3"
But I am not getting my desired result.
Please help me with this.
Regards
Gulshan Soni
We really need more information to figure out what you have done.
The part of your omnetpp.ini file you copied here just shows that you are defining some static app-level routing using the app module ThroughputTest.
There are many other parts to a network. Firstly, the definition of the MAC plays a crucial role. For example, if you have chosen MAC 802.15.4 or BaselineBANMAC, you cannot have multihop routing, since there is only hub-to-slave communication. Furthermore, how you define the radio and the channel can also impact communication. For example, the signal might not be strong enough to reach from one node to another.
Read the Castalia User's Manual carefully, and provide enough information in your questions so that others can replicate your results.

How the Raft algorithm maintains strong read consistency in the case of a write failure followed by a node failure

Consider three nodes (A, B, C) storing key/value data, and suppose the following steps happen:
1. Node A receives key:value (1:15). It is the leader.
2. It replicates the entry to node B and node C.
3. The entry is written to node B's precommit log.
4. Node C fails the entry.
5. The ack from node B is lost.
6. Node A fails the entry and sends a failure to the client.
7. Node A is still the leader and B is not in the quorum.
8. The client reads key 1 from node A, and it returns the old value.
9. Node A goes down.
10. Node B and node C are up.
11. Now node B has the entry in its precommit log and node C doesn't.
How does log matching happen at this point? Is node B going to commit that entry or discard it? If it commits it, then the read was inconsistent; if it discards it, then there could be data loss in other cases.
The error is in step 8. Every read operation must be replicated to the other nodes, otherwise you risk getting stale data; the system should serve the read only after it writes a dummy value to the log. In your case (B is offline), the "read" must affect nodes A and C, so when node B comes back online and A dies, C will be able to invalidate B's record.
This is a tricky problem, and even etcd ran into it in the past (it's fixed now).
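Here is a minimal sketch of that idea in Go. It is not a real Raft implementation (replication and quorum acks are faked in appendAndCommit, and all names are invented for illustration); it only shows the shape of the rule: the leader commits a dummy (no-op) entry in its current term before answering a read, and refuses to answer if it cannot.
package main

import "fmt"

// entry is a single log entry; noop marks the dummy entries used for reads.
type entry struct {
    term  int
    key   string
    value string
    noop  bool
}

// leader is a toy stand-in for a Raft leader with an in-memory state machine.
type leader struct {
    term        int
    log         []entry
    commitIndex int               // index of the last committed (quorum-acked) entry
    kv          map[string]string // state machine, applied up to commitIndex
}

// appendAndCommit appends an entry and pretends a quorum acknowledged it.
// A real implementation would replicate to followers and return false if a
// quorum could not be reached (e.g. because this node has been deposed).
func (l *leader) appendAndCommit(e entry) bool {
    l.log = append(l.log, e)
    l.commitIndex = len(l.log) - 1
    if !e.noop {
        l.kv[e.key] = e.value
    }
    return true
}

// linearizableRead commits a dummy entry in the current term before answering,
// as described above; if the dummy cannot be committed, the node may no longer
// be the leader and must not serve the (possibly stale) read.
func (l *leader) linearizableRead(key string) (string, bool) {
    if !l.appendAndCommit(entry{term: l.term, noop: true}) {
        return "", false
    }
    v, ok := l.kv[key]
    return v, ok
}

func main() {
    l := &leader{term: 2, kv: map[string]string{}}
    l.appendAndCommit(entry{term: 2, key: "1", value: "15"})
    if v, ok := l.linearizableRead("1"); ok {
        fmt.Println("read:", v)
    }
}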

Concerns about ZooKeeper's lock recipe

While reading ZooKeeper's recipe for locks, I got confused. It seems that this recipe for distributed locks cannot guarantee that "at any snapshot in time no two clients think they hold the same lock". But since ZooKeeper is so widely adopted, if there were such a mistake in the reference documentation, someone should have pointed it out long ago, so what did I misunderstand?
Quoting the recipe for distributed locks:
Locks
Fully distributed locks that are globally synchronous, meaning at any snapshot in time no two clients think they hold the same lock. These can be implemented using ZooKeeper. As with priority queues, first define a lock node.
1. Call create( ) with a pathname of "locknode/guid-lock-" and the sequence and ephemeral flags set.
2. Call getChildren( ) on the lock node without setting the watch flag (this is important to avoid the herd effect).
3. If the pathname created in step 1 has the lowest sequence number suffix, the client has the lock and the client exits the protocol.
4. The client calls exists( ) with the watch flag set on the path in the lock directory with the next lowest sequence number.
5. If exists( ) returns false, go to step 2. Otherwise, wait for a notification for the pathname from the previous step before going to step 2.
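For reference, the steps above translate roughly to the following Go sketch, assuming the github.com/go-zookeeper/zk client (the /locknode path is a placeholder and error handling is minimal):
// Needs: import ("sort"; "strings"; "github.com/go-zookeeper/zk")
func acquireLock(conn *zk.Conn) (string, error) {
    // Step 1: create an ephemeral, sequential znode under the lock node.
    me, err := conn.Create("/locknode/guid-lock-", nil,
        zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
    if err != nil {
        return "", err
    }
    for {
        // Step 2: list the children without setting a watch (avoids the herd effect).
        children, _, err := conn.Children("/locknode")
        if err != nil {
            return "", err
        }
        sort.Strings(children)
        mine := strings.TrimPrefix(me, "/locknode/")
        idx := sort.SearchStrings(children, mine)
        // Step 3: lowest sequence number means we hold the lock.
        if idx == 0 {
            return me, nil
        }
        // Step 4: watch only the znode with the next lowest sequence number.
        prev := "/locknode/" + children[idx-1]
        exists, _, ch, err := conn.ExistsW(prev)
        if err != nil {
            return "", err
        }
        // Step 5: if it is already gone, loop back to step 2; otherwise wait.
        if exists {
            <-ch
        }
    }
}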
Consider the following case:
Client1 successfully acquired the lock (in step 3), with ZooKeeper node "locknode/guid-lock-0";
Client2 created node "locknode/guid-lock-1", failed to acquire the lock, and is now watching "locknode/guid-lock-0";
Later, for some reason (say, network congestion), Client1 fails to send a heartbeat message to the ZooKeeper cluster on time, but Client1 is still working away, mistakenly assuming that it still holds the lock.
But ZooKeeper may think Client1's session has timed out, and then
delete "locknode/guid-lock-0",
send a notification to Client2 (or maybe send the notification first?),
but cannot send a "session timeout" notification to Client1 in time (say, due to network congestion).
Client2 gets the notification, goes to step 2, finds the only node, "locknode/guid-lock-1", which it created itself; thus, Client2 assumes it holds the lock.
But at the same time, Client1 assumes it holds the lock.
Is this a valid scenario?
The scenario you describe could arise. Client 1 thinks it has the lock, but in fact its session has timed out, and Client 2 acquires the lock.
The ZooKeeper client library will inform Client 1 that its connection has been disconnected (but the client doesn't know the session has expired until it reconnects to a server), so the client can write some code and assume that its lock has been lost if it has been disconnected for too long. But the thread which uses the lock needs to check periodically that the lock is still valid, which is inherently racy.
...But ZooKeeper may think Client1's session has timed out, and then...
From the ZooKeeper documentation:
The removal of a node will only cause one client to wake up since each node is watched by exactly one client. In this way, you avoid the herd effect.
There is no polling or timeouts.
So I don't think the problem you describe arises. It looks to me as though there could be a risk of hanging locks if something happens to the clients that create them, but the scenario you describe should not arise.
From the Packt book ZooKeeper Essentials:
If there was a partial failure in the creation of the znode due to connection loss, it's possible that the client won't be able to correctly determine whether it successfully created the child znode. To resolve such a situation, the client can store its session ID in the znode data field or even as a part of the znode name itself. As a client retains the same session ID after a reconnect, it can easily determine whether the child znode was created by it by looking at the session ID.
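A rough sketch of that technique, again assuming the github.com/go-zookeeper/zk client (the /locknode path and the way the session ID is obtained are illustrative only):
// sessionID is whatever identifier the client stored for its own session.
func createOrRecover(conn *zk.Conn, sessionID string) (string, error) {
    name, err := conn.Create("/locknode/guid-lock-", []byte(sessionID),
        zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
    if err == nil {
        return name, nil
    }
    // After a connection loss we cannot tell whether the create succeeded,
    // so scan the children for a znode carrying our session ID.
    children, _, cerr := conn.Children("/locknode")
    if cerr != nil {
        return "", cerr
    }
    for _, child := range children {
        data, _, gerr := conn.Get("/locknode/" + child)
        if gerr == nil && string(data) == sessionID {
            return "/locknode/" + child, nil // the earlier create did succeed
        }
    }
    return "", err // it really failed; the caller can retry
}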