How to identify added and modified children after reconnecting to Zookeeper? - apache-zookeeper

We use Zookeeper to coordinate task execution among our clustered servers. One of our customers have a very instable network and our servers keep disconnecting and reconnecting to Zookeeper.
The problem is that while being disconnected, our servers will miss the events that occurred and won't handle them even after re-connecting to Zookeeper again.
Is there a recommened\standard method to handle such situations using Zookeeper and Apache Curator ?
How to identify the current epoch time at Zookeeper ?
My proposal so far is:
We keep track of the last time we were connected to Zookeeper. That's right before we get disconnected.
On re-connecting again, we ask the listener to clearAndRefresh which fires CHILD_ADDED events for all child nodes for monitored path.
On handling these CHILD_ADDED events, we only handled those for paths that were created or modified after the last time we were connected to Zookeeper.

I don't think using timestamp will be a good idea. Instead, you can use Curator's inbuilt:
TreeCache if you want to watch an entire tree
PathChildrenCache if you want to watch only a sub directory.
It doesn't matter which one you use, both support listening to ChildAdded and DataChanged events which will do exactly what you need. When you reconnect after been disconnected, Curator will internally evaluate newly added children and compare data of existing children to determine changes. No pressure on you. You only need to use the listeners provided.
In terms of accuracy TreeCache is not guaranteeing 100% accuracy. So, you it is better if you can re-design you approach to use PathChildrenCache instead.

Related

Kubernetes - How do I prevent duplication of work when there are multiple replicas of a service with a watcher?

I'm trying to build an event exporter as a toy project. It has a watcher that gets informed by the Kubernetes API every time an event, and as a simple case, let's assume that it wants to store the event in a database or something.
Having just one running instance is probably susceptible to failures, so ideally I'd like two. In this situation, the naive implementation would both instances trying to store the event in the database so it'd be duplicated.
What strategies are there to de-duplicate? Do I have to do it at the database level (say, by using some sort of eventId or hash of the event content) and accept the extra database load or is there a way to de-duplicate at the instance level, maybe built into the Kubernetes client code? Or do I need to implement some sort of leader election?
I assume this is a pretty common problem. Is there a more general term for this issue that I can search on to learn more?
I looked at the code for GKE event exporter as a reference but I was unable to find any de-duplication, so I assume that it happens on the receiving end.
You should use both leader election and de-duplication at your watcher level. Only one of them won't be enough.
Why need leader election?
If high availability is your main concern, you should have leader election between the watcher instances. Only the leader pod will write the event to the database. If you don't use leader election, the instances will race with each other to write into the database.
You may check if the event has been already written in the database and then write it. However, you can not guarantee that other instances won't write into the database between when you checked and when you write the event. In that case, database level lock / transaction might help.
Why need de-duplication?
Only leader election will not save you. You also need to implement de-duplication. If your leader pod restart, it will resync all the existing events. So, you should have a check whether to process the event or not.
Furthermore, if a failover happen, how you know from the new leader about which events were successfully exported by previous leader?

How to handle out of order Zookeeper notifications?

I have multiple processes operating on an in-memory queue. That queue is a manifestation of sequential znodes created/deleted at Zookeeper.
When a znode is added, an equivalent item is added to the queue at all the involved processes. And also when a znode is removed, the equivalent item is removed from the queue at every involved process.
The addition and removal signals are expected to be balanced because every added item should eventually be removed.
I faced a situation when a znode was added and removed very quickly and the removal notification was received at one of the processes before the addition notificaiton. So an attempt to remove that item occurred but failed because it wasn't actually there, and then the addition signal was received which added the item but then it was never removed.
A simple solution would be to assert the existence of the equivalent znode after adding the item to the queue and that's good enough for me now but it doesn't seem as efficient as it can get.
My question is if there is a way to handle this scenario in a more efficient or "zookeeper way"?
You're trying to use ZooKeeper as a message queue which is not designed for. There's no ordering neither delivery guarantee in ZooKeeper for watcher notifications.
Instead you should use some messaging system like Kafka or RabbitMQ for this use case.

Can "observer" nodes in zookeeper respond with stale results?

This question is in reference to https://zookeeper.apache.org/doc/trunk/zookeeperObservers.html
Observers are non-voting members of an ensemble which only hear the
results of votes, not the agreement protocol that leads up to them.
Other than this simple distinction, Observers function exactly the
same as Followers - clients may connect to them and send read and
write requests to them. Observers forward these requests to the Leader
like Followers do, but they then simply wait to hear the result of the
vote. Because of this, we can increase the number of Observers as much
as we like without harming the performance of votes.
Observers have other advantages. Because they do not vote, they are
not a critical part of the ZooKeeper ensemble. Therefore they can
fail, or be disconnected from the cluster, without harming the
availability of the ZooKeeper service. The benefit to the user is that
Observers may connect over less reliable network links than Followers.
In fact, Observers may be used to talk to a ZooKeeper server from
another data center. Clients of the Observer will see fast reads, as
all reads are served locally, and writes result in minimal network
traffic as the number of messages required in the absence of the vote
protocol is smaller.
1) non-voting members of an ensemble - What do the voting members vote on?
2) How does an update request work for observers - When a ZK leader gets an update request, it requires a quorum of nodes to respond. Observer nodes seems like is not considered a quorum node. Does that mean an observer node lags behind the leader node for updates? If that is true, how does it ensure that observer nodes do not respond with stale data during reads?
3) Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller - Reads from all the other nodes will also be local only because they are in-sync with the leader, no? And I did not get the part about writes.
These questions should be good to understanding zookeeper and distributed systems in general. Appreciate a good detailed answer for these. Thanks in advance !
1) non-voting members of an ensemble - What do the voting members vote on?
Typical members of the ensemble (not observers) vote on success/failure of proposed changes coordinated by the leader. There is some further discussion of the details in the paper ZooKeeper: Wait-free coordination for Internet-scale systems.
2) How does an update request work for observers - When a ZK leader gets an update request, it requires a quorum of nodes to respond. Observer nodes seems like is not considered a quorum node. Does that mean an observer node lags behind the leader node for updates? If that is true, how does it ensure that observer nodes do not respond with stale data during reads?
You are correct that observer nodes are not considered necessary participants in the quorum. In general, update lag will be subject to network latency between the observer and the leader. (Whether or not this is noticeable is subject to specific external factors, such as whether or not the observer and leader are in the same data center with a low-latency network link.)
Note that even without use of observers, there is no guarantee that every server in the ensemble is always completely up to date. The Apache ZooKeeper documentation on Consistency Guarantees contains this disclaimer:
Sometimes developers mistakenly assume one other guarantee that ZooKeeper does not in fact make. This is:
Simultaneously Consistent Cross-Client Views ZooKeeper does not
guarantee that at every instance in time, two different clients will
have identical views of ZooKeeper data. Due to factors like network
delays, one client may perform an update before another client gets
notified of the change. Consider the scenario of two clients, A and B.
If client A sets the value of a znode /a from 0 to 1, then tells
client B to read /a, client B may read the old value of 0, depending
on which server it is connected to. If it is important that Client A
and Client B read the same value, Client B should should call the
sync() method from the ZooKeeper API method before it performs its
read.
However, clients of ZooKeeper will never appear to "go back in time" by reading stale data from a point in time prior to the data they already read. This is accomplished by attaching a monotonically increasing transaction ID (called "zxid") to each ZooKeeper transaction. When the ZooKeeper client interacts with a server, it compares the client's last seen zxid to the current zxid of the server. If the server is behind the client, then it will not allow the client's next read to be processed by that server.
3) Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller - Reads from all the other nodes will also be local only because they are in-sync with the leader, no? And I did not get the part about writes.
It's important to note that this statement from the documentation is written in the context of an important use-case for observers: multiple data center deployments with higher network latency between different data centers. In this statement, "served locally" means served from a ZooKeeper server within the same data center as the client, so that it doesn't suffer from the longer latency of connecting to another data center. For full context, here is a copy of the full quote:
In fact, Observers may be used to talk to a ZooKeeper server from another data center. Clients of the Observer will see fast reads, as all reads are served locally, and writes result in minimal network traffic as the number of messages required in the absence of the vote protocol is smaller.

How are out-of-order and wait-free writes handled?

As stated in Guarantees:
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Let's assume a client makes 2 updates (update1 and update2) in a very short time window (I understand zookeeper is good at read-domination applications). So my questions are:
Is that possible update2 is received before update1, therefore for zookeeper update1 has later stamp than that of update2? I assume yes due to network connection nature. If this the case that means client will lose its update2 and will have update1. Is there anyway zookeeper can ACK back the client with different stamp or whatever other data that let the client to determine if update2 is really received after update1. Basically zookeeper tells what it sees from server side to client, which gives client some info to act if that's not what the client wants.
What if there is a leader failure after receiving and confirming update1 and before receiving update2? I assume such writes are persisted somewhere in disk/DB etc. When the new leader comes back will it catch up first, meaning conduct update1, before confirming update2 back to client?
Just curious, since zookeeper claims it supports wait-free writing, does that mean there is a message queue built inside zookeeper to hold incoming writes? Otherwise if the leader has to make sure the update is populated to all other followers, the client is actually being blocked by during this replication process. I am guessing that's part of reason zookeeper does not support heavy write application.
For the first two questions, I think you can find details in Zookeeper's paper.
It's quite normal that different operations from the same client arrive in disorder to Zookeeper node. But Zookeeper use TCP to ensure that sequential network package will be receive orderly.
Leader must write operations in Write-Ahead-Log before it can confirm operations. The problems will diverge in two dimensions. The first situation we should consider is whether the leader could recover before followers realize leader failure. If yes, nothing bad will happen, all operations in failure time will lost, and client will resend the operations. If not, then we should consider whether the Leader has proposed a proposal before it fails. If it fails before proposing a proposal, then client will know the failure. If it has proposed a proposal, there must be at least one node in the cluster which has got the newest transactions. Then it will be the new Leader in next rolling. When the original Leader recovers from failure, it will realize he's no longer the leader(All transactions of Zookeeper contains a 64-bits transaction id, of which the higher 32 bits represent epoch, and the lower 32 bits represents proposal id). It will communicate with new Leader and then get updated(Sometimes it need truncate it's local transaction log first).
I don't know the details since I haven't read ZooKeeper's source code. But Leader only needs over half acknowledge from followers before it response to clients. Zookeeper provide both blocking and non-blocking API and you can choose what you like.

Concerns about zookeeper's lock-recipe

While reading the ZooKeeper's recipe for lock, I got confused. It seems that this recipe for distributed locks can not guarantee "any snapshot in time no two clients think they hold the same lock". But since ZooKeeper is so widely adopted, if there were such mistakes in the reference documentation, someone should have pointed it out long ago, so what did I misunderstand?
Quoting the recipe for distributed locks:
Locks
Fully distributed locks that are globally synchronous, meaning at any snapshot in time no two clients think they hold the same lock. These can be implemented using ZooKeeeper. As with priority queues, first define a lock node.
Call create( ) with a pathname of "locknode/guid-lock-" and the sequence and ephemeral flags set.
Call getChildren( ) on the lock node without setting the watch flag (this is important to avoid the herd effect).
If the pathname created in step 1 has the lowest sequence number suffix, the client has the lock and the client exits the protocol.
The client calls exists( ) with the watch flag set on the path in the lock directory with the next lowest sequence number.
if exists( ) returns false, go to step 2. Otherwise, wait for a notification for the pathname from the previous step before going to step 2.
Consider the following case:
Client1 successfully acquired the lock (in step 3), with ZooKeeper node "locknode/guid-lock-0";
Client2 created node "locknode/guid-lock-1", failed to acquire the lock, and is now watching "locknode/guid-lock-0";
Later, for some reason (say, network congestion), Client1 fails to send a heartbeat message to the ZooKeeper cluster on time, but Client1 is still working away, mistakenly assuming that it still holds the lock.
But, ZooKeeper may think Client1's session is timed out, and then
delete "locknode/guid-lock-0",
send a notification to Client2 (or maybe send the notification first?),
but can not send a "session timeout" notification to Client1 in time (say, due to network congestion).
Client2 gets the notification, goes to step 2, gets the only node ""locknode/guid-lock-1", which it created itself; thus, Client2 assumes it hold the lock.
But at the same time, Client1 assumes it holds the lock.
Is this a valid scenario?
The scenario you describe could arise. Client 1 thinks it has the lock, but in fact its session has timed out, and Client 2 acquires the lock.
The ZooKeeper client library will inform Client 1 that its connection has been disconnected (but the client doesn't know the session has expired until the client connects to the server), so the client can write some code and assume that his lock has been lost if he has been disconnected too long. But the thread which uses the lock needs to check periodically that the lock is still valid, which is inherently racy.
...But, Zookeeper may think client1's session is timeouted, and then...
From the Zookeeper documentation:
The removal of a node will only cause one client to wake up since
each node is watched by exactly one client. In this way, you avoid
the herd effect.
There is no polling or timeouts.
So I don't think the problem you describe arises. It looks to me as thought there could be a risk of hanging locks if something happens to the clients that create them, but the scenario you describe should not arise.
from packt book - Zookeeper Essentials
If there was a partial failure in the creation of znode due to connection loss, it's
possible that the client won't be able to correctly determine whether it successfully
created the child znode. To resolve such a situation, the client can store its session ID
in the znode data field or even as a part of the znode name itself. As a client retains
the same session ID after a reconnect, it can easily determine whether the child znode
was created by it by looking at the session ID.