Cluster Becomes Unresponsive When One Node Goes OOM - distributed-computing

We have created a cluster with three nodes using Hazelcast 3.4.2 and I'm having the
following issue.
If one node goes OOM, the other nodes become unresponsive. Sometimes those nodes
(except the one that went OOM) manage to recover; however, the recovery time is not predictable.
Also, we added the following two Hazelcast properties as JVM parameters. However, the issue still persists in the cluster.
hazelcast.client.heartbeat.timeout
hazelcast.max.no.heartbeat.seconds
Please note that the cluster was started several times with a few different values for the above two Hazelcast properties.
So I would like to know whether this is a known issue or not. Also, if the above scenario
is a known issue, is there a workaround for it?
Thanks

Do your members have enough headroom? When one member goes down, the same amount of data has to be distributed among fewer members, which could cause memory pressure on them. I'd recommend enabling verbose GC logging and testing your scenario.
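To make the headroom point concrete, here is a rough back-of-the-envelope sketch. It assumes data is spread evenly and that each entry has one backup copy; the function name and the 6 GB figure are made up for illustration, and real partition distribution will be less tidy.

```python
def per_member_load(total_data_gb: float, members: int, backup_count: int = 1) -> float:
    """Approximate data held per member: primaries plus backups, spread evenly.

    Assumption: perfectly even distribution, which real clusters only approximate.
    """
    if members < 1:
        raise ValueError("need at least one member")
    # A backup lives on a different member than its primary, so the number of
    # copies cannot exceed the number of members.
    copies = min(1 + backup_count, members)
    return total_data_gb * copies / members


before = per_member_load(total_data_gb=6.0, members=3)
after = per_member_load(total_data_gb=6.0, members=2)
print(f"per member before: {before:.1f} GB, after one member is lost: {after:.1f} GB")
print(f"increase: {(after / before - 1) * 100:.0f}%")
```

Under these assumptions, going from three members to two raises each survivor's share by half, so a heap that was comfortably sized before may no longer be.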

Related

Proper Fault-tolerant/HA setup for KeyDB/Redis in Kubernetes

Sorry for a long post, but I hope it would relieve us from some of clarifying questions. I also added some diagrams to split the wall of text, hope you'll like those.
We are in the process of moving our current solution to local Kubernetes infrastructure, and the thing we are currently investigating is the proper way to set up a KV-store (we've been using Redis for this) in K8s.
One of the main use-cases for the store is providing processes with exclusive ownership of resources via a simple version of the Distributed lock pattern, as in the (discouraged) pattern here. (More on why we are not using Redlock below.)
And once again, we are looking for a way to set it up in K8s so that the details of the HA setup are opaque to clients. Ideally, the setup would look like this:
So what is the proper way to set up Redis for this? Here are the options that we considered:
First of all, we discarded Redis Cluster, because we don't need sharding of the keyspace. Our keyspace is rather small.
Next, we discarded the Redis Sentinel setup, because with Sentinel, clients are expected to be able to connect to the chosen Redis node, so we would have to expose all nodes. We would also have to provide some identity for each node (like distinct ports, etc.), which contradicts the idea of a K8s Service. Even worse, we'd have to check that all (heterogeneous) clients support the Sentinel protocol and properly implement all that fiddling.
Somewhere around here we ran out of options for the first time. We thought about using regular Redis replication, but without Sentinel it's unclear how to set things up for fault tolerance in case of master failure: there seems to be no auto-promotion for replicas, and no (easy) way to tell K8s that the master has changed, except maybe by inventing a custom K8s operator, but we are not that desperate (yet).
So we came to the idea that Redis may not be very cloud-friendly, and started looking for alternatives. That's how we found KeyDB, which has promising additional modes, on top of an impressive performance boost and a 100% compatible API. Very impressive!
So here are the options that we considered with KeyDB:
Active replication with just two nodes. This would look like this:
This setup looks very promising at first — simple, clear, and even official KeyDB docs recommend this as a preferred HA setup, superior to Sentinel setup.
But there's a caveat. While the docs advocate this setup as tolerant to split-brains (because the nodes would catch up with one another after connectivity is re-established), this would ruin our use-case, because two clients would be able to lock the same resource id:
And there's no way to tell K8s that one node is OK, and another is unhealthy, because both nodes have lost their replicas.
Well, it's clear that it's impossible to make an even-node setup split-brain-tolerant, so the next thing we considered was KeyDB 3-node multi-master, which allows each node to be an (active) replica of multiple masters:
OK, things got more complicated, but it seems that the setup is split-brain proof:
Note that we had to add more stuff here:
health check — to consider a node that lost all its replicas as unhealthy, so K8s load balancer would not route new clients to this node
WAIT 1 command for SET/EXPIRE — to ensure that we are writing to a healthy split (preventing case when client connects to unhealthy node before load balancer learns it's ill).
And this is when a sudden thought struck: what about consistency? Both these setups with multiple writable nodes provide no guard against two clients both locking the same key on different nodes!
Redis and KeyDB both have asynchronous replication, so there seems to be no guarantee that if an (exclusive) SET succeeds as a command, it will not get overwritten by another SET with the same key issued on another master a split-second later.
Adding WAITs does not help here, because it only covers spreading information from master to replicas, and seems to have no effect on these overlapping waves of overwrites spreading from multiple masters.
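To illustrate the race I mean, here is a toy model in plain Python. It is not a Redis or KeyDB client: the Master class, its methods, and the key names are all invented for this sketch, and the replication is deliberately simplified to "apply the peer's writes later, unconditionally".

```python
class Master:
    """Toy stand-in for one writable node; NOT a real Redis/KeyDB client."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.outbox = []  # writes accepted locally but not yet sent to the peer

    def set_nx(self, key, value):
        """Mimic SET key value NX: succeed only if the key is absent locally."""
        if key in self.data:
            return False
        self.data[key] = value
        self.outbox.append((key, value))
        return True

    def replicate_to(self, peer):
        """Asynchronous replication: the peer applies our writes later."""
        for key, value in self.outbox:
            peer.data[key] = value
        self.outbox.clear()


a, b = Master("a"), Master("b")

# Two clients take the same lock on different masters before replication runs.
got_a = a.set_nx("lock:resource-1", "client-1")
got_b = b.set_nx("lock:resource-1", "client-2")

# Replication only catches up afterwards, when both clients already
# believe they own the lock.
a.replicate_to(b)
b.replicate_to(a)

print(got_a, got_b)
```

Both acquisitions report success in this model, which is exactly the situation a lock must prevent.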
Okay now, this is actually the Distributed Lock problem, and both Redis and KeyDB provide the same answer: use the Redlock algorithm. But it seems to be rather too complex:
It requires the client to communicate with multiple nodes explicitly (and we'd like to not do that).
These nodes have to be independent, which is rather bad, because we are using Redis/KeyDB not only for this locking case, and we'd still like to have a reasonably fault-tolerant setup, not 5 separate nodes.
So, what options do we have? Both Redlock explanations start from a single-node version, which is OK if the node never dies and is always available. That's surely not the case, but we are willing to accept the problems explained in the section "Why failover-based implementations are not enough", because we believe failovers would be quite rare, and we think that we fall under this clause:
Sometimes it is perfectly fine that under special circumstances, like during a failure, multiple clients can hold the lock at the same time. If this is the case, you can use your replication based solution.
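For reference, the single-node pattern we rely on looks roughly like the sketch below. To keep it self-contained it runs against an in-memory stand-in rather than a real server: FakeStore, acquire, and release are hypothetical names, set_nx_px imitates SET key value NX PX ttl, and delete_if_equals imitates the compare-and-delete step that a real Redis/KeyDB node would need a script for.

```python
import time
import uuid


class FakeStore:
    """In-memory stand-in for a single Redis/KeyDB node (hypothetical, for illustration)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key):
        """Return the entry for key, dropping it first if its TTL has passed."""
        entry = self._data.get(key)
        if entry and entry[1] <= self._clock():
            del self._data[key]
            entry = None
        return entry

    def set_nx_px(self, key, value, ttl_ms):
        """Mimic SET key value NX PX ttl_ms."""
        if self._live(key):
            return False
        self._data[key] = (value, self._clock() + ttl_ms / 1000.0)
        return True

    def delete_if_equals(self, key, value):
        """Delete the key only if it still holds the given value."""
        entry = self._live(key)
        if entry and entry[0] == value:
            del self._data[key]
            return True
        return False


def acquire(store, resource, ttl_ms=30_000):
    """Return a random token if the lock was taken, else None."""
    token = uuid.uuid4().hex
    return token if store.set_nx_px(f"lock:{resource}", token, ttl_ms) else None


def release(store, resource, token):
    """Release only if we still own the lock (the token matches)."""
    return store.delete_if_equals(f"lock:{resource}", token)


store = FakeStore()
t1 = acquire(store, "resource-1")
t2 = acquire(store, "resource-1")  # a second client is refused
print(t1 is not None, t2 is None)
print(release(store, "resource-1", "wrong-token"), release(store, "resource-1", t1))
```

The random token is what stops a client from releasing a lock that has already expired and been taken by someone else. None of this protects against a failover losing the key, which is the risk we said we accept.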
So, having said all of this, let me finally get to the question: how do I set up a fault-tolerant "replication-based solution" of KeyDB to work in Kubernetes, with a single write node most of the time?
If it's a regular 'single master, multiple replicas' setup (without 'auto'), what mechanism would ensure that a replica is promoted in case of master failure, and what mechanism would tell Kubernetes that the master node has changed? And how? By re-assigning labels on pods?
Also, what would restore a previously dead master node in such a way that it would not become a master again, but a replica of a substitute master?
Do we need some K8s operator for this? (Those that I found were not smart enough to do this).
Or if it's multi-master active replication from KeyDB (like in my last picture above), I'd still need to use something instead of a load-balanced K8s Service to route all clients to a single node at a time, and then again to use some mechanism to switch this 'actual master' role in case of failure.
And this is where I'd like to ask for your help!
I've found frustratingly little info on the topic. And it does not seem that many people have such problems that we face. What are we doing wrong? How do you cope with Redis in the cloud?

ActiveMQ Artemis cluster failover questions

I have a question in regards to Apache Artemis clustering with message grouping. This is also done in Kubernetes.
The current setup I have is 4 master nodes and 1 slave node. Node 0 is dedicated as LOCAL to handle message grouping and node 1 is the dedicated backup to node 0. Nodes 2-4 are REMOTE master nodes without backup nodes.
I've noticed that clients connected to nodes 2-4 are not failing over to the 3 other master nodes available when the connected Artemis node goes down, essentially not discovering the other nodes. Even after the original node comes back up, the client continues to fail to establish a connection. I've seen from a separate Stack Overflow post that master-to-master failover is not supported. Does this mean that for every master node I need to create a slave node as well to handle the failover? Would this cause a two instance point of failure instead of however many nodes are within the cluster?
In a separate basic test using a cluster of two nodes with one master and one slave, I've observed that when I bring down the master node that clients are connected to, the client doesn't fail over to the slave node. Any ideas why?
As you note in your question, failover is only supported between a live and a backup. Therefore, if you wanted failover for clients which were connected to nodes 2-4 then those nodes would need backups. This is described in more detail in the ActiveMQ Artemis documentation.
It's worth noting that clustering and message grouping, while technically possible, is a somewhat odd pairing. Clustering is a way to improve overall message throughput using horizontal scaling. However, message grouping naturally serializes message consumption for each group (to maintain message order) which then decreases overall message throughput (perhaps severely depending on the use-case). A single ActiveMQ Artemis node can potentially handle millions of messages per second. It may be that you don't need the increased message throughput of a cluster since you're grouping messages.
I've often seen users simply assume they need a cluster to deal with their expected load without actually conducting any performance benchmarking. This can potentially lead to higher costs for development, testing, administration, and (especially) hardware, and in some use-cases it can actually yield worse performance. Please ensure you've thoroughly benchmarked your application and broker architecture to confirm the proposed design.

Apache Spark Auto Scaling properties - Add Worker on the Fly

During the execution of a Spark Program, let's say,
reading 10GB of data into memory, and just doing a filtering, a map, and then saving in another storage.
Can I auto-scale the cluster based on the load, and for instance add more Worker Nodes to the Program, if this program eventually needs to handle 1TB instead of 10GB?
If this is possible, how can it be done?
It is possible to some extent, using dynamic allocation, but the behavior depends on job latency, not on direct usage of a particular resource.
You have to remember that in general, Spark can handle data larger than memory just fine, and memory problems are usually caused by user mistakes or vicious garbage collection cycles. Neither of these can be easily solved by "adding more resources".
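As a rough sketch of what enabling dynamic allocation involves, the snippet below builds a spark-submit command line from a set of properties. The property names are standard Spark settings, but the values, the helper name spark_submit_args, and my_job.py are placeholders chosen for illustration; whether the external shuffle service is needed depends on your Spark version and cluster manager.

```python
def spark_submit_args(app, conf):
    """Render a spark-submit command line from a dict of Spark properties."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app]


# Illustrative values only; tune them for your own workload.
dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "50",
    # Executors are requested once tasks have been waiting this long.
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
    # Executors idle for this long are released.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}

print(" ".join(spark_submit_args("my_job.py", dynamic_allocation)))
```

The backlog timeout is the reason scaling follows pending-task latency rather than memory or CPU usage: executors are only added when tasks have been queued for longer than that interval.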
If you are using any of the cloud platforms to create the cluster, you can use their auto-scaling functionality, which will scale the cluster horizontally (the number of nodes will change).
Agree with #user8889543 - you can read much more data than fits in your memory.
As for adding more resources on the fly, it depends on your cluster type.
I use standalone mode, and I have code that adds machines on the fly; they attach to the master automatically, and then my cluster has more cores and memory.
If you only have one job/program in the cluster then it is pretty simple. Just set
spark.cores.max
to a very high number and the job will always take all the cores of the cluster. see
If you have several jobs in the cluster it becomes complicated, as mentioned in #user8889543's answer.

Rebalance data after adding nodes

I'm using Cassandra 2.0.4 (with vnodes) and 2 days ago I added 2 nodes (.210 and .195.) I expected Cassandra to redistribute the existing data automatically, but today I still find this nodetool status
Issuing a nodetool repair on any of the nodes doesn't do anything either (the repair finishes within seconds.) The logs state that the repair is being executed as expected, but after preparing the repair plan it pretty much instantly finishes executing said plan.
Was I wrong to assume the existing data would be redistributed at all, or is something wrong? And if that isn't the case; how do I manually 'rebalance' the data?
Worth noting: I seem to have lost some data after adding these new nodes. Issuing a select on certain keys only returns data from the last couple of days rather than weeks. This makes me think the data is saved on .92 while Cassandra queries for it on one of the new servers. But that's really just an uneducated guess; I may have simply broken something during all of my trial & error tests, meaning the data is actually gone (even though I don't issue deletes, ever.)
Can anyone enlighten me?
There is currently no manual rebalance option for vnode-enabled clusters.
But your cluster doesn't look unbalanced based on the nodetool status output you show. I'm curious as to why node .88 has only 64 tokens compared to the others but that isn't a problem per se. When a cluster is smaller there will be a slight variance in the balance of data across the nodes.
As for the data issues, you can try running nodetool repair -pr around the nodes in the ring and then nodetool cleanup and see if that helps.
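A small sketch of that sequence, in case it helps: the snippet only builds and prints the commands, it does not run them. The host names are placeholders and the helper name is made up; substitute the addresses of your own nodes.

```python
def ring_maintenance_commands(hosts):
    """Primary-range repair on every node first, then cleanup on every node."""
    commands = [["nodetool", "-h", host, "repair", "-pr"] for host in hosts]
    commands += [["nodetool", "-h", host, "cleanup"] for host in hosts]
    return commands


# Placeholder host names, not the nodes from the question.
for command in ring_maintenance_commands(["node-a", "node-b", "node-c"]):
    print(" ".join(command))
```

Running the repair with -pr on each node in turn covers the whole ring once, since each node repairs only the ranges it is primary for.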

Why ZooKeeper needs majority to run?

I've been wondering why ZooKeeper needs a majority of the machines in the ensemble to work at all. Let's say we have a very simple ensemble of 3 machines - A, B, C.
When A fails, a new leader is elected - fine, everything works. When another one dies, let's say B, the service is unavailable. Does it make sense? Why can't machine C handle everything alone until A and B are up again?
Since one machine is enough to do all the work (for example single machine ensemble works fine)...
Is there any particular reason why ZooKeeper is designed in this way? Is there a way to configure ZooKeeper so that, for example, the ensemble is always available when at least one of N is up?
Edit:
Maybe there is a way to apply a custom algorithm of leader selection? Or define a size of quorum?
Thanks in advance.
Zookeeper is intended to distribute things reliably. If the network of systems becomes segmented, then you don't want the two halves operating independently and potentially getting out of sync, because when the failure is resolved, it won't know what to do. If you have it refuse to operate when it's got less than a majority, then you can be assured that when a failure is resolved, everything will come right back up without further intervention.
The reason to get a majority vote is to avoid a problem called "split-brain".
Basically, in a network failure you don't want the two parts of the system to continue as usual. You want one to continue and the other to understand that it is not part of the cluster.
There are two main ways to achieve that. One is to hold a shared resource, for instance a shared disk where the leader holds a lock: if you can see the lock you are part of the cluster; if you can't, you're out. If you are holding the lock you're the leader, and if you aren't, you're not. The problem with this approach is that you need that shared resource.
The other way to prevent a split-brain is a majority count: if you get enough votes you are the leader. This still works with two nodes (for an ensemble of 3), where the leader says it is the leader and the other node, acting as a "witness", also agrees. This method is preferable as it can work in a shared-nothing architecture, and indeed that is what ZooKeeper uses.
As Michael mentioned, a node cannot know if the reason it doesn't see the other nodes in the cluster is because these nodes are down or there's a network problem - the safe bet is to say there's no quorum.
Let’s look at an example that shows how things can go wrong if the quorum (majority of running servers) is too small.
Say we have five servers and a quorum can be any set of two servers. Now say that servers s1 and s2 acknowledge that they have replicated a request to create a znode /z. The service returns to the client saying that the znode has been created. Now suppose servers s1 and s2 are partitioned away from the other servers and from clients for an arbitrarily long time, before they have a chance to replicate the new znode to the other servers. The service in this state is able to make progress because there are three servers available and it really needs only two according to our assumptions, but these three servers have never seen the new znode /z. Consequently, the request to create /z is not durable.
This is an example of the split-brain scenario. To avoid this problem, in this example the size of the quorum must be at least three, which is a majority out of the five servers in the ensemble. To make progress, the ensemble needs at least three servers available. To confirm that a request to update the state has completed successfully, this ensemble also requires that at least three servers acknowledge that they have replicated it.
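The argument above comes down to whether any two quorums must share a server. The short check below enumerates every pair of quorums of a given size; the function names are made up for this sketch.

```python
from itertools import combinations


def quorums_always_intersect(n_servers, quorum_size):
    """True if every pair of quorums of this size shares at least one server."""
    quorums = combinations(range(n_servers), quorum_size)
    return all(set(a) & set(b) for a, b in combinations(quorums, 2))


def majority(n_servers):
    """Smallest quorum size that is a strict majority of the ensemble."""
    return n_servers // 2 + 1


print(quorums_always_intersect(5, 2))  # False: {s1, s2} and {s3, s4} are disjoint
print(quorums_always_intersect(5, 3))  # True: any two sets of three out of five overlap
print(majority(3), majority(5))        # 2 3
```

With quorums of two out of five, the servers that acknowledged /z and the servers that later make progress can be entirely different sets, which is the lost-update case described above. With majority quorums, at least one server in any later quorum has seen the write.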