Vert.x unfair verticle redeployment after node crash

I've recently been doing some experiments on the behavior of Vert.x and verticles in HA mode. I observed some weaknesses in how Vert.x dispatches the load across the nodes of a cluster.
1. One node in a cluster crashes
Imagine a cluster of several Vert.x nodes (say 4, 5, 10, whatever), each hosting some hundreds or thousands of verticles. If one node crashes, only one of the remaining nodes restarts all the verticles that had been deployed on the crashed node. Moreover, there is no guarantee that it will be the node with the smallest number of deployed verticles. This is unfair and, in the worst case, the same node could end up with all the verticles from every node that crashed before it, probably leading to a domino crash scenario.
2. Adding a node to a heavily loaded cluster
Adding a node to a heavily loaded cluster doesn't help to reduce the load on the other nodes. Existing verticles are not redistributed to the new node, and new verticles are created on the node that invokes vertx.deployVerticle().
While the first point provides high availability within some limits, the second point breaks the promise of simple horizontal scalability.
I may very possibly be wrong: I may have misunderstood something, or my configurations may be faulty. This question is about confirming this behavior, getting your advice on how to cope with it, or pointing out my errors. Thanks in advance for your feedback.
This is how I create my vertx object:
VertxOptions opts = new VertxOptions()
        .setHAEnabled(true);

// start vertx in cluster mode
Vertx.clusteredVertx(opts, vx_ar -> {
    if (vx_ar.failed()) {
        ...
    }
    else {
        vertx = vx_ar.result();
        ...
    }
});
and this is how I create my verticles:
DeploymentOptions depOpt = new DeploymentOptions()
        .setInstances(1).setConfig(prm).setHa(true);

// deploy the verticle
vertx.deployVerticle("MyVerticle", depOpt, ar -> {
    if (ar.succeeded()) {
        ...
    }
    else {
        ...
    }
});
EDIT on 12/25/2019: After reading Alexey's comments, I believe I probably wasn't clear. By "promise of simple horizontal scalability" I didn't mean that redistributing the load upon insertion of a new node is simple. I meant Vert.x's promise to the developer that what he needs to do to make his application scale horizontally would be simple. Scale is the very first argument on the Vert.x home page, but, you're right, after re-reading carefully there's nothing about horizontal scaling onto newly added nodes. I believe I was too much influenced by Elixir or Erlang. Maybe Akka provides this on the JVM, but I didn't try it.
Regarding the second comment, it's not (only) about the number of requests per second. The load I'm considering here is just the number of verticles that are doing nothing but waiting for a message. In a further experiment I will make these verticles do some work and I will send an update. For the time being, imagine long-lived verticles that represent, in memory, actually connected user sessions on a backend. The system runs on 3 (or whatever number of) clustered nodes, each hosting a few thousand (or more) sessions/verticles. From this state, I added a new node and waited until it was fully integrated into the cluster. Then I killed one of the first 3 nodes. All verticles were restarted fine, but only on one node, which, moreover, is not guaranteed to be the "empty" one. The destination node actually seems to be chosen at random: I ran several tests and I have even observed verticles from all killed nodes being restarted on the same node. On a real platform with sufficient load, that would probably lead to a global crash.
I believe that implementing a fair restart of verticles in Vert.x, i.e., distributing the verticles over all remaining nodes based on some measure of their load (CPU, RAM, number of verticles, ...), would be simpler (though not simple) than redistributing the load onto a newly inserted node, as the latter would probably require a scheduler capable of "stealing" verticles from another one.
Yet, on a production system, not being "protected" by some kind of fair distribution of workload across the cluster may lead to big issues, and since Vert.x is quite mature, I was surprised by the outcome of my experiments, hence my suspicion that I was doing something wrong.

Related

Hazelcast IMap Lock not working on kubernetes across different pods

We are using Hazelcast 4 to implement distributed locking across two pods on Kubernetes.
We have developed a distributed application; two pods of the microservice have been created. Both instances are auto-discovered and form a cluster as members.
We are trying to use the IMap.lock(key) method to achieve distributed locking across the two pods, however both pods acquire the lock at the same time, thereby executing the business logic concurrently. Also, Hazelcast Management Center shows zero locks for the created IMap.
Can you please help on how to achieve synchronization with IMap.lock(key) so that a single pod gets the lock for a given key at a given point in time?
Code Snippet:-
HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
IMap map = client.getMap("customers");
map.lock(key);
try {
    // business logic
} finally {
    map.unlock(key);
}
Can you create an MCVE (minimal, complete, verifiable example) and confirm the version of Hazelcast used, please?
There are tests for locks here that you can perhaps use as a way to simplify things and determine where the fault lies.
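To narrow it down, something like the sketch below could serve as a starting point for such a reproducer: two clients in one JVM contending for the same key. The map name "customers" and the key "42" are placeholders, and it assumes a reachable Hazelcast 4 cluster with default client configuration.
```java
import com.hazelcast.client.HazelcastClient;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;

public class LockReproducer {
    public static void main(String[] args) {
        // two separate clients simulate the two pods (hypothetical setup)
        HazelcastInstance clientA = HazelcastClient.newHazelcastClient();
        HazelcastInstance clientB = HazelcastClient.newHazelcastClient();

        IMap<String, String> mapA = clientA.getMap("customers");
        IMap<String, String> mapB = clientB.getMap("customers");

        mapA.lock("42");
        try {
            // while A holds the lock, B must not be able to acquire it
            boolean acquiredByB = mapB.tryLock("42");
            System.out.println("B acquired lock while A holds it: " + acquiredByB); // expected: false
        } finally {
            mapA.unlock("42");
        }

        clientA.shutdown();
        clientB.shutdown();
    }
}
```
If this prints false, the locking itself works and the problem is more likely in how the pods obtain or share the key.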

Difference between executing StreamTasks in the same instance v/s multiple instances

Say I have a topic with 3 partitions
Method 1: I run one instance of Kafka Streams; it starts 3 tasks [0_0, 0_1, 0_2] and each of these tasks consumes from one partition.
Method 2: I spin up three instances of the same streams application; here again three tasks are started, but now they are distributed among the 3 instances that were created.
Which method is preferable and why?
In method 1, do all the tasks run as part of the same thread, and in method 2 do they run on different threads, or is it different?
Consider that the streams application has a very simple topology and only maps values from a single stream.
By default, a single KafkaStreams instance runs one thread, thus in "Method 1" all three tasks are executed by a single thread. In "Method 2" each task is executed by its own thread. Note that you can also configure multiple threads per KafkaStreams instance via the num.stream.threads configuration parameter. If you set it to 3 for "Method 1", both methods are more or less the same. How many threads you need depends on your workload, i.e., how many messages you need to process per time unit and how expensive the computation is. It also depends on the hardware: for a single-core CPU, it may not make sense to configure more than one thread; you should instead deploy multiple instances on multiple machines to get more hardware. Hence, if your workload is lightweight, one single-threaded instance might be enough.
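For illustration, a minimal sketch of how the thread count could be configured; the application id, broker address, and topic names are placeholders, and the topology is just the simple value mapping the question describes:
```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class MappingApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "mapping-app");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // with 3 threads in one instance, each of the 3 tasks gets its own thread
        props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 3);

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")      // placeholder topic names
               .mapValues(value -> value.toUpperCase())
               .to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```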
Also note that you may be network bound. In that case, starting more threads would not help; you would want to scale out to multiple machines, too.
The last consideration is fault-tolerance. Even if a single thread/instance may be powerful enough to not lag, what should happen if the instance crashes? If you only have one instance, the whole computation goes down. If you run two instances, the second instance would take over all the work and your application stays online.

Service Fabric increasing thread count

We've noticed that the thread count of our Actors and WebApi increases over time, even when idle, in Service Fabric.
Details:
On-premises cluster using the 5.7.198.9494 version.
Development cluster (e.g. all node types on one box, not multi-machine)
~100 unique actor ids, each with ~15 actor services (e.g. UserActor, WishlistActor, etc.)
Each of the actor services emits ActorEvents. These are subscribed to by the WebApi, which uses SignalR v2 to send the event data to UI clients.
My gut feeling is that I'm subscribing to ActorEvents incorrectly (Related question about subscribing to ActorEvents). Since it's difficult to know whether I've subscribed to the actor before, I might subscribe multiple times. However, I'm not seeing an increase in thread count when I do this, at least not immediately.
Looking at procexp, the threads taking up CPU utilization are clr.dll!InstallCustomModule+0x1c00. There are also tons of ntdll.dll!RtReleaseSRWLockExclusive+0x1370 threads that take up very little CPU utilization individually, but it adds up over the whole. They don't seem to be released.
Any ideas how to prevent this decay in usability?
Edit: Fixed procexp's name

Communication protocol

I'm developing a distributed system that consists of master and worker servers. There should be two kinds of messages:
Heartbeat
The master gets the state of a worker and responds immediately with an appropriate command. For instance:
Message from Worker to Master: "Hey there! I have data a,b,c"
Response from Master to Worker: "All ok, but throw away c - we don't need this anymore"
The participants exchange these messages at interval T.
Direct master command
Let's say a client asks the master to kill job #123. Here is the conversation:
Message from Master to Worker: "Alarm! We need to kill job #123"
Message from Worker to Master: "No problem! Done."
Obviously, we can't predict when this message will appear.
The simplest solution is for the master to initiate all communications for both message types (in the case of the heartbeat, we would include another message from the master to start the exchange). But let's assume that it is expensive to do all the heartbeat housekeeping on the master side for N workers. And we don't want to waste resources keeping several TCP connections to the worker servers, so we have just one.
Is there any solution for these constraints?
First off, you have to do some bookkeeping somewhere. Otherwise, who's going to realize that a worker has died? The natural place to put that data is on the master, if you're building a master/worker system. Otherwise, the workers could be asked to keep track of each other in a long circle, or a randomized graph. If a worker notices that their accountabilibuddy is not responding anymore, it can alert the master.
The same thing applies to the list of jobs currently running; who keeps track of that? It also scales O(n), so presumably the master doesn't have space for that either. Sharding that data out among the workers (e.g. by keeping track of what their accountabilibuddy is supposed to be doing) only works so far; if a and b crash, and a is the only one looking after b, you just lost the list of jobs running on b (and possibly the alert that was supposed to notify you that b crashed).
I'd recommend a distributed consensus algorithm for this kind of task. For production, use something someone else has already written; they probably know what they're doing. If it's for learning purposes, which I presume, have a look at the raft consensus algorithm. It's not too hard to understand, but still highlights a lot of the complexity in distributed systems. The simulator is gold for proper understanding.
A master/worker system will never properly work with less than O(n) resources for n workers in the face of crashing workers. By definition, the master needs to control the workers, which is an O(n) job, even if some workers manage other workers. Also, what happens if the master crashes?
As Filip Haglund said, read the Raft paper; you should also implement it yourself. In a nutshell, here is what you need to extract from it with regard to membership management.
You need to keep the membership list and the master's identity on all nodes.
Raft does its heartbeating from the master's end; it is not very expensive network-wise and you don't need to keep connections open. Every 200 ms to a second you send the heartbeat; if a node doesn't reply, the master tells the other nodes to remove member x from the list.
As for what to do if the master dies: basically you need preset candidate nodes. If a candidate hasn't received a heartbeat within the timeout, it requests votes from the rest of the cluster. If it gets even the slightest majority, it becomes the new leader.
If you want to join an existing cluster, it is basically the same as above: if the node you contact is not the leader, it responds "not leader" along with the leader's address.
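As a rough illustration of the master-side bookkeeping described above (not Raft itself, just the heartbeat timeout part), here is a minimal sketch; the interval and timeout values, the worker identifiers, and the removeMember callback are all hypothetical:
```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MembershipTracker {
    private static final long TIMEOUT_MS = 1000;   // hypothetical: evict a worker after 1 s of silence

    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // called whenever a heartbeat message arrives from a worker
    public void onHeartbeat(String workerId) {
        lastHeartbeat.put(workerId, System.currentTimeMillis());
    }

    // periodically evict workers whose last heartbeat is too old
    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            lastHeartbeat.forEach((workerId, seen) -> {
                if (now - seen > TIMEOUT_MS) {
                    lastHeartbeat.remove(workerId);
                    removeMember(workerId);        // hypothetical: broadcast "remove member x" to the cluster
                }
            });
        }, 200, 200, TimeUnit.MILLISECONDS);       // check every 200 ms, as in the answer above
    }

    private void removeMember(String workerId) {
        System.out.println("removing dead worker " + workerId);
    }
}
```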

Service with background jobs, how to ensure jobs only run periodically ONCE per cluster

I have a Play Framework-based service that is stateless and intended to be deployed across many machines for horizontal scaling.
This service handles HTTP JSON requests and responses, and uses CouchDB as its data store, again for maximum scalability.
We have a small number of background jobs that need to be run every X seconds across the whole cluster. It is vital that each job does not execute concurrently on every machine, but runs only once per cluster per interval.
To execute the jobs we're using Actors and the Akka Scheduler (since we're using Scala):
Akka.system().scheduler.schedule(
    Duration.create(0, TimeUnit.MILLISECONDS),
    Duration.create(10, TimeUnit.SECONDS),
    Akka.system().actorOf(LoggingJob.props),
    "tick")
(etc)

object LoggingJob {
  def props = Props[LoggingJob]
}

class LoggingJob extends UntypedActor {
  override def onReceive(message: Any) {
    Logger.info("Job executed! " + message.toString())
  }
}
Is there:
any built-in trickery in Akka/Actors/Play that I've missed that will do this for me?
OR a recognised algorithm that I can put on top of Couchbase (distributed mutex? not quite?) to do this?
I do not want to make any of the instances 'special' as it needs to be very simple to deploy and manage.
Check out Akka's Cluster Singleton Pattern.
For some use cases it is convenient and sometimes also mandatory to ensure that you have exactly one actor of a certain type running somewhere in the cluster.
Some examples:
single point of responsibility for certain cluster-wide consistent decisions, or coordination of actions across the cluster system
single entry point to an external system
single master, many workers
centralized naming service, or routing logic
Using a singleton should not be the first design choice. It has several drawbacks, such as a single point of bottleneck. A single point of failure is also a relevant concern, but for some cases this feature takes care of that by making sure that another singleton instance will eventually be started.
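For reference, a rough sketch of what wiring the existing LoggingJob into a cluster singleton could look like. It is shown with Akka's classic Java API purely for illustration; it assumes the akka-cluster-tools module is on the classpath and that the Play nodes form an Akka cluster, and the actor system and singleton names are placeholders.
```java
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.PoisonPill;
import akka.actor.Props;
import akka.cluster.singleton.ClusterSingletonManager;
import akka.cluster.singleton.ClusterSingletonManagerSettings;
import akka.cluster.singleton.ClusterSingletonProxy;
import akka.cluster.singleton.ClusterSingletonProxySettings;

public class SingletonSetup {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("ClusterSystem"); // placeholder system name

        // starts LoggingJob on exactly one node of the cluster; another node takes over if it goes down
        system.actorOf(
            ClusterSingletonManager.props(
                Props.create(LoggingJob.class),            // the actor that must exist only once per cluster
                PoisonPill.getInstance(),                  // message used to stop the singleton during hand-over
                ClusterSingletonManagerSettings.create(system)),
            "loggingJobSingleton");

        // a proxy that always routes messages to the current singleton instance, wherever it lives
        ActorRef proxy = system.actorOf(
            ClusterSingletonProxy.props(
                "/user/loggingJobSingleton",
                ClusterSingletonProxySettings.create(system)),
            "loggingJobProxy");

        // the periodic "tick" from the question would then be scheduled against the proxy
        proxy.tell("tick", ActorRef.noSender());
    }
}
```
With this setup the scheduler can keep running on every instance; only the node currently hosting the singleton actually processes the ticks, so the job executes once per cluster without marking any instance as "special" in deployment.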