Is there downtime when a partition is moved to a new node? - azure-service-fabric

Service Fabric offers the capability to rebalance partitions whenever a node is removed or added to the cluster. The Service Fabric Cluster Resource Manager will move one or more partitions to this node so more work can be done.
Imagine a reliable actor service which has thousands of actors running who are distributed across multiple partitions. If the Resource Manager decides to move one or more partitions, will this cause any downtime? Or does rebalancing partitions work the same as upgrading a service?

They act pretty much the same way, The main difference I can point is that Upgrades might affect only the services being updated, and re-balancing might affect multiple services at once. During an upgrade, the cluster might re-balance the services as well to fit the new service instance in a node.
Adding or Removing nodes I would compare more with node failures. In any of these cases they will be rebalanced because of the cluster capacity changes, not because of the service metric\load changes.
The main difference between a node failure and a cluster scaling(Add/remove node) is that the rebalance will take in account the services states during the process, when a infrastructure notification comes in telling that a node is being shutdown(for updates or maintenance, or scaling down) the SF will ask the Infrastructure to wait so it can prepare for this announced 'failure', and then start re-balancing the services.
Even though re-balancing cares about the service states for a scale down it should not be considered more reliable than a node failure, because the infrastructure will wait for a while before shutting down the node(the limit it can wait will depend on the reliability tier you defined for your cluster), until SF check if the services meet health conditions, like turn down services and creating new ones, checking if they will run fine without errors, if this process takes too long, these service might be killed once the timeout is reached and the infrastructure proceed with the changes, Also, the new instances of the services might fail on new nodes, forcing the services to move again.
When you design the services is safer to consider the re-balancing as a node failure, because at the end is not much different. Your services will move around, data stored in memory will be lost if not persisted, the service address will change, and etc. The services should have replicated data and the clients should always use a retry logic and refresh the services location to reduce the down time.

The main difference between service upgrade and service rebalancing is that during upgrade all replicas from all partitions are get turned off on particular node. According to documentation here balancing is done on replica basis i.e. only some replicas from some partitions will get moved, so there shouldn't be any outage.

Related

Airflow fault tolerance

I have 2 questions:
first, what does it mean that the Kubernetes executor is fault tolerance, in other words, what happens if one worker nodes gets down?
Second question, is it possible that the whole Airflow server gets down? if yes, is there a backup that runs automatically to continue the work?
Note: I have started learning airflow recently.
Thanks in advance
This is a theoretical question that faced me while learning apache airflow, I have read the documentation
but it did not mention how fault tolerance is handled
what does it mean that the Kubernetes executor is fault tolerance?
Airflow scheduler use a Kubernetes API watcher to watch the state of the workers (tasks) on each change in order to discover failed pods. When a worker pod gets down, the scheduler detect this failure and change the state of the failed tasks in the Metadata, then these tasks can be rescheduled and executed based on the retry configurations.
is it possible that the whole Airflow server gets down?
yes it is possible for different reasons, and you have some different solutions/tips for each one:
problem in the Metadata: the most important part in Airflow is the Metadata where it's the central point used to communicate between the different schedulers and workers, and it is used to save the state of all the dag runs and tasks, and to share messages between tasks, and to store variables and connections, so when it gets down, everything will fail:
you can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
you can deploy it on your K8S cluster but in HA mode (doc for postgresql)
problem with the scheduler: the scheduler is running as a pod, and the is a possibility to lose depending on how you deploy it:
Try to request enough resources (especially memory) to avoid OOM problem
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) for the scheduler to activate HA mode, in this case if a scheduler gets down, there will be other schedulers up
problem with webserver pod: it doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
Try to request enough resources (especially memory) to avoid OOM problem
It's a stateless service, so you can create multiple replicas without any problem, if one gets down, you will access the UI/API using the other replicas

How to stop the entire Akka Cluster (Sharding)

How to stop the ENTIRE cluster with sharding (spanning multiple machines - nodes) from one actor?
I know I can stop the actor system on 'this' node context.system.terminate()
I know I can stop the local Sharding Region.
I found .prepareForFullClusterShutdown() but it doesn't actually stop the nodes.
I suppose there is no single command to do that, but there must be some way to do this.
There's no out-of-the-box way to do this that I'm aware of: the overall expectation is that there's an external control plane (e.g. kubernetes) which manages this.
However, one could have an actor on every node of the cluster that listens for membership events and also subscribes to a pubsub topic. This actor would track the current cluster membership and, when told to begin a cluster shutdown, it publishes a (e.g.) ShutdownCluster message to the topic and tracks which nodes leave. After some length of time (since distributed pubsub is at-most-once) if there are nodes besides this one that haven't left, it sends it again. Eventually, after all other nodes in the cluster have left, this actor then shuts down its node. When other nodes see a ShutdownCluster message, they immediately shut themselves down.
Of course, this sort of scheme will probably not play nicely with any form of external orchestration (whether it's a container scheduler like kubernetes, mesos, or nomad; or even something simple like monit which notices that the service isn't running and restarts it).

Fair distribution of verticles on a Vert.x cluster of nodes

I've been experimenting with Vert.x high availability features to test horizontal scalability and resiliency. I have a cluster of several nodes based on Hazelcast. I'm creating verticles on any nodes via an HTTP API. Verticles have the HA flag set when they are created.
Testing scalability
If I have n nodes Nn loaded with HA-verticles and if I add one additional node there is no verticle that is migrated from the Nn node on the new one so that the load would be balanced. Is there a way to tell Vert.x to do so, or not ? I believe it's not so simple...
Testing resilience
If I have n nodes Nn loaded with HA-verticles and I kill one of the nodes, all the verticles from that very node are migrated, but are migrated on one single of the remaining nodes that is not always the least loaded one. That destination node may become overloaded and the whole cluster would be at risk of freeze or crash. Same question as before: is there a way to force Vert.x to balance the restarted verticles on all nodes, or at least on the node that is the least loaded ?
Your observations are correct, there is no way:
to distribute verticles from a failed node over the rest of the nodes
to prevent starting verticles in a node that is already loaded
Improving the HA features is not on the Vert.x roadmap.
If, as it seems, you need more than basic failover, I would recommend to use specialized infrastructure tools that can leverage info from monitoring systems and start/stop new nodes as needed.

How do I setup a Active / Passive environment with two nodes in OpenShift?

I am trying to configure a Active/Passive cluster with two nodes (using OpenShift). The second passive node should be a hot standby, in other words it is up and running but not doing anything, until the first node dies. Then the passive node becomes active and a new passive node is started.
I have read the High Availability documentation, however it just seems to cover the theory. Furthermore it seems like overkill ( I am thinking there might be an easier way to meet my goal).
Where would I start?
What you are asking for goes against the usual practice for how Kubernetes/OpenShift is used. You wouldn't have hot standby nodes, you would always use all nodes in the cluster. You would then allow for enough additional capacity in your cluster such that loosing a node doesn't cause a problem as other nodes would have enough capacity to then run the applications. In this scenario the Kubernetes scheduler would automatically restart any applications which were on a failed node on the other nodes in the cluster, without you needing to perform any explicit failover steps.
So don't try and do anything special, setup your cluster with the two nodes, with applications being distributed across both. If you need to have the ability to run with only a single node, make sure it has enough capacity to run everything. If over time you add more applications and one node is not enough, add a third node, with all three being used in normal case. You can then handle failure of a single node again.

Zooker Failover Strategies

We are young team building an applicaiton using Storm and Kafka.
We have common Zookeeper ensemble of 3 nodes which is used by both Storm and Kafka.
I wrote a test case to test zooker Failovers
1) Check all the three nodes are running and confirm one is elected as a Leader.
2) Using Zookeeper unix client, created a znode and set a value. Verify the values are reflected on other nodes.
3) Modify the znode. set value in one node and verify other nodes have the change reflected.
4) Kill one of the worker nodes and make sure the master/leader is notified about the crash.
5) Kill the leader node. Verify out of other two nodes, one is elected as a leader.
Do i need i add any more test case? additional ideas/suggestion/pointers to add?
From the documentation
Verifying automatic failover
Once automatic failover has been set up, you should test its operation. To do so, first locate the active NameNode. You can tell which node is active by visiting the NameNode web interfaces -- each node reports its HA state at the top of the page.
Once you have located your active NameNode, you may cause a failure on that node. For example, you can use kill -9 to simulate a JVM crash. Or, you could power cycle the machine or unplug its network interface to simulate a different kind of outage. After triggering the outage you wish to test, the other NameNode should automatically become active within several seconds. The amount of time required to detect a failure and trigger a fail-over depends on the configuration of ha.zookeeper.session-timeout.ms, but defaults to 5 seconds.
If the test does not succeed, you may have a misconfiguration. Check the logs for the zkfc daemons as well as the NameNode daemons in order to further diagnose the issue.
more on setting up automatic failover