Failsave Kubernetes with preemtible VMs in Google Cloud - kubernetes

Google has an offer called preemptible VMs which are VMs that do not guarantee to be available all the time and which are shut down once every 24 hours.
Our goal is to deploy a failsave (to a certain degree) kubernetes cluster with those VM by having enough backup VMs to handle the case that one VM is shutdown. This article describe a simple scenario where preemptible VMs are used to run an image service. This scenario is simple because there is no database or message broker involved running on preemptible VMs.
Is it possible to run a whole (microservice-based) application including databases and message brokers with only preemptible VMs?
Further Questions that we have:
When do the preemptible VMs get shut down usually? Is it usually the case that if one VM gets shut down, all the other are too (at the same time)?
How log is the downtime of a preemptible VM getting restarted?
Any guidance that helps answering those questions and/or helping us configure such a cluster is apprechiated.

Regarding your questions:
1.-When do the preemptible VMs get shut down usually? Is it usually the case that if one VM gets shut down, all the other are too (at the same time)?
A:The lifetime for preemptible VMs is no more than 24 hours, they can be shutdown whenever Google need the resources between this lifetime, find more information about limitations here. Resetting the counter means that you manually stop and start the instances, however keep in mind that the selection will preempt instances that were recently launched most recently.
2.-How log is the downtime of a preemptible VM getting restarted?
A: If you mean where you can see the logs of Compute Engine that notices you when an instance was terminated, you can use Stackdriver Logging.

Related

Airflow fault tolerance

I have 2 questions:
first, what does it mean that the Kubernetes executor is fault tolerance, in other words, what happens if one worker nodes gets down?
Second question, is it possible that the whole Airflow server gets down? if yes, is there a backup that runs automatically to continue the work?
Note: I have started learning airflow recently.
Thanks in advance
This is a theoretical question that faced me while learning apache airflow, I have read the documentation
but it did not mention how fault tolerance is handled
what does it mean that the Kubernetes executor is fault tolerance?
Airflow scheduler use a Kubernetes API watcher to watch the state of the workers (tasks) on each change in order to discover failed pods. When a worker pod gets down, the scheduler detect this failure and change the state of the failed tasks in the Metadata, then these tasks can be rescheduled and executed based on the retry configurations.
is it possible that the whole Airflow server gets down?
yes it is possible for different reasons, and you have some different solutions/tips for each one:
problem in the Metadata: the most important part in Airflow is the Metadata where it's the central point used to communicate between the different schedulers and workers, and it is used to save the state of all the dag runs and tasks, and to share messages between tasks, and to store variables and connections, so when it gets down, everything will fail:
you can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
you can deploy it on your K8S cluster but in HA mode (doc for postgresql)
problem with the scheduler: the scheduler is running as a pod, and the is a possibility to lose depending on how you deploy it:
Try to request enough resources (especially memory) to avoid OOM problem
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) for the scheduler to activate HA mode, in this case if a scheduler gets down, there will be other schedulers up
problem with webserver pod: it doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
Try to request enough resources (especially memory) to avoid OOM problem
It's a stateless service, so you can create multiple replicas without any problem, if one gets down, you will access the UI/API using the other replicas

How to prevent data inconsistency when one node lose network connectivity in kubernetes

I have a situation where I have a cluster with a service (we named it A1) and its data which is on a remote storage like cephFS in my case. the number of replica for my service is 1. Assume I have 5 node in my cluster and service A1 reside in node 1. something happens with node 1 network and it lose the connectivity with cephFS cluster and my Kubernetes cluster as well (or docker-swarm). cluster mark it as unreachable and start a new service (we named it A2) on node 2 to keep replica as 1. after for example 15 min node 1 network fixed and node 1 get back to cluster and have service A1 running already (assume it didn't crash while it loses its connectivity with remote storage).
I worked with docker-swarm and recently switched to Kubernetes. I see Kuber has a feature call StatefulSet but when I read about it. it doesn't answer my question. (or I may miss something when I read about it)
Question A: what does cluster do. does it keep A2 and shutdown A1 or let A1 keeps working and shutdown A2 (Logically it should shutdown A1)
Question B (and my primary question as well!): Assume that the cluster wants to shutdown on of these services (for example A1). This service does some save on storage when it wants to shutdown. in this case state A1 save to disk and A2 with newer state saved something before A1 network get fixed.
There must be some locks when we mount the volume to the container in which when it attached to one container other container cant write to that (let A1 failed when want to save its old state data on disk)
The way it works - using docker swarm terminology -
You have a service. A service is a description of some image you'd like to run, how many replicas and so on. Assuming the service specifies at least 1 replica should be running it will create a task that will schedule a container on a swarm node.
So the service is associated with 0 to many tasks, where each task has 0 - if its still starting or 1 container - if the task is running or stopped - which is on a node.
So, when swarm (the orcestrator) detects a node go offline, it principally sees that a number of tasks associated with a service have lost their containers, and so the replication (in terms of running tasks) is no longer correct for the service, and it creates new tasks which in turn will schedule new containers on the available nodes.
On the disconnected node, the swarm worker notices that it has lost connection to the swarm managers so it cleans up all the tasks it is holding onto as it no longer has current information about them. In the process of cleaning the tasks up, the associated containers get stopped.
This is good because when the node finally reconnects there is no race condition where there are two tasks running. Only "A2" is running and "A1" has been shut down.
This is bad if you have a situation where nodes can lose connectivity to the managers frequently, but you need the services to keep running on those nodes regardless, as they will be shut down each time the workers detach.
The process on K8s is pretty much the same just change the terminology.

Is there downtime when a partition is moved to a new node?

Service Fabric offers the capability to rebalance partitions whenever a node is removed or added to the cluster. The Service Fabric Cluster Resource Manager will move one or more partitions to this node so more work can be done.
Imagine a reliable actor service which has thousands of actors running who are distributed across multiple partitions. If the Resource Manager decides to move one or more partitions, will this cause any downtime? Or does rebalancing partitions work the same as upgrading a service?
They act pretty much the same way, The main difference I can point is that Upgrades might affect only the services being updated, and re-balancing might affect multiple services at once. During an upgrade, the cluster might re-balance the services as well to fit the new service instance in a node.
Adding or Removing nodes I would compare more with node failures. In any of these cases they will be rebalanced because of the cluster capacity changes, not because of the service metric\load changes.
The main difference between a node failure and a cluster scaling(Add/remove node) is that the rebalance will take in account the services states during the process, when a infrastructure notification comes in telling that a node is being shutdown(for updates or maintenance, or scaling down) the SF will ask the Infrastructure to wait so it can prepare for this announced 'failure', and then start re-balancing the services.
Even though re-balancing cares about the service states for a scale down it should not be considered more reliable than a node failure, because the infrastructure will wait for a while before shutting down the node(the limit it can wait will depend on the reliability tier you defined for your cluster), until SF check if the services meet health conditions, like turn down services and creating new ones, checking if they will run fine without errors, if this process takes too long, these service might be killed once the timeout is reached and the infrastructure proceed with the changes, Also, the new instances of the services might fail on new nodes, forcing the services to move again.
When you design the services is safer to consider the re-balancing as a node failure, because at the end is not much different. Your services will move around, data stored in memory will be lost if not persisted, the service address will change, and etc. The services should have replicated data and the clients should always use a retry logic and refresh the services location to reduce the down time.
The main difference between service upgrade and service rebalancing is that during upgrade all replicas from all partitions are get turned off on particular node. According to documentation here balancing is done on replica basis i.e. only some replicas from some partitions will get moved, so there shouldn't be any outage.

How is the preemption notice handled?

I'm currently running on AWS and use kube-aws/kube-spot-termination-notice-handler to intercept an AWS spot termination notice and gracefully evict the pods.
I'm reading this GKE documentation page and I see:
Preemptible instances terminate after 30 seconds upon receiving a preemption notice.
Going into the Compute Engine documentation, I see that a ACPI G2 Soft Off is sent 30 seconds before the termination happens but this issue suggests that the kubelet itself doesn't handle it.
So, how does GKE handle preemption? Will the node do a drain/cordon operation or does it just do a hard shutdown?
Yes you are right, so far there is no built in way to handle ACPI G2 Soft Off.
Notice that if normal preemptible instance supports shutdown scripts (where you could introduce some kind of logic to perform drain/cordon), this is not the case if they are Kubernetes nodes:
Currently, preemptible VMs do not support shutdown scripts.
You can perform some test but quoting again from documentation:
You can simulate an instance preemption by stopping the instance.
And so far if you stop the instance, even if it is a Kubernetes node no action is taken to cordon/drain and gratefully remove the node from the cluster.
However this feature is still in beta therefore it is at its early stage of life and in this moment it is a matter of discussion if and how introduce this feature.
Disclaimer: I work For Google Cloud Platform Support
More recent and relevant answer
There's a GitHub project (not mine) that catches this ACPI handler and has the node cordon and drain itself, and then restart itself which in our tests results in a much cleaner preemptible experience, it's almost not noticeable with a highly available deployments on your cluster.
See: https://github.com/GoogleCloudPlatform/k8s-node-termination-handler

How do I setup a Active / Passive environment with two nodes in OpenShift?

I am trying to configure a Active/Passive cluster with two nodes (using OpenShift). The second passive node should be a hot standby, in other words it is up and running but not doing anything, until the first node dies. Then the passive node becomes active and a new passive node is started.
I have read the High Availability documentation, however it just seems to cover the theory. Furthermore it seems like overkill ( I am thinking there might be an easier way to meet my goal).
Where would I start?
What you are asking for goes against the usual practice for how Kubernetes/OpenShift is used. You wouldn't have hot standby nodes, you would always use all nodes in the cluster. You would then allow for enough additional capacity in your cluster such that loosing a node doesn't cause a problem as other nodes would have enough capacity to then run the applications. In this scenario the Kubernetes scheduler would automatically restart any applications which were on a failed node on the other nodes in the cluster, without you needing to perform any explicit failover steps.
So don't try and do anything special, setup your cluster with the two nodes, with applications being distributed across both. If you need to have the ability to run with only a single node, make sure it has enough capacity to run everything. If over time you add more applications and one node is not enough, add a third node, with all three being used in normal case. You can then handle failure of a single node again.