Service Fabric upgrades keep active connections alive - azure-service-fabric

I am trying to upgrade an application deployed to Service Fabric.
How can I only upgrade nodes that have no active connections and wait for the busy nodes to finish before upgrading them?

Most of the time you don't have to worry about upgrades at the node level, because the Service Fabric runtime handles them internally when the upgrade is configured in Monitored mode. This is what we've been using with a high level of success, and we never really had to do much. It also fit our requirement that all upgrade domains (nodes) have to match our health state policies before they are considered healthy.
If you want more advanced control over your upgrades, such as request draining, have a look at the info mentioned here. But to be honest, we've been quite happy with just using Monitored mode and investigating why stuff fails if it does. We had some apps with a long background task running as a stateful actor that sometimes failed to upgrade, and almost always the cause was an issue in the background task itself rather than anything to do with Service Fabric.
Service Fabric knew when no active connections or background tasks were running and only then upgraded the nodes. We could actually see the nodes that were temporarily 'stuck' waiting for an active background task to finish.

Related

Airflow fault tolerance

I have 2 questions:
First, what does it mean that the Kubernetes executor is fault tolerant? In other words, what happens if one worker node goes down?
Second, is it possible for the whole Airflow server to go down? If yes, is there a backup that runs automatically to continue the work?
Note: I have started learning airflow recently.
Thanks in advance
This is a theoretical question that came up while I was learning Apache Airflow. I have read the documentation, but it did not mention how fault tolerance is handled.
What does it mean that the Kubernetes executor is fault tolerant?
The Airflow scheduler uses a Kubernetes API watcher to watch the state of the worker pods (tasks) and discover failed pods on each change. When a worker pod goes down, the scheduler detects the failure and updates the state of the failed tasks in the metadata database. These tasks can then be rescheduled and executed according to their retry configuration.
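The retry configuration mentioned above is set per task. A minimal sketch, assuming the standard `retries` and `retry_delay` task arguments; the values, the `default_args` name, and the `max_extra_wait` helper are illustrative. The dictionary would normally be passed to a DAG, which is not constructed here so the snippet runs without Airflow installed.

```python
from datetime import timedelta

# Task-level retry settings; the values are illustrative assumptions.
default_args = {
    "retries": 3,                         # re-run a failed task up to 3 times
    "retry_delay": timedelta(minutes=5),  # wait between attempts
}


def max_extra_wait(args):
    """Upper bound on time spent waiting between retries (fixed delay assumed)."""
    return args["retries"] * args["retry_delay"]


print(max_extra_wait(default_args))
```

With these settings, a task whose pod is lost is retried up to three times before it is marked as failed.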
is it possible that the whole Airflow server gets down?
Yes, it is possible for different reasons, and there are different solutions/tips for each one:
Problem in the metadata database: this is the most important part of Airflow. It is the central point through which the schedulers and workers communicate, and it stores the state of all DAG runs and tasks, the messages shared between tasks, and the variables and connections. When it goes down, everything fails:
You can use a managed service (AWS RDS or Aurora, GCP Cloud SQL or Cloud Spanner, ...)
You can deploy it on your K8s cluster, but in HA mode (doc for postgresql)
Problem with the scheduler: the scheduler runs as a pod, and there is a possibility of losing it depending on how you deploy it:
Try to request enough resources (especially memory) to avoid OOM problems
Avoid running it on spot/preemptible VMs
Create multiple replicas (minimum 3) of the scheduler to activate HA mode; in this case, if one scheduler goes down, the other schedulers are still up
Problem with the webserver pod: it doesn't affect your workload, but you will not be able to access the UI/API during the downtime:
Try to request enough resources (especially memory) to avoid OOM problems
It's a stateless service, so you can create multiple replicas without any problem; if one goes down, you can still access the UI/API through the other replicas

K8s graceful upgrade of service with long-running connections

tl;dr: I have a server that handles WebSocket connections. The nature of the workload is that it is necessarily stateful (i.e., each connection has long-running state). Each connection can last ~20m-4h. Currently, I only deploy new revisions of this service at off hours to avoid interrupting users too much.
I'd like to move to a new model where deploys happen whenever, and the services gracefully drain connections over the course of ~30 minutes (typically the frontend can find a "good" time to make that switchover within 30 minutes, and if not, we just forcibly disconnect them). I can do that pretty easily with K8s by setting terminationGracePeriodSeconds.
However, what's less clear is how to do rollouts such that new connections only go to the most recent deployment. Suppose I have five replicas running. Normal deploys have an undesirable mode where a client is on R1 (replica 1) and then K8s deploys R1' (upgraded version) and terminates R1; frontend then reconnects and gets routed to R2; R2 terminates, frontend reconnects, gets routed to R3.
Is there any easy way to ensure that after the upgrade starts, new clients get routed only to the upgraded versions? I'm already running Istio (though not using very many of its features), so I could imagine doing something complicated with some custom deployment infrastructure (currently just using Helm) that spins up a new deployment, cuts over new connections to the new deployment, and gracefully drains the old deployment... but I'd rather keep it simple (just Helm running in CI) if possible.
Any thoughts on this?
This is already how things work with normal Services. Once a pod is terminating, it has already been removed from the Endpoints. You'll probably need to raise maxSurge in the rolling update settings of the Deployment to 100%, so that it spawns all the new pods at once and then starts the shutdown process on the old ones.
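The settings discussed above can be sketched as a Deployment patch. This is a minimal sketch, not a tested rollout: the `rollout_patch` helper is hypothetical, `maxUnavailable: 0` is an added assumption (the answer only mentions the surge setting), and the 1800-second grace period comes from the ~30 minute drain window in the question. The patch is built as a plain dictionary and printed as JSON.

```python
import json


def rollout_patch(grace_seconds=1800):
    """Build a Deployment patch: start all new pods first, drain old ones slowly."""
    return {
        "spec": {
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {
                    "maxSurge": "100%",    # create a full set of new pods up front
                    "maxUnavailable": 0,   # assumption: keep old pods until new ones are ready
                },
            },
            "template": {
                "spec": {
                    # time a terminating pod gets to drain its connections
                    "terminationGracePeriodSeconds": grace_seconds,
                }
            },
        }
    }


print(json.dumps(rollout_patch(), indent=2))
```

Since the chart is deployed with Helm, the same three fields would more likely be set in the Deployment template than applied as a patch.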

Windows OS Update/Patch handling - best practices for SF today

I'm aware that the SF doesn't yet automatically handle OS Upgrades/patching in any way like Cloud Services do. I eagerly await it when that is ready. But for now I am curious what I should expect by default.
Since SF uses Scale Sets and standard Windows VMs, should I expect that the instances will have the default Windows Update settings and thus will reboot automatically every so often as updates are applied? I believe the defaults are to install updates automatically and reboot during the defined maintenance window (3am?), is that correct?
If that is true, can I expect that SF will gracefully handle the reboot? By that I mean that any services running on the node are shut down and the load balancer is notified to stop sending requests to any externally visible endpoints on that host.
But taking that a step further, if all of the above happens to be true, is there anything preventing all nodes in my cluster from hitting the maintenance window and rebooting at the same time? That would seem catastrophic to me.
Given all that, what is the best practice and general advice for handling Windows Updates in SF today?
You're correct that there could be catastrophic results if you just turn on Windows Update and let it go. There will be no coordination when the node reboots and you could lose part or all of your application or cluster if the nodes cause the service fabric services to lose quorum.
The only safe approach is to install the patches/updates on a single node at a time and not move to the next node until the cluster is healthy again. This can be scripted to make it easier or, in the worst case, done manually.
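The one-node-at-a-time approach can be scripted roughly as follows. This is a sketch of the control flow only: `patch_node` and `cluster_is_healthy` are hypothetical callables that you would have to implement against your own tooling, and the polling values are arbitrary.

```python
import time


def patch_nodes_sequentially(nodes, patch_node, cluster_is_healthy,
                             poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Patch one node at a time; wait for a healthy cluster before the next one."""
    patched = []
    for node in nodes:
        patch_node(node)  # hypothetical: install updates and reboot this node
        for _ in range(max_polls):
            if cluster_is_healthy():  # hypothetical health probe
                break
            sleep(poll_seconds)
        else:
            # Never continue to the next node while the cluster is unhealthy.
            raise RuntimeError(f"cluster unhealthy after patching {node}; stopping")
        patched.append(node)
    return patched


# Dry run with fake callables and no real sleeping.
fake_health = iter([False, True, True])
done = patch_nodes_sequentially(
    ["node0", "node1"],
    patch_node=lambda node: None,
    cluster_is_healthy=lambda: next(fake_health),
    sleep=lambda seconds: None,
)
print(done)
```

Stopping on the first unhealthy result is deliberate: it leaves at most one node in a bad state, which is what protects quorum.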
There may be another approach that has to do with adding nodetypes, but it is not yet tested, so I don't want to give details until we know it works.

Reliable Services seem to deactivate

I'm running into slowness in my stateful services that haven't had activity in a while. It seems that the first call after some period of inactivity is incredibly slow (10+ seconds). Subsequent calls do not suffer this problem. This seems to be a classic case of a service deactivating and waking up.
I'm aware that stateful actors do this, however, this is occurring for stateful services. This is being noticed in my dev and test clusters, where activity is sparse and inconsistent. For disclosure, these environments are running on the lowest resources possible (A0 vms, bronze tier availability). Regardless, I thought stateful services were supposed to remain always running.
How would I keep them warm and activated? Additionally, how would I diagnose what is actually happening?
Service Fabric doesn't do anything in terms of deactivating or putting services to sleep. Let's look at what a running named instance of a Reliable Service written in C# on Windows really is:
A .NET object instance running inside a process.
That's all. Service Fabric won't shut down the process if it's "idle" (whatever that means - Service Fabric has no such definition), and the object instance is strongly rooted so it won't be garbage collected.
So really all the same factors that would apply to any .NET application apply here.
If I had to take a guess - without knowing anything else about your application - the A0 VMs are most likely to blame. You have less than 1 GB of memory, so paging might be an issue. You have a fraction of a shared CPU core, so that might be an issue.
I never recommend using A0s. Not only because the extremely limited power can affect your services, but it can also affect the Service Fabric system services that keep your services alive and healthy.

Warmup services on upgrade in Service Fabric

We are wondering if there is a built-in way to warm up services as part of the service upgrades in Service Fabric, similar to the various ways you could warm up e.g. IIS based app pools before they are hit by requests. Ideally we want the individual services to perform some warm-up tasks as part of their initialization (could be cache loading, recovery etc.) before being considered as started and available for other services to contact. This warmup should be part of the upgrade domain processing so the upgrade process should wait for the warmup to be completed and the service reported as OK/Ready.
How are others handling such scenarios, controlling the process for signalling to the service fabric that the specific service is fully started and ready to be contacted by other services?
In the health policy there's this concept:
HealthCheckWaitDurationSec: The time to wait (in seconds) after the upgrade has finished on the upgrade domain before Service Fabric evaluates the health of the application. This duration can also be considered as the time an application should be running before it can be considered healthy. If the health check passes, the upgrade process proceeds to the next upgrade domain. If the health check fails, Service Fabric waits for an interval (the UpgradeHealthCheckInterval) before retrying the health check again until the HealthCheckRetryTimeout is reached. The default and recommended value is 0 seconds.
Source
This is a fixed wait period though.
You can also emit health events yourself. For instance, you can report health 'Unknown' while warming up, and adjust your health policy (HealthCheckWaitDurationSec) to check this.
Reporting health can help. You can't report Unknown, though; you must report Error very early on, then clear the Error when your service is ready. Warning and Ok do not impact the upgrade. To clear the Error, your service can report health state Ok with RemoveWhenExpired=true and a low TTL (read more on how to report).
You must increase HealthCheckRetryTimeout based on the maximum warm-up time. Otherwise, if a health check is performed and the health is evaluated as Error, the upgrade will fail (and roll back or pause, per your policy).
So, the order of events is:
your service reports Error - "Warming up in progress"
upgrade waits for fixed HealthCheckWaitDurationSec (you can set this to min time to warm up)
upgrade performs health checks: if the service hasn't yet warmed up, the health state is Error, so upgrade retries until either HealthCheckRetryTimeout is reached or your service is not in Error anymore (warm up completed and your service cleared the Error).
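The sequence above can be modelled to sanity-check a policy before using it. This is a simplified simulation, not Service Fabric code: `upgrade_domain_outcome` and its parameter names are hypothetical, the service is assumed to report Error from the moment it starts until warm-up completes, and other policy settings are ignored.

```python
def upgrade_domain_outcome(warmup_sec, wait_sec, retry_interval_sec, retry_timeout_sec):
    """Return (outcome, seconds_elapsed) for one upgrade domain.

    warmup_sec          how long the service keeps its Error report
    wait_sec            models HealthCheckWaitDurationSec
    retry_interval_sec  models UpgradeHealthCheckInterval
    retry_timeout_sec   models HealthCheckRetryTimeout
    """
    t = wait_sec                              # fixed wait before the first check
    deadline = wait_sec + retry_timeout_sec   # last moment a check may run
    while True:
        if t >= warmup_sec:                   # Error cleared: the check passes
            return ("proceed", t)
        if t + retry_interval_sec > deadline:
            return ("fail", t)                # retries exhausted while still in Error
        t += retry_interval_sec


# Warm-up fits inside the retry window, so the upgrade moves on.
print(upgrade_domain_outcome(warmup_sec=90, wait_sec=30,
                             retry_interval_sec=10, retry_timeout_sec=120))
# Warm-up is longer than wait + retry timeout, so the upgrade fails.
print(upgrade_domain_outcome(warmup_sec=300, wait_sec=30,
                             retry_interval_sec=10, retry_timeout_sec=120))
```

The second call illustrates why the retry timeout has to be sized for the longest warm-up you expect.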