Windows OS Update/Patch handling - best practices for SF today - azure-service-fabric

I'm aware that SF doesn't yet handle OS upgrades/patching automatically the way Cloud Services does. I eagerly await that feature. But for now I am curious what I should expect by default.
Since SF uses Scale Sets and standard Windows VMs, should I expect that the instances will have the default Windows Update settings and thus will reboot automatically every so often as updates are applied? I believe the defaults are to install updates automatically and reboot during the defined maintenance window (3am?), is that correct?
If that is true, can I expect that SF will gracefully handle the reboot? By that I mean that any services running on the node are shut down and the load balancer is notified to stop sending requests to any externally visible endpoints on that host.
But taking that a step further, if all of the above happens to be true, is there anything preventing all nodes in my cluster from hitting the maintenance window and rebooting at the same time? That would seem catastrophic to me.
Given all that, what is the best practice and general advice for handling Windows Updates in SF today?

You're correct that there could be catastrophic results if you just turn on Windows Update and let it go. Nothing coordinates the node reboots, and you could lose part or all of your application or cluster if enough nodes go down at once that the Service Fabric system services lose quorum.
The only safe approach is to install the patches/updates on a single node at a time and not move on to the next node until the cluster is healthy again. This can be scripted to make it easier or, in the worst case, done manually.
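The one-node-at-a-time loop described above can be sketched as follows. This is only an illustration of the control flow, not a Service Fabric tool: `patch_node` and `cluster_is_healthy` are hypothetical callbacks that you would have to implement yourself (for example by wrapping your own patch script and a cluster health query), and `max_polls` and `poll_seconds` are made-up tuning parameters.

```python
import time
from typing import Callable, Iterable, List


class ClusterUnhealthyError(RuntimeError):
    """Raised when the cluster does not report healthy within the allowed polls."""


def wait_until_healthy(
    cluster_is_healthy: Callable[[], bool],
    max_polls: int,
    poll_seconds: float,
    sleep: Callable[[float], None],
) -> None:
    # Check health up to max_polls times, sleeping after each failed check.
    for _ in range(max_polls):
        if cluster_is_healthy():
            return
        sleep(poll_seconds)
    raise ClusterUnhealthyError("cluster did not become healthy in time")


def rolling_patch(
    nodes: Iterable[str],
    patch_node: Callable[[str], None],          # hypothetical: patches and reboots one node
    cluster_is_healthy: Callable[[], bool],     # hypothetical: True when the cluster is healthy
    max_polls: int = 120,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Patch nodes strictly one at a time; stop as soon as health does not recover."""
    patched: List[str] = []
    # Never start from an unhealthy cluster.
    wait_until_healthy(cluster_is_healthy, max_polls, poll_seconds, sleep)
    for node in nodes:
        patch_node(node)
        # Do not touch the next node until the cluster has recovered.
        wait_until_healthy(cluster_is_healthy, max_polls, poll_seconds, sleep)
        patched.append(node)
    return patched
```

Raising an exception instead of continuing is deliberate: if the cluster does not recover after a node is patched, the remaining nodes are left untouched so that the problem cannot spread.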
There may be another approach that involves adding node types, but it is not yet tested, so I don't want to give details until we know it works.

Related

NestJS schedulers are not working in production

I have a backend service in NestJS that is deployed on Vercel.
I need several schedulers, so I have used the @nestjs/schedule library, which is super easy to use.
Locally, everything works perfectly.
For some reason, the only thing that is not working in my production environment is those schedulers. Everything else is working: endpoints, database access, and so on.
Does anyone have an idea why? Is it something with my deployment? Maybe Vercel has some issue with that? Maybe this schedule library requires something that Vercel doesn't have?
I am clueless.
Cold boot is the process of starting a computer from shutdown or a powerless state and setting it to normal working condition.
This means that code deployed in a serverless manner runs only when an endpoint is called. The platform spins up a virtual machine to execute your code and keeps it running for a certain period of time in case you get another API hit. It is cheaper and easier for them to keep the machine running for, say, 60 seconds or 5 minutes than to redeploy it on every call after shutting it down when the function finishes.
So in your case, what is most likely happening is that the machine you set the cron on is killed after a period of time. Jobs registered with @nestjs/schedule are timers that live inside your application's process, not in the operating system, so if the machine is shut down, the cron dies with it. The only case where the cron would run is if its scheduled time arrives before the machine is shut down.
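To make that last point concrete, here is a toy model, not a description of Vercel's actual lifecycle. It assumes each request keeps an instance alive for a fixed `keep_alive_seconds` (a made-up parameter) and that an in-process timer can fire only while some instance is alive.

```python
from typing import List, Sequence


def ticks_that_fire(
    tick_times: Sequence[float],
    request_times: Sequence[float],
    keep_alive_seconds: float,
) -> List[float]:
    """Return the scheduled ticks that fall inside a window where an instance is alive.

    Toy assumption: a request at time r keeps an instance alive during
    [r, r + keep_alive_seconds]; outside those windows no process exists,
    so an in-process timer has nothing to run in.
    """
    fired = []
    for tick in tick_times:
        alive = any(r <= tick <= r + keep_alive_seconds for r in request_times)
        if alive:
            fired.append(tick)
    return fired


# A job scheduled every 10 minutes, with two requests in half an hour
# and a 60-second keep-alive: only the tick at t=600 happens to fire.
print(ticks_that_fire([0, 600, 1200, 1800], [590, 1500], 60))  # [600]
```

Under these assumptions a schedule fires only when traffic happens to arrive just before it, which matches the symptom of schedulers that work locally (where the process never stops) but not in production.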
Certain cloud providers give you the option to keep the machines alive. As I remember it, Google Cloud used to work so that if a serverless function is called frequently, it shifts from cold boot to hot start, which doesn't kill the machine entirely, so as long as you have traffic the machines stay alive.
From quick research, Vercel isn't the best at handling crons, due to the nature of its infrastructure, which is a problem because crons are exactly what you need. In general, crons aren't for serverless functions. You can run the scheduled work using queues, for example, or another third-party service; check out this link by Vercel.

Service Fabric - How to repair a failing stateful application

I have a stateful service that configures state backups for the primary replica on RunAsync using an Azure storage account.
The other day someone inadvertently deleted the storage account being used for backups. On our next deployment, the services began throwing errors as they initialize due to this 404 error response.
I have noticed that during a deployment, Service Fabric apparently shuffles around the old version of the service, spinning up new primaries as needed to free up the VM it is upgrading. If the old version of the code fails to instantiate by throwing an exception, the upgrade process fails, causing a rollback.
My problem is that once I create a new storage account, I am still left with seemingly no way to bring the existing services back to a healthy state. My existing services are using storage account URLs with AccountKeys that no longer exist in Azure. Attempts to upgrade fail because the old service instances can't instantiate due to the now-bad configuration.
Are there any ways to deal with this situation?
The simplest thing would be to use an unmonitored manual upgrade to force through the change that would point the service to the new storage account.
However, this puts a lot of management overhead on you, particularly if there are many other services, since you need to be careful to perform all safety and functionality checks manually so as not to regress anything.
The recommended solution is to use the ServiceTypeHealthPolicyMap described here to "mask out" the unhealthy service (since you expect it to be unhealthy during the upgrade). You may also need to adjust some of the other upgrade parameters depending on the exact situation.
A third recommendation, or maybe something to improve in the future, would be to make the change to the account information a configuration-only upgrade. This would ensure that SF tries to change the config in place without restarting the services (by default), which would prevent the existing services from failing over during the upgrade and encountering issues. This is demonstrated in this example.

Two versions of fluentd fighting over port in my cluster

Somehow, I have 2 versions of fluentd running in my cluster:
They end up fighting over the same port. They just keep cranking away, trying to start up on that port, and this saturates all the CPU in the cluster.
unexpected error error_class=Errno::EADDRINUSE error="Address already in use - bind(2) for 0.0.0.0:24231
/opt/google-fluentd/embedded/lib/ruby/2.6.0/socket.rb:201:in 'bind'
I've tried deleting the daemon sets and deployments, but they just keep coming back. I also tried ssh'ing into the machines and killing the process on that port. Nothing seems to work.
Obviously, I only want one version of fluentd to run (and I'm not even sure which one).
I seem to have fixed it. I went to the cluster edit page in the GCP dashboard, and the Kubernetes Engine Monitoring dropdown was blank. It seems not even the dropdown could decide what to display here.
It seems the automated agent, or whatever, seriously messed up here, and had 2 versions of the logging and monitoring system running, fighting over a port, and crushing the CPU on every machine in the cluster. On top of that, I couldn't delete the daemon sets, pods, or deployments. It seems Google treats these as special somehow, maybe with some kind of automated agent, I don't know.
From the dropdown, I just selected System and workload logging and monitoring, saved, and it applied the changes.
Everything looks good so far, but this whole event has me worried because I didn't do anything. This just... happened.
This is a dev cluster, but if it was a production cluster...

Reliable Services seem to deactivate

I'm running into slowness in my stateful services that haven't had activity in a while. It seems that the first call after some period of inactivity is incredibly slow (10+ seconds). Subsequent calls do not suffer this problem. This seems to be a classic case of a service deactivating and waking up.
I'm aware that stateful actors do this; however, this is occurring for stateful services. I am noticing it in my dev and test clusters, where activity is sparse and inconsistent. For disclosure, these environments are running on the lowest resources possible (A0 VMs, bronze tier availability). Regardless, I thought stateful services were supposed to remain always running.
How would I keep them warm and activated? Additionally, how would I diagnose what is actually happening?
Service Fabric doesn't do anything in terms of deactivating or putting services to sleep. Let's look at what a running named instance of a Reliable Service written in C# on Windows really is:
A .NET object instance running inside a process.
That's all. Service Fabric won't shut down the process if it's "idle" (whatever that means - Service Fabric has no such definition), and the object instance is strongly rooted so it won't be garbage collected.
So really all the same factors that would apply to any .NET application apply here.
If I had to take a guess - without knowing anything else about your application - the A0 VMs are most likely to blame. You have less than 1 GB of memory, so paging might be an issue. You have a fraction of a shared CPU core, so that might be an issue.
I never recommend using A0s. Their extremely limited power can affect not only your services but also the Service Fabric system services that keep your services alive and healthy.

Why is salt-cloud so slow compared to Terraform?

I'm comparing salt-cloud and Terraform as tools to manage our infrastructure on GCE. We use Salt to manage VM configurations, so I would naturally prefer to use salt-cloud as an integral part of the stack and phase out Terraform as a legacy thing.
However, VM deployment time is critical for my use case: we offer a PaaS solution with VMs deployed on customer request, so we need to deliver ready VMs at the click of a button within seconds.
And what puzzles me is why salt-cloud takes so long to deploy basic machines.
I have created a simple head-to-head test that deploys three VMs based on the default CentOS 7 image using both Terraform and salt-cloud (both in parallel mode). The time difference is stunning: Terraform needs around 30 seconds to deploy the requested machines (similar to the time needed to deploy through the GCE GUI), while salt-cloud takes around 220 seconds to deploy exactly the same machines under the same account in the same zone. Especially strange is that for the first 130 seconds salt-cloud does not start deploying and seemingly does nothing at all. Only after around 130 seconds does it show the message that it is deploying VMs, and only then do those VMs appear in the GUI as being deployed.
Is there something obvious that I'm missing about salt-cloud that makes it so slow? Can it be sped up somehow?
I would prefer to use the full Salt stack, but with its current speed issues I cannot really afford that.
Note that this answer is a speculation based on what I understood about terraform and salt-cloud, I haven't verified with an experiment!
I think the reason is that Terraform keeps state of the previous run (either locally or remotely), while salt-cloud doesn't keep state and so queries the cloud before actually provisioning anything.
One of these two approaches (keeping state or querying before doing something) is needed because both tools are idempotent (you can run them multiple times safely).
For example, I think that if you remove the state file of Terraform and re-run it, it will assume there is nothing in the cloud and will actually instantiate a duplicate. This is not to imply that terraform does it wrong, it is to show that state is important and Terraform docs say clearly that when operating in a team the state should be saved remotely, exactly to avoid this kind of problem.
Following my line of thought, this should also mean that if you either run salt-cloud in verbose debug mode or look at the network traffic it generates, then in the first 130 seconds you mention (before it says "deploying VMs") you should see queries from salt-cloud to the cloud provider to dynamically construct the state.
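The two strategies can be contrasted with a toy model. This only illustrates the hypothesis in this answer; it is not how either tool is implemented, and `FakeCloud`, `apply_with_state`, and `apply_with_query` are made-up names.

```python
from typing import Iterable, List, Set


class FakeCloud:
    """In-memory stand-in for a cloud provider API (purely hypothetical)."""

    def __init__(self) -> None:
        self.instances: List[str] = []
        self.list_calls = 0  # counts the (potentially slow) listing queries

    def list_instances(self) -> List[str]:
        self.list_calls += 1
        return list(self.instances)

    def create(self, name: str) -> None:
        self.instances.append(name)


def apply_with_state(cloud: FakeCloud, desired: Iterable[str], state: Set[str]) -> Set[str]:
    """State-keeping strategy: trust the saved state and never query the cloud."""
    for name in desired:
        if name not in state:
            cloud.create(name)
            state.add(name)
    return state


def apply_with_query(cloud: FakeCloud, desired: Iterable[str]) -> None:
    """Stateless strategy: ask the cloud what exists before creating anything."""
    existing = set(cloud.list_instances())
    for name in desired:
        if name not in existing:
            cloud.create(name)


cloud = FakeCloud()
state = apply_with_state(cloud, ["vm1", "vm2"], set())
apply_with_state(cloud, ["vm1", "vm2"], state)   # idempotent, zero queries
apply_with_state(cloud, ["vm1", "vm2"], set())   # state lost: duplicates appear
print(cloud.instances, cloud.list_calls)
```

In this model the state-keeping run is fast because it skips the listing call, but it is only as correct as its saved state, whereas the querying run pays for a listing call on every invocation.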
Last point: the fact that salt-cloud doesn't store the state of a previous run doesn't mean that it is automatically safe to use in a team environment. It is safe only as long as no two team members run it at the same time. Terraform with remote state on Consul, on the other hand, supports locking, so that concurrent use by a team stays safe.