Zero downtime Google Compute Engine (GCE) deployment

I'm trying to deploy this Docker GCE project from a deploy.yaml, but every time I update my git repository the server goes down because 1) the original instance gets deleted and 2) the new instance hasn't finished starting up yet (or at least the web app on it hasn't).
What command should I use, or how should I change this, so that I get a canary-style deployment that only destroys the old instance once a new one is up (I only have one instance running at a time)? I have no health checks on the instance group, only on the load balancer.
- name: 'gcr.io/cloud-builders/gcloud'
  args: ['compute', 'instance-groups', 'managed', 'rolling-action', 'replace', 'a-group', '--max-surge', '1']
Thanks for the help!

Like John said, you can set the --max-unavailable and --max-surge flags to alter the behavior of your deployment during updates. With --max-unavailable set to 0, the managed instance group brings up the surge instance first and only removes the old one once the replacement counts as available.
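For reference, here is a minimal sketch of what that Cloud Build step could look like with both flags set. The group name a-group comes from the snippet above; the zone value is a placeholder, and note that without a health check on the instance group the MIG only waits for the new VM to reach RUNNING, not for the web app inside it to become ready.

# Hypothetical Cloud Build step: surge one new instance, never take the old one down early
- name: 'gcr.io/cloud-builders/gcloud'
  args: ['compute', 'instance-groups', 'managed', 'rolling-action', 'replace', 'a-group',
         '--zone', 'us-central1-a',   # placeholder zone
         '--max-surge', '1',
         '--max-unavailable', '0']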

Related

Whole Application level rolling update

My Kubernetes application is made of several flavors of nodes: a couple of “schedulers” which send tasks to quite a few more “worker” nodes. In order for this app to work correctly, all the nodes must be of exactly the same code version.
The deployment is performed using a standard ReplicaSet, and when my CI/CD kicks in it just does a simple rolling update. This causes a problem, though: during the rolling update, nodes of different code versions co-exist for a few seconds, so a few tasks in that window get wrong results.
Ideally what I would want is that deploying a new version would create a completely new application that only communicates with itself and has time to warm its cache, then on a flick of a switch this new app would become active and start to get new client requests. The old app would remain active for a few more seconds and then shut down.
I’m using Istio sidecar for mesh communication.
Is there a standard way to do this? How is such a requirement usually handled?
I also had such a situation. Kubernetes alone cannot satisfy your requirement, and I was also not able to find any tool that coordinates multiple deployments together (although Flagger looks promising).
So the only way I found was via CI/CD: Jenkins in my case. I don't have the code, but the idea is the following:
1. Deploy all of the application's Deployments using a single Helm chart. Every Helm release name and the corresponding Kubernetes labels must be based on some sequential number, e.g. the Jenkins $BUILD_NUMBER. The Helm release can be named like example-app-${BUILD_NUMBER} and all Deployments must carry the label version: $BUILD_NUMBER. The important part is that your Services should not be part of the Helm chart, because they will be handled by Jenkins.
2. Start your build by detecting the current version of the app (using a bash script, or you can store it in a ConfigMap).
3. Run helm install example-app-${BUILD_NUMBER} with the --atomic flag set. The atomic flag makes sure the release is properly removed on failure. Don't delete the previous version of the app yet.
4. Wait for Helm to complete and, in case of success, run kubectl set selector service/example-app version=$BUILD_NUMBER. That instantly switches the Kubernetes Service from one version to the other (see the sketch after this list). If you have multiple Services you can issue multiple set selector commands; each one takes effect immediately.
5. Delete the previous Helm release and optionally update the ConfigMap with the new app version.
Depending on your app, you may want to run tests on the non-user-facing Services as part of step 4 (after the Helm release succeeds).
Another good idea is to have preStop hooks on your worker pods so that they can finish their jobs before being deleted.
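A minimal sketch of the moving parts described above, assuming the example-app name, the version label, and build numbers 41 (live) and 42 (new) purely for illustration. The Service lives outside the Helm chart and Jenkins only repoints its selector; the preStop command is a hypothetical drain script.

# Service created once, outside the Helm chart; its selector is what Jenkins flips
apiVersion: v1
kind: Service
metadata:
  name: example-app
spec:
  selector:
    version: "41"          # the BUILD_NUMBER that is currently live
  ports:
    - port: 80
      targetPort: 8080     # assumed container port

# Step 4: switch traffic to the freshly installed release in one call
kubectl set selector service/example-app 'version=42'

# Optional preStop hook on worker pods so they can finish in-flight jobs before deletion
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "/app/drain-and-exit.sh"]   # hypothetical drain script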
You should consider the Blue/Green deployment strategy.

Reset / Rollback Kubernetes to just-created state?

I am trying to set up a complete GitLab routine to provision my Kubernetes cluster with all installations and configurations automatically, including decommissioning at the end.
However, creation and decommissioning are among the most time-consuming parts, because I am basically waiting for the provisioning before I can execute further commands.
As I sometimes run into trouble with the basics of the Kubernetes setup, I currently decommission my cluster and create a new one. But this is pretty uneconomical and time-consuming.
Question:
Is there a command, or a series of commands, to completely reset a Kubernetes cluster to its state right after creation?
The closest is probably to do all your work in a new namespace and then delete that namespace when you are done. That automatically deletes all objects inside it.
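For example, assuming a throwaway namespace named ci-run-42 for the pipeline run:

# create an isolated namespace for this run
kubectl create namespace ci-run-42
# deploy everything into it
kubectl apply -n ci-run-42 -f manifests/
# deleting the namespace removes every namespaced object it contained
kubectl delete namespace ci-run-42

Note that cluster-scoped objects (CRDs, ClusterRoles, PersistentVolumes and the like) are not namespaced and would have to be cleaned up separately.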

Can a ReplicaSet be configured to allow in progress updates to complete?

I currently have a Kubernetes setup where we are running a decoupled Drupal/Gatsby app. Drupal acts as a content repository that Gatsby pulls from when building. Drupal is also configured, through a custom module, to connect to the k8s API and patch the deployment Gatsby runs under. Gatsby doesn't run persistently; instead, this deployment uses Gatsby as an init container to build the site so that it can then be served by an nginx container. By patching the deployment (modifying a label) a new ReplicaSet is created, which forces a new Gatsby build that ultimately replaces the old one.
This seems to work well and I'm reasonably happy with it, except for one aspect. There is currently an issue with the default scaling behaviour of ReplicaSets when it comes to multiple subsequent content edits. When you make a subsequent content edit within Drupal, it will still contact the k8s API and patch the deployment. This results in a new ReplicaSet being created, the original ReplicaSet being left as-is, the previous ReplicaSet being scaled down, and any pods that are currently being created (Gatsby building) being killed. I can see why this is probably desirable in most situations, but for me it increases the time it takes before the changes become visible on the site. If multiple people are using Drupal at the same time making edits, this is compounded and could become problematic.
Ideally I would like the containers that are currently building to be able to complete and for those replicasets to finish scaling up, queuing another replicaset to be created once this is completed. This would allow any updates in the first build to be deployed asap, whilst queueing up another build immediately after to include any subsequent content, and this could continue for as long as the load is there to require it and no longer. Is there any way to accomplish this?
This is the regular behavior of Kubernetes. When you update a Deployment it creates a new ReplicaSet and, correspondingly, new Pods according to the new settings. Kubernetes keeps some old ReplicaSets around in case of possible roll-backs.
If I understand your question correctly, you cannot change this behavior, so you need to do something about the architecture of your application.
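For illustration only, these are the Deployment fields that govern the behaviour described above: how many old ReplicaSets are retained for rollbacks and how the rollout replaces pods. The names and values are placeholders, and none of these fields make Kubernetes queue rollouts or wait for an in-progress build to finish.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gatsby-site            # placeholder name
spec:
  replicas: 1
  revisionHistoryLimit: 10     # how many old ReplicaSets are kept for rollbacks
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0        # old pods stay up until the new ones are Ready
  selector:
    matchLabels:
      app: gatsby-site
  template:
    metadata:
      labels:
        app: gatsby-site
    spec:
      containers:
        - name: nginx
          image: nginx:1.25    # placeholder; the Gatsby build would run as an init container alongside this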

Service Fabric Application - changing instance count on application update fails

I am building a CI/CD pipeline to release SF Stateless Application packages into clusters using parameters for everything. This is to ensure environments (DEV/UAT/PROD) can be scoped with different settings.
For example in a DEV cluster an application package may have an instance count of 3 (in a 10 node cluster)
I have noticed that if an application is in the cluster and running with an instance count (for example) of 3, and I change the deployment parameter to anything else (e.g. 5), the application package will upload and register the type, but will fail on attempting to do a rolling upgrade of the running application.
This also works the other way, e.g. if the running app has an instance count of -1 and you want to reduce the count on the next rolling deployment.
Have I missed a setting or config somewhere, or is this how it is supposed to be? At present it's not lending itself to being something that can easily be scaled without downtime.
In its simplest form, we just want to be able to change instance counts on application updates, as we have an infrastructure-as-code approach to changes, builds and deployments for full tracking ability.
Thanks in advance
This is a common error when using default services.
It has already been answered multiple times in these places:
Default service descriptions can not be modified as part of upgrade; set EnableDefaultServicesUpgrade to true
https://blogs.msdn.microsoft.com/maheshk/2017/05/24/azure-service-fabric-error-to-allow-it-set-enabledefaultservicesupgrade-to-true/
https://github.com/Microsoft/service-fabric/issues/253#issuecomment-442074878
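For reference, a sketch of where that setting lives in an ARM template for the cluster: it is a ClusterManager parameter under fabricSettings (the surrounding template is omitted here).

"fabricSettings": [
  {
    "name": "ClusterManager",
    "parameters": [
      { "name": "EnableDefaultServicesUpgrade", "value": "true" }
    ]
  }
]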

AWS Mongo QuickStart never completes

Problem
I am trying to complete the MongoDB on AWS Quick Start to create a simple MongoDB cluster. Unfortunately it never completes the rollout; it rolls back after one last installation part (PrimaryReplicaNodeXYWaitForNodeInstallGP2) has not completed within an hour.
Background
My Settings were the following:
AvailabilityZone0 eu-central-1a
AvailabilityZone1 eu-central-1b
AvailabilityZone2 eu-central-1b
BuildBucket quickstart-reference/mongodb/latest
ClusterReplicaSetCount 0
ClusterShardCount 1
ConfigServerInstanceType t2.micro
Iops 100
KeyName my_definitely_working_keypair
MongoDBVersion 3.2
NATInstanceType t2.small
NodeInstanceType m3.medium
PrimaryReplicaSubnet 10.0.2.0/24
PublicSubnet 10.0.1.0/24
RemoteAccessCIDR XXX.XXX.0.0/16
SecondaryReplicaSubnet0 10.0.3.0/24
SecondaryReplicaSubnet1 10.0.4.0/24
ShardsPerNode 0
VolumeSize 40
VolumeType gp2
VPCCIDR 10.0.0.0/16
This caused a rollback with the same behaviour described in the AWS forum:
In "Resources", all but one subtask complete; the remaining one stays on forever as "PrimaryReplicaNode0WaitForNodeInstallGP2 - PrimaryReplicaNode0WaitForNodeInstallWaitHandle - Create in Progress - Resource creation initiated"
So I researched the issue further. The post referred to another forum thread, where users with this problem were advised to delete their DynamoDB entries and set ClusterReplicaSetCount to 3.
The problem here: in DynamoDB there are no entries, and changing ClusterReplicaSetCount to 3 also causes a rollback with a similar error:
ConfigServer2WaitForNodeInstall WaitCondition timed out. Received 0
conditions when expecting 1
and later
MONGODBSTACK1 The following resource(s) failed to create:
[ConfigServer1WaitForNodeInstall,
PrimaryReplicaNode00WaitForNodeInstallGP2,
ConfigServer0WaitForNodeInstall,
SecondaryReplicaNode00WaitForNodeInstallGP2,
SecondaryReplicaNode01WaitForNodeInstallGP2,
ConfigServer2WaitForNodeInstall].
Summary
In both cases the failure is on PrimaryReplicaNodeXYWaitForNodeInstallGP2 (where XY is the number of the node), while all other parts of the installation complete successfully. I am totally in the dark.
Has anyone gotten around this? The Quick Start is from 2016 and I think there must be people who have successfully created this Mongo stack!?
After days and days of hard struggle and no solution, there was an update to the manual and the template (the first in over a year; it feels as if my prayers were heard):
https://docs.aws.amazon.com/quickstart/latest/mongodb/welcome.html
This also comes with a completely revised infrastructure and a more sophisticated setup form; the changes are described as:
Upgraded MongoDB to version 3.4; removed sharding configuration;
updated security groups and added database security; updated
parameters
Following the tutorial is quite similar to the former versions, so no struggle here.
Everything went fine and I got my stack completed, now consisting of a
MongoDB
MongoDB replica
bastion stack
VPC stack
So this part is basically done. If something else comes up, I will open a new question for that.
I noticed this after tearing down a dev cluster and attempting to stand up a new one with the same name.
The torn-down cluster had orphaned a DynamoDB table with the name the new stack was trying to publish the worker nodes' status to. I deleted this DynamoDB table manually, reattempted to spin up the stack with the same name a third time, and had success.
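In case someone hits the same thing, a rough sketch of finding and removing such an orphaned table with the AWS CLI; the table name and region below are placeholders, so check the output of list-tables for the one matching your deleted stack.

# list tables and look for one named after the torn-down stack
aws dynamodb list-tables --region eu-central-1
# delete the orphaned table (placeholder name)
aws dynamodb delete-table --table-name my-old-stack-status-table --region eu-central-1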