How to do zero-downtime rolling updates for an app with (long-lived) sticky sessions using containers (Kubernetes)

I am trying to figure out how to provide zero-downtime rolling updates of a webapp that has long-lived, interactive user sessions which should be sticky, based on a JSESSIONID cookie.
For this (and other) reasons I'm looking at container technology, such as Docker Swarm or Kubernetes.
I am having difficulty finding a good answer on how to:
- Make sure new sessions go to the latest version of the app
- While existing sessions continue to be served by whichever version of the app they were initiated on
- Properly clean up the old version once all sessions on it have been closed
Some more info:
- Requests are linked to a session based on a JSESSIONID cookie
- Sessions could potentially live for days, but I am able to terminate them from within the app within, say, a 24-hour timeframe (for example by sending the user a notification to "log out and log in again because there is a new version", or by logging them out automatically at 12pm)
- Of course, for each version of the app there are multiple containers already running in a load-balanced fashion
- I don't mind the total number of containers growing; for example, the old version's containers might all still be up and running because each of them still hosts one session, while the majority of users are already on the new version of the app
So, my idea of the required flow is something along these lines:
- Put up the new version of the app
- Let all new connections (the ones without the JSESSIONID cookie set) go to the new version of the app
- Once a container of the old version of the app is not serving any sessions anymore, remove the container/...
As I mentioned, I'm looking into Kubernetes and Docker Swarm, but I'm open to other suggestions; the end solution should be able to run on cloud platforms (currently we use Azure, but Google or Amazon clouds might be used in the future).
Any pointers/tips/suggestions or ideas appreciated
Paul
EDIT:
In answer to #Tarun's question and as a general clarification: yes, I want no downtime. The way I envision this is that the containers hosting the old version will keep running to serve all existing sessions. Once all sessions on an old container have ended, that container is removed.
The new containers are only going to serve new sessions for users that start the app after the rollout of the new version has begun.
So, to give an example:
- I launch a new session A on the old version of the app at 9am
- At 10am a new version is rolled out
- I continue to use session A, which remains hosted on a container running the old version
- At noon I go for lunch and log out
- As mine was the last session connected to the container running the old version, that container will now be destroyed
- At 1pm I come back, log back in and get the new version of the app
Makes sense?
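To make this concrete, here's a rough sketch of how I picture the two versions running side by side in Kubernetes (all names and images here are made up). The part I haven't figured out is the router in front that would send requests without a JSESSIONID cookie to webapp-v2 and everything else to webapp-v1:

    # Rough sketch, all names/images are made up. The existing webapp-v1
    # Deployment/Service stays untouched; this adds a parallel pair for v2.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: webapp-v2
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: webapp
          version: v2
      template:
        metadata:
          labels:
            app: webapp
            version: v2
        spec:
          containers:
            - name: webapp
              image: myregistry/webapp:2.0   # placeholder image
              ports:
                - containerPort: 8080
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: webapp-v2        # new (cookie-less) sessions should be routed here
    spec:
      selector:
        app: webapp
        version: v2
      ports:
        - port: 80
          targetPort: 8080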

Your workload might not be a good fit for Kubernetes/containers with its current architecture. The best way I can come up with to solve this is to move the state to a PV/PVC and migrate the PV to the new containers so that the new container has the state from the old session; how to route the calls for that session to the proper node efficiently, though, I'm not sure.
Ideally you would separate your data/caching layer from your service into something like redis, and then it wouldn't matter which of the nodes services the request.
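If you go the redis route, the Kubernetes side of it is simple enough; roughly something like the sketch below (names are made up, and the app itself still needs to be configured to keep its sessions in redis, e.g. via Spring Session or a Tomcat session manager):

    # Minimal sketch: a single redis instance used as the shared session store,
    # so any webapp pod (old or new version) can serve any request.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: session-redis
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: session-redis
      template:
        metadata:
          labels:
            app: session-redis
        spec:
          containers:
            - name: redis
              image: redis:6    # any recent redis image will do
              ports:
                - containerPort: 6379
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: session-redis     # the webapp reaches it at session-redis:6379
    spec:
      selector:
        app: session-redis
      ports:
        - port: 6379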

Related

Can a ReplicaSet be configured to allow in progress updates to complete?

I currently have a Kubernetes setup where we are running a decoupled Drupal/Gatsby app. Drupal acts as a content repository that Gatsby pulls from when building. Drupal is also configured, through a custom module, to connect to the k8s API and patch the deployment Gatsby runs under. Gatsby doesn't run persistently; instead this deployment uses Gatsby as an init container to build the site so that it can then be served by an nginx container. By patching the deployment (modifying a label), a new ReplicaSet is created, which forces a new Gatsby build and ultimately replaces the old build.
This seems to work well and I'm reasonably happy with it, except for one aspect. There is currently an issue with the default scaling behaviour of ReplicaSets when it comes to multiple subsequent content edits. When you make a subsequent content edit within Drupal it will still contact the k8s API and patch the deployment. This results in a new ReplicaSet being created, the original (currently serving) ReplicaSet being left as is, the ReplicaSet from the previous edit being scaled down, and any pods that are still being created (Gatsby building) being killed. I can see why this is probably desirable in most situations, but for me it increases the time it takes before those changes are visible on the site. If multiple people are making edits in Drupal at the same time, this is compounded and could become problematic.
Ideally I would like the containers that are currently building to be able to complete, and those ReplicaSets to finish scaling up, with another ReplicaSet queued for creation once that has completed. This would allow the updates in the first build to be deployed as soon as possible, while queueing up another build immediately afterwards to include any subsequent content, and this could continue for as long as the load requires it and no longer. Is there any way to accomplish this?
This is the regular behavior of Kubernetes. When you update a Deployment it creates a new ReplicaSet and, correspondingly, new Pods according to the new settings. Kubernetes keeps some old ReplicaSets around in case you need to roll back.
If I understand your question correctly, you cannot change this behavior, so you need to do something about the architecture of your application.
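For reference, the Deployment spec only exposes knobs that tune this behavior rather than change it; something like the sketch below (values are purely illustrative):

    # Illustration only: these fields tune the rolling update, but a new
    # ReplicaSet is still created on every template change and the ReplicaSet
    # of any in-flight update is scaled down.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: gatsby-site            # placeholder name
    spec:
      replicas: 2
      revisionHistoryLimit: 5      # how many old ReplicaSets are kept for roll-backs
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1              # extra Pods allowed during the update
          maxUnavailable: 0        # keep all old Pods serving until new ones are Ready
      selector:
        matchLabels:
          app: gatsby-site
      template:
        metadata:
          labels:
            app: gatsby-site
        spec:
          containers:
            - name: nginx
              image: nginx:stable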

Service Fabric Application - changing instance count on application update fails

I am building a CI/CD pipeline to release SF Stateless Application packages into clusters using parameters for everything. This is to ensure environments (DEV/UAT/PROD) can be scoped with different settings.
For example in a DEV cluster an application package may have an instance count of 3 (in a 10 node cluster)
I have noticed that if an application is in the cluster and running with an instance count of, for example, 3, and I change the deployment parameter to anything else (e.g. 5), the application package will upload and register the type, but it will fail when attempting to do a rolling upgrade of the running application.
This also happens the other way around, e.g. if the running app has an instance count of -1 and you want to reduce the count on the next rolling deployment.
Have I missed a setting or config somewhere, or is this how it is supposed to be? At present it's not lending itself to being something that can easily be scaled without downtime.
At its simplest, we just want to be able to change instance counts on application updates, as we have an infrastructure-as-code approach to changes, builds and deployments for full tracking ability.
Thanks in advance
This is a common error when using default services.
This has already been answered multiple times in these places:
"Default service descriptions can not be modified as part of upgrade": set EnableDefaultServicesUpgrade to true
https://blogs.msdn.microsoft.com/maheshk/2017/05/24/azure-service-fabric-error-to-allow-it-set-enabledefaultservicesupgrade-to-true/
https://github.com/Microsoft/service-fabric/issues/253#issuecomment-442074878
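For an Azure-hosted cluster, a rough sketch of how that setting can be applied with the AzureRM PowerShell module (resource group and cluster names are placeholders; the same parameter can also be added under the ClusterManager section of fabricSettings in the ARM template):

    # Hedged sketch: enable upgrades of default services on an Azure SF cluster.
    # Resource group and cluster name below are placeholders.
    Set-AzureRmServiceFabricSetting -ResourceGroupName "my-rg" `
                                    -Name "my-cluster" `
                                    -Section "ClusterManager" `
                                    -Parameter "EnableDefaultServicesUpgrade" `
                                    -Value "true"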

Windows OS Update/Patch handling - best practices for SF today

I'm aware that SF doesn't yet automatically handle OS upgrades/patching in anything like the way Cloud Services do. I eagerly await that capability being ready, but for now I am curious what I should expect by default.
Since SF uses Scale Sets and standard Windows VMs, should I expect that the instances will have the default Windows Update settings and will thus reboot automatically every so often as updates are applied? I believe the defaults are to install updates automatically and reboot during the defined maintenance window (3am?). Is that correct?
If that is true, can I expect that SF will gracefully handle the reboot? By that I mean any services running on it are shutdown and the load balancer is notified to stop sending requests to any externally visible endpoints on that host?
But taking that a step further, if all of the above happens to be true, is there anything preventing all the nodes in my cluster from hitting the maintenance window and rebooting at the same time? That would seem catastrophic to me.
Given all that, what is the best practice and general advice for handling Windows Updates in SF today?
You're correct that there could be catastrophic results if you just turn on Windows Update and let it go. There will be no coordination when the node reboots and you could lose part or all of your application or cluster if the nodes cause the service fabric services to lose quorum.
The only safe approach is to install the patches/updates on a single node at a time and not move to the next node until the cluster is healthy again. This can be scripted to make it easier, or in the worst case it can be done manually.
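As a rough sketch (the connection endpoint is a placeholder and the actual patch installation is left to whatever tooling you use), such a script could look like:

    # Rough sketch, not production-ready: patch one node at a time and wait for
    # cluster health before moving on.
    Connect-ServiceFabricCluster -ConnectionEndpoint "mycluster.example.com:19000"

    foreach ($node in Get-ServiceFabricNode) {
        # Ask Service Fabric to drain and deactivate the node before rebooting it
        Disable-ServiceFabricNode -NodeName $node.NodeName -Intent Restart -Force
        while ((Get-ServiceFabricNode -NodeName $node.NodeName).NodeStatus -ne "Disabled") {
            Start-Sleep -Seconds 10
        }

        # ... install Windows updates on this VM and reboot it (out of band) ...

        Enable-ServiceFabricNode -NodeName $node.NodeName

        # Do not move to the next node until the cluster reports healthy again
        while ((Get-ServiceFabricClusterHealth).AggregatedHealthState -ne "Ok") {
            Start-Sleep -Seconds 30
        }
    }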
There may be another approach that has to do with adding nodetypes, but it is not yet tested, so I don't want to give details until we know it works.

Production sailsjs app with no downtime with pm2

I have a sailsjs app running in cluster mode with pm2 and two instances. One of the main reasons for wanting the two instances was so I could restart/update the app without having to bring the entire app down.
However, in the middle of a restart of one instance (pm2 restart 4), the site is all wonky (that's the technical term) if I refresh it. I'm assuming this is because grunt is doing its thing and the .tmp folder gets destroyed for both instances?
Is the only real approach with sailsjs to have two complete instances running on different ports and use something like nginx as the load balancer, or am I missing something with PM2 that would allow for staged restarts without any downtime or hiccups in the resources being available?
There are a few issues here.
You need to provide the versions of sails.js/node.js/pm2 you're running; in short, describe your environment as completely as possible. Describing your issue more completely helps people write more concise and clear answers.
node.js cluster mode may change (as of v0.12.4) and is still considered "Unstable": https://nodejs.org/api/cluster.html#cluster_cluster
In the following thread, "mikermcneil commented on Dec 3, 2014" says to disable Grunt for production with pm2: https://github.com/balderdashy/sails/pull/1716
Let me clarify by saying I used pm2 until just recently. In addition to the Grunt issue, it has problems with socket connections, while nginx handles them just fine. Trust me, chasing down that bug was not fun. Here's a link to the thread: https://github.com/Unitech/PM2/issues/389
As an alternative solution I chose to use nginx with parallel sails.js apps, using redis for sockets and sessions. Use forever to keep the apps running, and disable Grunt. Point nginx at the assets folder to serve static files quickly, bypassing sails.js, and add caching to those assets.
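Roughly, that nginx setup can look like the sketch below (ports and paths are placeholders you'd adjust to your own app):

    # Rough sketch: two sails.js instances behind nginx, with sessions/sockets
    # kept in redis so either instance can serve any request.
    upstream sails_app {
        server 127.0.0.1:1337;
        server 127.0.0.1:1338;
    }

    server {
        listen 80;

        # Example: serve one of the compiled asset folders straight from disk,
        # bypassing sails.js (repeat for /styles/, /images/, etc.)
        location /js/ {
            root /var/www/myapp/.tmp/public;   # assumed path to the built assets
            expires 7d;
        }

        location / {
            proxy_pass http://sails_app;
            proxy_http_version 1.1;
            # Needed for socket.io / websocket upgrades
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
        }
    }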
Hope this helps!

How exactly does the Heroku deployment process work?

When I deploy a new version of my service to Heroku, what happens exactly?
Suppose I have N web dynos online right now, M of them currently servicing requests.
Do all of them shut down before the new version starts coming online? What happens to any pending requests currently being serviced?
Is there downtime? (let's assume I just have a stateless service without any migrations)
Is there a hook for doing custom migrations (e.g. migrate database tables)?
Can I bring up N servers running the new version, have them start servicing requests, and bring the previous N servers down only once they're not servicing any requests?
Does the answer depend on the stack/language? (Aspen/Bamboo/Cedar, Ruby/Node.js/Java/...)
I couldn't find any official documentation about this, just contradictory posts (some saying hot migrations are not possible, while others say there is no downtime). Are there any official details about the deployment process and the questions above?
Here is what happens during a Heroku deploy (current as of 10/20/2011*)[1]:
- Heroku receives your git push
- A new release is compiled from the latest version of your app and stored
- [These happen roughly simultaneously:]
  - The dyno grid is signalled to terminate[2] all running processes for your app
  - The dyno grid is signalled to start new processes for your app
  - The dyno grid is signalled to unidle all idle processes of your app
  - The HTTP router is signalled to begin routing HTTP traffic to the web dynos running the new release
The general takeaway is that to minimize any possible downtime you should minimize the boot time of your app.
By following careful migration practices, it is possible to push new code and then migrate while the app is running.
Here's an example for Rails: http://pedro.herokuapp.com/past/2011/7/13/rails_migrations_with_no_downtime/
To minimize dropped connections during a restart, use a webserver that responds appropriately to SIGTERM by beginning a graceful shutdown (finish existing connections, don't take new ones). Newer versions of thin will handle SIGTERM correctly.
* This subject is the topic of much discussion internally and may change in the future.
[2] SIGTERM, followed 10s later by SIGKILL if still running.
I can answer "Is there a hook for doing custom migrations (e.g. migrate database tables)?" part of this question. I've handled running migrations by writing a shell script that does a "heroku rake db:migrate" immediately after I issue "git push heroku". I don't know if there is a more "hook" - y way to do that.