Zero-downtime deployment of a Slack bot

We are developing a bot with BotKit and are trying to minimize downtime during deployment.
There is a server with a Docker container running on it. Inside the container runs a bot-app instance connected to the RTM server (Slack).
When I deploy a new version (v2) of bot-app, I want zero downtime: users should not see "bot is offline".
The deploy script starts a second Docker container with the new version of bot-app, which also connects to the RTM server. As a result, for a few seconds both apps are running, connected to the RTM server, and responding to user commands (so a user will see two answers to a single command).
What is the best approach if, on the one hand, we want zero downtime and, on the other hand, we want to prevent a user from interacting with two instances at the same time?
Decision 1:
Accept a small chance of a collision, where both instances respond to the same user command.
Decision 2:
Abandon zero-downtime deployment. In this case, the deploy script first stops the first Docker container and then starts another one. The app will not respond to user commands sent between the current version stopping and the new version being fully started.
Decision 3:
Coordinate the current and new versions of the app while they run in parallel, through interaction between them or with mutexes. General scheme:
1) The current version of the app is running
2) The deploy script starts the new version of the app
3) When the new version of the app is almost up and ready to connect to the RTM server, it sends the current version a command to close its RTM connection
4) The current version of the app closes its RTM connection
5) The new version of the app opens its RTM connection
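A minimal sketch of the hand-off in steps 1) to 5), written in Python rather than BotKit's JavaScript and run as a plain in-process simulation. FakeRtm, BotInstance and hand_off are hypothetical names, not BotKit or Slack APIs; in a real deployment the "close" command would have to travel between the two containers over some channel (HTTP, a signal, a shared store), which is not shown here.

```python
class FakeRtm:
    """Stand-in for the Slack RTM server: it only tracks who is connected."""

    def __init__(self):
        self.connected = set()
        self.max_simultaneous = 0

    def connect(self, name):
        self.connected.add(name)
        self.max_simultaneous = max(self.max_simultaneous, len(self.connected))

    def disconnect(self, name):
        self.connected.discard(name)


class BotInstance:
    """One running version of bot-app (hypothetical wrapper)."""

    def __init__(self, name, rtm):
        self.name = name
        self.rtm = rtm

    def open_connection(self):
        self.rtm.connect(self.name)

    def close_connection(self):
        # Step 4: runs when the "close" command from the new version arrives.
        self.rtm.disconnect(self.name)


def hand_off(current, new):
    """Steps 3 to 5: the current version disconnects first, then the new one connects."""
    current.close_connection()  # steps 3 and 4
    new.open_connection()       # step 5


rtm = FakeRtm()
v1 = BotInstance("v1", rtm)
v1.open_connection()            # step 1: the current version is running
v2 = BotInstance("v2", rtm)     # step 2: the deploy script starts the new version
hand_off(v1, v2)
```

In this simulation at most one instance is connected at any time, so no command gets two answers. Anything sent between steps 4 and 5 reaches neither instance, so that window should be kept as short as possible.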
I think there are other good solutions.
How would you have solved this problem in your application?

(Sorry for the second reply; had another idea.)
The approach I described earlier would be pretty disruptive to your existing code, since you'd probably need to stop using botkit (or at least not use it to do the RTM API communication). An approach that may be less disruptive would be to use some sort of external way to signal that a given message has already been processed.
For example, using Redis, have the bot run the following command when a message comes in:
SET message:<message timestamp> 1 NX PX 30000
The NX option means this command will only succeed if the key doesn't already exist. So the first instance of the bot that manages to execute this will succeed, and the other instance will fail. The bot should only process the message and respond if this command succeeded.
(The PX 30000 sets a 30-second expiration so Redis doesn't get full of these keys.)
This should let you do your zero-downtime upgrades via overlapping the running bot instances without having to worry about a message being processed twice.
Note that it's still possible in this scheme for a message to be dropped altogether if a bot is shut down in a non-graceful way. (It could die just after calling the SET command but before it's actually dealt with the message.) A real queue with a two-phase "get/delete" would be better, but then you're back to my other answer. :-)
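Here is a rough sketch of that claim-before-processing check, in Python rather than the bot's JavaScript. InMemoryStore is a hypothetical stand-in so the example runs without a Redis server; it imitates only SET with NX and PX. With the redis-py client, the equivalent call would be client.set(key, 1, nx=True, px=30000), which returns True on success and None when the key already exists. claim_message and the sample timestamp are made up for illustration.

```python
class InMemoryStore:
    """Hypothetical stand-in for a Redis client; mimics only SET with NX and PX."""

    def __init__(self, clock):
        self._clock = clock   # function returning the current time in milliseconds
        self._expiry = {}     # key -> expiry time in milliseconds (values are not kept)

    def set(self, key, value, nx=False, px=None):
        now = self._clock()
        live = key in self._expiry and self._expiry[key] > now
        if nx and live:
            return None       # NX failed: an unexpired key already exists
        self._expiry[key] = (now + px) if px is not None else float("inf")
        return True


def claim_message(store, message_ts, ttl_ms=30000):
    """Return True only for the first bot instance that claims this message."""
    return store.set(f"message:{message_ts}", 1, nx=True, px=ttl_ms) is True


now_ms = [0]  # fake clock so the expiry can be exercised without waiting
store = InMemoryStore(clock=lambda: now_ms[0])

first = claim_message(store, "1458170917.000029")   # e.g. the old instance
second = claim_message(store, "1458170917.000029")  # e.g. the new instance, same message
```

Only the instance for which claim_message returns True should go on to process the message and respond.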

One idea I would consider is separating into two components:
1) A component that keeps a WebSocket connected to the Slack RTM API. This component simply reads messages from the API and puts them onto a queue. (Let's call this the "queuer.")
2) The actual "bot," which reads messages from the queue and responds as needed.
Depending on how your bot behaves, it can use the Web API directly or perhaps put its own messages on an outbound queue which the "queuer" can send via the RTM API.
This architecture probably solves your problem... you can now either take the bot down briefly while upgrading—responses will just be delayed until the new version is running—or you can run two versions of the bot at the same time and rely on the semantics of the queue to prevent both versions from responding to the same message.
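To make the queue semantics concrete, here is a small single-process sketch in Python. It uses the standard library's queue.Queue in place of a real message broker, and a plain list in place of the RTM connection; queuer, bot_worker and the message names are hypothetical. Two "versions" of the bot read from the same queue, and each message is delivered to exactly one of them.

```python
import queue
import threading


def queuer(rtm_messages, inbound):
    """Reads messages from the (simulated) RTM connection and enqueues them."""
    for message in rtm_messages:
        inbound.put(message)


def bot_worker(version, inbound, handled, lock):
    """Takes messages off the queue; each message goes to exactly one worker."""
    while True:
        message = inbound.get()
        if message is None:  # sentinel: shut this worker down
            break
        with lock:
            handled.append((message, version))


inbound = queue.Queue()
handled = []
lock = threading.Lock()

# Old and new bot versions run side by side during the upgrade.
workers = [
    threading.Thread(target=bot_worker, args=(version, inbound, handled, lock))
    for version in ("v1", "v2")
]
for worker in workers:
    worker.start()

queuer([f"msg-{i}" for i in range(20)], inbound)

for _ in workers:
    inbound.put(None)  # one sentinel per worker
for worker in workers:
    worker.join()
```

Which version handles which message is not deterministic, but no message is handled twice. A real broker would also need the acknowledge-after-processing step so that a crashed worker does not lose a message.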

Related

Continuous deployment: how to deploy new features that are affecting client and server at the same time?

The server holds the logic; the iOS/Android app holds the UI. A common case.
How am I supposed to deploy new features in this case with a continuous deployment methodology?
I assume that a server-side deploy looks like this:
I trigger deployment of the new feature, and the load balancer starts redirecting 1% of all users to the server instance with the new feature. If everything goes smoothly, the load balancer then starts redirecting 10%, 30%, and so on, up to 100%.
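One way such a percentage rollout could be decided per user is stable hash bucketing, sketched below in Python. This is only an illustration of the idea, not how any particular load balancer implements it; bucket, use_new_version and the user ids are hypothetical.

```python
import hashlib


def bucket(user_id):
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def use_new_version(user_id, rollout_percent):
    """True if this user should be routed to the instance with the new feature."""
    return bucket(user_id) < rollout_percent


users = [f"user-{i}" for i in range(1000)]
at_10 = {u for u in users if use_new_version(u, 10)}
at_30 = {u for u in users if use_new_version(u, 30)}
```

Because the bucket depends only on the user id, a user who is on the new version at 10% stays on it at 30% and beyond, instead of bouncing between versions on every request.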
The same can be done for client apps using, say, CodePush.
So if I deploy the server without the app, nobody will use the new feature, and therefore the new deployment will certainly show no problems.
So I probably have to deploy the app first and add some kind of server version check: if the server has the API for the new feature, the UI for that feature is shown, and if the app is connected to a server without it, the new UI is hidden.
That seems primitive. I would need to pin the socket connection to the same server to avoid hitting the wrong server, right? And what if an instance/zone/region goes down, the user is suddenly redirected to another zone/region, and the new server does not have the new feature's API? Probably my assumption is wrong.
So, how am I supposed to deploy new features in this case with a continuous deployment methodology?
I would say that your question is more about server/client API version compatibility than about CD. We have a similar requirement, where a server and its clients communicate and both are constantly enhanced with features. I don't know your production software architecture, which might change what you need, but I'll try to come up with some ideas.
I'm going to describe two cases which might apply to you.
First case:
Things are easier when you never face the situation where new client versions need to communicate with old server versions. The new server version is deployed first, and old clients simply do not use the new feature, as you've already pointed out. In this situation my recommendation is to deploy the server app first and then start rolling out the new client apps. If that's possible, I would do that. It applies only when the new feature doesn't force you to break the API.
Second case:
In the case that new client app versions need to talk to an old server app, which I would try to avoid at all costs, the new client needs some switch inside to deactivate a feature (say, feature B) when it's talking to an old server that doesn't support it. An API version counter could be the solution, but it requires the client to be able to distinguish between server versions. In REST you often see .../v1/.. inside the URL, but this could be solved differently as well. Hopefully the API provides some mechanism to get the version the server speaks.
We faced both cases at the same time: the protocol changed over time, including breaking changes, so we needed to implement an API version negotiation mechanism.
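A very small sketch of what such a negotiation plus feature switch could look like, in Python. This is not our actual mechanism; the version numbers, feature names and function names are all made up for illustration. The client and server each advertise the API versions they speak, the highest common one wins, and features are switched on only if the negotiated version is high enough.

```python
# Hypothetical: API versions this client build can speak.
CLIENT_SUPPORTED = range(1, 4)  # versions 1, 2 and 3

# Hypothetical: lowest API version that offers each feature.
FEATURE_MIN_VERSION = {"feature_a": 1, "feature_b": 3}


def negotiate(client_versions, server_versions):
    """Pick the highest API version both sides support, or None if there is none."""
    common = set(client_versions) & set(server_versions)
    return max(common) if common else None


def enabled_features(api_version):
    """Features the client may show when talking over the given API version."""
    if api_version is None:
        return set()
    return {
        name
        for name, minimum in FEATURE_MIN_VERSION.items()
        if api_version >= minimum
    }


old_server = negotiate(CLIENT_SUPPORTED, [1, 2])     # old server: speaks v1 and v2
new_server = negotiate(CLIENT_SUPPORTED, [2, 3, 4])  # new server: speaks v2 to v4
```

Against the old server the client falls back to version 2 and hides feature B; against the new server it uses version 3 and shows it. Because the check happens per connection, being rerouted to an older server just hides the feature again.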

Bluemix Auto Scaling API

Is there a way for me to programmatically get notified when Bluemix auto scaling has scaled up or down?
I'm reading streaming data from a queue and would like to make sure that my instances are balanced and the data is partitioned correctly.
At present this kind of notification service is not available; all you can do is query the instance scaling history in the web UI. I think this requirement is interesting and should be considered as a future feature for developers.
This kind of alert isn't available yet, but you can write a simple script that monitors the output of
cf app (appname)
It returns the number of instances running and the state of each one. With the right combination of awk and grep (or a Perl script, for example) you could have your own alerter while waiting for this kind of functionality.
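A rough Python version of such a script is sketched below. The layout of SAMPLE_OUTPUT is an assumption about what cf app prints (one row per instance, starting with #0, #1, ...); check it against your CLI version and adjust the regular expression if needed. The function names are hypothetical.

```python
import re
import subprocess

# Assumed shape of the per-instance rows in `cf app` output; verify locally.
SAMPLE_OUTPUT = """\
requested state: started
instances: 2/2

     state     since                    cpu    memory
#0   running   2016-03-01 10:00:00 AM   0.0%   100M of 256M
#1   running   2016-03-01 10:05:00 AM   0.0%   98M of 256M
"""

INSTANCE_ROW = re.compile(r"^#(\d+)\s+(\S+)", re.MULTILINE)


def parse_instances(cf_output):
    """Return {instance index: state} from text shaped like the sample above."""
    return {int(index): state for index, state in INSTANCE_ROW.findall(cf_output)}


def read_instances(app_name):
    """Run `cf app <name>` and parse it (needs an installed, logged-in cf CLI)."""
    result = subprocess.run(
        ["cf", "app", app_name], capture_output=True, text=True, check=True
    )
    return parse_instances(result.stdout)


def describe_change(before, after):
    """Return a message if the instance count changed between two polls, else None."""
    if len(after) > len(before):
        return f"scaled up: {len(before)} -> {len(after)} instances"
    if len(after) < len(before):
        return f"scaled down: {len(before)} -> {len(after)} instances"
    return None
```

Calling read_instances in a loop with a sleep in between and passing consecutive results to describe_change gives a poor man's scaling notification.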

Distributed Recovery - can this be done without timeout?

We have a mail sender application that receives a bunch of mails in one blob and then puts all those mails into the database. This can take up to ten minutes. During this process the state of the mailing is BUILDING.
When it is finished the state gets changed to READY.
When the server crashes (shouldn't happen, of course) and restarts, it looks for all mailings with status BUILDING and marks them as ERROR. We do this because we never want to send incomplete mailings.
Now we'd like to scale up using a second server. The recovery strategy above doesn't work here.
For example, server 1 is BUILDING a mailing, and server 2 crashes and restarts. Now server 2 will see the BUILDING mailing and won't know whether it has been aborted or is still running on another server.
So what's the best recovery strategy for distributed services?
(We thought about some timeout mechanism, where the BUILDING server updates a timestamp every few seconds, and when some server reboots it checks if there's a BUILDING mailing that hasn't been updated for x minutes. Then it's highly possible that this mailing has been aborted.)
EDIT:
What I'd like to achieve: If some server restarts (after a crash or just because we added a new mailing server to the cluster), it should not mark mailings as ERROR if this particular mailing is actually being built (by another server).
Nice to have: it would be good if this worked without having to store server ids, because then servers could easily be added and/or removed. Otherwise it would not be possible to remove a server completely, because there might still be a BUILDING mailing with that particular server id. That server has been removed and will never be started again, so the only server that could set the mailing to ERROR is gone.
Add two things to your state tracking: a timestamp and the server working on it.
If a server starts up and sees anything in a building state for itself it knows it failed. Conversely, if it starts up and sees something in a building state for another server, it now has information that it's going to need to look at later to see if there's a problem that needs to be addressed. You need to worry about multiple servers restarting at the same time, so you can't just have a server grab all old bundles for all servers at startup.
Or you can just use a clustering service for your OS.

What are the limitations of the Flask built-in web server?

I'm a newbie in web server administration. I've read multiple times that the Flask built-in web server is not designed for "production" and must be used only for tests and debugging...
But what if my app reaches only a thousand users who occasionally send data to the server?
If it works, when will I have to bother with configuring a more sophisticated web server? (I am looking for approximate metrics.)
In a nutshell, I would love to find out what the built-in web server can do (with approximate thresholds) and what it cannot.
Thanks a lot!
There isn't one right answer to this question, but here are some things to keep in mind:
With the right amount of horizontal scaling, it is quite possible you could keep scaling out use of the debug server forever. When exactly you would need to start scaling (or switch to using a "real" web server) would also depend on the environment you are hosting in, the expectations of the users, etc.
The main issue you would probably run into is that the server is single-threaded by default (in Flask versions before 1.0; from 1.0 on, the development server uses threads by default). Single-threaded means that it handles requests one at a time, serially. If you are trying to serve more than one request at once (including favicons and static items like images, CSS and JavaScript files, etc.), the requests will take longer. If any given request happens to take a long time (say, 20 seconds), then your entire application is unresponsive for that time. This is only the default, of course: you could bump the thread counts (or have requests be handled in other processes), which might alleviate some issues. But once again, it can still be slow under a "high" load. What is considered a "high" load will depend on your application and on your maximum acceptable response time.
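To get a feel for the difference, here is a small simulation in plain Python (no Flask involved, so it is only an analogy): three "requests" that each take 0.2 seconds are handled first one after another, then in parallel threads. handle_request and the durations are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(seconds):
    """Pretend to be a slow view function."""
    time.sleep(seconds)
    return "ok"


def serve_serially(durations):
    """Handle requests one at a time, like a single-threaded server."""
    start = time.perf_counter()
    results = [handle_request(d) for d in durations]
    return results, time.perf_counter() - start


def serve_with_threads(durations):
    """Handle each request in its own thread."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(durations)) as pool:
        results = list(pool.map(handle_request, durations))
    return results, time.perf_counter() - start


durations = [0.2, 0.2, 0.2]
serial_results, serial_time = serve_serially(durations)
threaded_results, threaded_time = serve_with_threads(durations)
```

The serial run takes roughly the sum of the durations, while the threaded run takes roughly the longest single one. In Flask itself, the corresponding switch on the development server is app.run(threaded=True).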
Another issue is security: if you are concerned at ALL about security (and not just the security of the data in the application itself, but the security of the box that will be running it as well) then you should not use the development server. It is not ready to withstand any sort of attack.
Finally, the development server could just fail outright. It is not designed to be used as a long-running process (days, weeks, months), and so it has not been well tested to work in this capacity.
So, yes, it has limitations. Yes, you could still conceivably use it in production. And yes, I would still recommend using a "real" web server. If you don't like the idea of needing to install something like Apache or Nginx, you can still go with a solution that is still as easy as "run a python script" by using some of the WSGI Standalone servers, which can run a server that is designed to be in production with something just as simple as running python run_app.py in the command line. You typically just need to create a 4-5 line python script to import and create the server object, point it to your Flask app, and run it.
gunicorn could be run with only the following on the command line, no extra script needed:
gunicorn myproject:app
...where "myproject" is the Python package that contains the app Flask object. Keep in mind that one of the developers of gunicorn would probably recommend against this approach. See https://serverfault.com/questions/331256/why-do-i-need-nginx-and-something-like-gunicorn.
The OP has long since moved on, but for those who encounter this question in the future I would just add that setting up an Apache server, even on a laptop, is free and pretty easy. It can be readily configured for as few or as many features as you want just by uncommenting or commenting out lines in the config file. There might be an even easier GUI method for doing that nowadays, but just editing the configs is simple.

Upgrading an app running on Lift Framework?

I've recently discovered the Lift framework and have read that it's stateful.
Therefore, if I had a high-traffic site running on Lift - say something that was running a chat application that required users to be logged in - and I wanted to upgrade my app, would doing so kick everyone out of chat and make them have to log in again?
None of the previous answers are correct. Many of the artefacts held within the LiftSession are non-serializable, so they can't be stuffed into a database. You have two options for doing rolling upgrades of stateful applications:
1) Session bleeding. Basically you wean sessions away from one of the deployments until they have ended or X duration passes, and then you remove that app from production whilst automatically rerouting traffic to another instance of Lift. Google around for rolling upgrades using HAProxy, as this should help you from the cluster perspective.
2) If your state is fairly trivial (mostly primitive-style types: ints, strings etc.) then you could think about using ContainerVar/MigratableSession and clustering the state using Terracotta or similar. This comes with a range of limits though, because it then uses the HTTPSession rather than LiftSession.
You might want to check out chapter 15 of Lift in Action, which covers that latter solution in a fair amount of detail.
If you keep your state in memory and redeploy the web application, that state will be lost. You could save it to a database or a file before redeploying though and read it back from there.