Feedback desired: non-disruptive deployment strategies for production Lisp webapps - deployment

I am interested in hearing how people do their Lisp webapp deployments and updates (especially updates) in production.
In Ruby many, myself included, use Capistrano for deployments. It provides some nice indirection and the ability to execute commands remotely and most importantly (in my mind) the ability to rollback to a working code base.
I know that the idea of a long running Lisp process being connected to via Swank through an SSH tunnel and modified in place is a popular idea that's knocked around, but I haven't drunk that Koolaid, mostly because of the issue of updating a stateful process (which seems like asking for trouble if something goes wrong - like unforeseen impedance mismatches between current state in memory and new object definitions that will soon be in memory).
Given that you can create nearly stateless (or completely) webapps using hunchentoot (or insert your favorite Lisp app server here), it seems like using something like Capistrano could be used for non-disruptive updates to Lisp code too if the Lisp process(es) hide behind nginx in its upstream channel and if you can correctly choreograph taking down the hunchentoot processes and spin them back up after an update to code, e.g., bring them back up all the while leaving at least one hunchentoot process running in the cluster at any given moment (CGI or mod_lisp could be used, but I am not particularly interested in that approach - though if you really like that approach, please at least say something about it, I want to learn). For instance, using Passenger (which is comparing oranges to apples since it spins up processes on demand), you touch tmp/restart.txt and the app server restarts this time with freshly updated code - no interruptions from the users perspective.
Well, this is a bit of a ramble, and actually I am about to try all this out, but I'd like to get some feedback on these ideas from others. Maybe you have a better idea.
Thanks

You can accomplish non-disruptive (zero downtime) deployments by writing capistrano scripts for an intelligent front-end/load balancer like HAProxy that pulls app servers out of rotation, restarts them with the newly deployed code, and puts them back in the mix.
By incrementally rolling your appservers while they are out of live rotation in production you can achieve smooth deployments.
This doesn't touch on having persistent app server loops with specific state, that seems scary for exactly the reasons you mentioned. REPLs are cool for debugging and tweaking, but your instincts to run the code on disk seem well founded.

Related

how do you dynamically insert console logs on a development server

When you're developing on localhost, then you've got full access to a terminal that you can log anywhere you want. But, in a project, I work on (and am new to team collaboration as a whole) they use something called weavescope to view logs that developers have created at the time of coding.
Now what the difference between this and logging locally, everytime you'll create a change in the code, you gotta send a pull request, they approve it, and merge it, deploy it and we finally see it in the log. Now, sometimes the state of local and deployed things don't match and it really makes us wanna dynamically log on to the development server without having to go through all these cycles over again. Is there any solution already around that helps us insert some quick log statements without having to go through the routine PR, merge, deploy cycle?
EDIT: I think from discussions I had below, the tool I am looking for is more or less a logging statment code injection tool. A tool that would keep track of the logs I'm inserting into the production code, and on/off them at spin of a command.
This seems like something that logging levels can help with (unless I'm misunderstanding). Something I typically do is leave debug-level log messages on commonly problematic or complex functions, but change the logging level to something higher when I move out of local. Sometimes depending on the app and access these can be configured at the environment rather than in the build.
For example there are Spring libraries that will let you import a static logger, set the level of each message you log out. Then locally you can keep the level at DEBUG, in UAT the level can be INFO, and if you only want ERROR OR WARN messages in prod you can separate that too. At the time of deployment you can set what environment it is and store a separate app.properties or yml file for each environment storing the desired level for each
Of course there is a solution for fast pace code changes.
Maybe this kind of hot reloading is what you're looking for. This way you can insert new calls to a logger or console.log quickly.
Although it does come with a disclaimer from the author.
I honestly haven't looked into whether this method of hot reloading would provide stable production zero-downtime deploys, however my "gut feel" says don't do it. And production deployments are probably one area where we should stick to known, trusted procedures unless we have good reason.

What are the limitations of the flask built-in web server

I'm a newbie in web server administration. I've read multiple times that flask built-in web server is not designed for "production", and must be used only for tests and debug...
But what if my app touchs only a thousand users who occasionnaly send data to the server ?
If it works, when will I have to bother with the configuration of a more sophisticated web server ? (I am looking for approximative metrics).
In a nutshell, I would love to find what the builtin web server can do (with approx thresholds) and what it cannot.
Thanks a lot !
There isn't one right answer to this question, but here are some things to keep in mind:
With the right amount of horizontal scaling, it is quite possible you could keep scaling out use of the debug server forever. When exactly you would need to start scaling (or switch to using a "real" web server) would also depend on the environment you are hosting in, the expectations of the users, etc.
The main issue you would probably run into is that the server is single-threaded. This means that it will handle each request one at a time, serially. This means that if you are trying to serve more than one request (including favicons, static items like images, CSS and Javascript files, etc.) the requests will take longer. If any given requests happens to take a long time (say, 20 seconds) then your entire application is unresponsive for that time (20 seconds). This is only the default, of course: you could bump the thread counts (or have requests be handled in other processes), which might alleviate some issues. But once again, it can still be slow under a "high" load. What is considered a "high" load will be dependent on your application and the expectations of a maximum acceptable response time.
Another issue is security: if you are concerned at ALL about security (and not just the security of the data in the application itself, but the security of the box that will be running it as well) then you should not use the development server. It is not ready to withstand any sort of attack.
Finally, the development server could just fail outright. It is not designed to be used as a long-running process (days, weeks, months), and so it has not been well tested to work in this capacity.
So, yes, it has limitations. Yes, you could still conceivably use it in production. And yes, I would still recommend using a "real" web server. If you don't like the idea of needing to install something like Apache or Nginx, you can still go with a solution that is still as easy as "run a python script" by using some of the WSGI Standalone servers, which can run a server that is designed to be in production with something just as simple as running python run_app.py in the command line. You typically just need to create a 4-5 line python script to import and create the server object, point it to your Flask app, and run it.
gunicorn could be run with only the following on the command line, no extra script needed:
gunicorn myproject:app
...where "myproject" is the Python package that contains the app Flask object. Keep in mind that one of developers of gunicorn would probably recommend against this approach. See https://serverfault.com/questions/331256/why-do-i-need-nginx-and-something-like-gunicorn.
The OP has long-since moved on, but for those who encounter this question in the future I would just add that setting up an Apache server, even on a laptop, is free and pretty easy. It can be readily configured for as few or as many features as you want just by uncomment in or commenting out lines in the config file. There might be an even easier GUI method for doing that nowdays, but just editing the configs is simple.

How to stop users from visiting staging area after production deployment

We have a few servers that have different roles. For instance, we have production servers, and testing/staging servers. We have a few end users who forget to switch paths to production once things are tested and approved or use; They use the new paths for a bit, then revert back to using the testing/staging at some point for some reason that we can't understand other than stupidity. We still want to be able to get a glimpse into our staging environment after pushing a build into production, but we want to stop them from being able to still hit those servers/services.
We are now pondering some solutions to this problem. One being never give them the direct staging url. An idea would be to create a virtual directory or have a set of domain aliases that we could give them and then shut down while still allowing us access to these endpoints. We could restrict our main staging domain to the office ip range so they never have direct access and call it good.
Does this sound like a good solution? Is our process wrong, are there better routes?
I am interested in solutions for websites as well as web services where visuals can't be used effectively.
We've run into this at my work as well… quite recently in fact. One thing that I thought about other than the virtual directory was setting up specific ports for them to test on then either take the ports down or change them for our internal uses only.
Well without details in how your application is deployed it could be troublesome to give concrete examples. One wonderful solution is to get better users :P Perhaps a more possible solution however is to let your production boxes move a certain set of users(as decided in your code) to your test/staging systems. I.E. the User always connects to Production, but the production machines at connect/auth time, may decide these people are too cool for production let them run the test/staging code instead.
It's not a fullproof method of course, but it works for many many websites to let a certain set of users into different parts of their codebase.
I don't know how feasible this would be for you, but it's a possibility perhaps.
I find that users sometimes have difficulties with URLs, and don't like to have subtle changes like port number in the address.
The best approach I've found is to have the application tell the user what environment they are in.
For example, my teams have used absolutely positioned headers or footers, color coded for Dev/Staging environments that show the application version number with an alpha/beta tag, along with a message that says "Work done on this site will be lost, use Production (link) to keep your work." Typically we make the Dev area red, and the staging area yellow. We also like to put a link to the bug tracking system right in this area.
On production there is not usually a region like this. However, we do sometimes provide positive reinforcement by placing a green region, with the app version and a Production tag in it, and then fade the green region away after a few seconds. This helps keep the app front and center, but let's the user know they are in the right place.

How is Accurev Performance?

How is performance in the current version (4.7) of Accurev?
time to checkout per 100mb, per gb?
time to commit per # of files or mb?
responsiveness of gui when 100+ streams?
I just had a demo of Accurev, and the streams look like a lightweight way to model workflow around code/projects. I've heard people praising Accurev for the streams back end and complaining about performance. Accurev appears to have worked on the performance, but I'd like to get some real world data to make sure it isn't a case of demos-well-runs-less-well.
Does anyone have Accurev performance anecdotes or (even better) data from testing?
I don't have any numbers but I can tell you where we have noticed performance issues.
Our builds typically use 30-40K files from source control. In my workspace currently there are over 66K files including build intermediate and output files, over 15GB in size. To keep AccuRev working responsively we aggressively use the ignore elements so AccuRev ignores any intermediate files such as *.obj. In addition we use the time stamp optimization. In general running an update is quick, but the project sizes are typically 5-10 people so normally only a couple of dozen files come down if you update daily. Even if someone made changes that touched lots of files speed is not an issue. On the other hand a full populate of all 30K+ files is slow. I don't have a time since I seldom do this and on the rare occasion I do, I run the populate when I'm going to lunch or a meeting. I expect it could be as much as 10 minutes. In general source files come down very quickly, but we have some large binary files, 10-20MB, that take a couple of seconds each.
If the exclude rules and ignore elements are not correctly configured, AccuRev can take a couple of minutes to run an update for workspaces of this size. When I hear of other developers complaining about the speed I know something is miss-configured and we get it straightened out.
A year or so ago one of the project updated boost with 25K+ files and also added FireFox to the repository (forget the size but made boost look small.) They also added ICU, wrote a lot of software and modified countless files. In all I recall there were approx 250K+ files sitting in a stream. I unfortunately decided that all their good code should be promoted to the root so all projects could share. This turned out to be a little beyond what AccuRev could handle well. It was a multi hour process getting all the changes promoted. As I recall once FireFox was promoted the rest went smoothly - perhaps a single transaction with over 100K files was the issue?
I recently updated boost and so had to keep and promote 25K+ files. It took a minute or two but not unreasonable considering the number of files and the size of the binaries.
As for the number of streams, we have over 800 streams and workspaces. Performance here is not an issue. In general I find the large number of streams hard to navigate so I run a filtered view of just my workspaces and the just streams I'm interested in. However when I need to look at the unfiltered list to find something performance is fine.
As a final note, AccuRev support is terrific - we call them the voice in the sky. Every now and again we shoot ourselves in the foot using AccuRev and wind up clueless on how to fix things. Almost always we did something dumb and then tried something dumber to fix it. Eventually we place a support request and next thing we know they are walking us through the steps to righteousness either on the phone or a goto meeting. I've even contacted them for trivial things that I just don't have time to figure out as I'm having a hectic day and they kindly walk me through it rather than telling me to RTFM.
Edit 2014: We can now get acceptable X-Windows performance by using the commercial version of RealVNC.
Original comment:This answer applies to any version of Accurev, not just 4.7. Firstly, GUI performance might be OK if you can use the web client. If you can't use the web client and if you want GUI performance then you'd better be using Windows, or have all your developers in one place, i.e. where the Accurev server is located. Try to run the GUI on X-Windows over a WAN ? Forget it : our experience has been dozens of seconds or minutes for basic point and click operations. This is over a fairly good WAN about 800 miles distant, with an almost optimal ping time. This is not a failing of Accurev, but of X-Windows, and you'll likely have similar problems with other X applications over a WAN. So avoid basic X if you possibly can. Currently we cannot, and our WAN users are forcibly relegated to command-line only. The basic problem is that Accurev is is centralized and you can't increase the speed of light. I believe you can get around WAN latency by running Accurev Replication Servers, but that still does not properly address the problem if you have remote developers at single-person offices over VPN. It is ironic that the replication servers somewhat turn this centralized VCS into a form of DVCS. If you don't have replication servers then a horrible but somewhat workable work-around is to use a delta-synchronization tool such as rsync to sync your source tree between your local machine where you can run the GUI (i.e. GUI running directly on your Windows or Linux laptop), and the machine where you're actually working (e.g. UNIX machine 1,000 miles away). Another option is to use something like VNC which works better over a WAN than X, connecting to a virtual desktop at the Accurev server's location, and use X from there. At my workplace more than one team has resorted to using Mercurial on the side and promoting to Accurev only when it's strictly necessary. As Stephen Nutt points out above, other necessary work is to use time-stamp optimization and ignores. We also have our Accurev admins (yes, it requires you employ people to baby sit it) complain when we need to include large numbers of files, despite the fact they form a core part of our product and MUST be included and version controlled. Draw your own conclusions.

How to limit the effect of client modifications to production systems

Our shop has developed a few WEB/SMS/DB solution for a dozen client installations. The applications have some real-time performance requirements, and are just good enough to function properly. The problem is that the clients (owners of the production servers) are using the same server/database for customizations that are causing problems with the performance of the applications that we created and deployed.
A few examples of clients' customizations:
Adding large tables with many text datatypes for the columns that get cast to other data types in the queries
No primary keys, indexes, or FK constraints
Use of external scripts that use count(*) from table where id = x, in a loop from the script, to determine how to construct more queries later in the same script. (no bulk actions that the planner can optimize or just do everything in a single pass)
All new code files on the server are created/owned by root, with 0777 permissions
The clients don't take suggestions/criticism well. If we just go ahead and try to port/change the scripts ourselves, the old code can come back, clobbering any changes that we make! Or with out limited knowledge of their use cases, we break functionality while trying to optimize their changes.
My question is this: how can we limit the resources to queries/applications other that what we create and deploy? Are there any pragmatic options in scenarios like this? We prided ourselves in having an OSS solution, but it seems that it's become a liability.
We use PG 8.3 running on a range on Linux Distos. The clients prefer php, but shell scripts, perl, python, and plpgsql are all used on the system in one form or another.
This problem started about two minutes after the first client was given full access to the first computer, and it hasn't gone away since. Anytime someone whose priorities are getting business oriented work done quickly they will be sloppy about it and screw up things for everyone. That's just how things work, because proper design and implementation are harder than cheap hacks. You're not going to solve this problem, all you can do is figure out how to make it easier for the client to work with you than against you. If you do it right, it will look like excellent service rather than nagging.
First off, the database side. There's now way to control query resources in PostgreSQL. The main difficulty is that tools like "nice" control CPU usage, but if the database doesn't fit in RAM it may very well be I/O usage that is killing you. See this developer message summarizing the issues here.
Now, if in fact it's CPU the clients are burning through, you can use two techniques to improve that situation:
Install a C function that changes the process priority (example 1, example 2) and make sure whenever they run something it gets called first (maybe put it into their psql config file, there are other ways).
Write a script that looks for postmaster processes spawned by their userid and renice them, make it run often in cron or as a daemon.
It sounds like your problem isn't the particular query processes they're running, but rather other modifications they're making to the larger structure. There's only one way to cope with that: you have to treat the client like they're an intruder and use the approaches of that portion of the computer security field to detect when they screw things up. Seriously! Install an intrusion detection system like Tripwire on the server (there are better tools, that's just the classic example), and have it alert you when they touch anything. New file that's 0777? Should jump right out of a proper IDS report.
On the database side, you can't directly detect the database being modified usefully. You should do a pg_dump of the schema every day into a file (pg_dumpall -g and pg_dump -s, then diff that against the last one you delivered and again alert you when it's changed. If you manage that this well, the contact with the client turns into "we noticed you changed on the server...what is it you're trying to accomplish with that?" which makes you look like you're really paying attention to them. That can turn into a sales opportunity, and they may stop fiddling with things as much just knowing you're going to catch it immediately.
The other thing you should start doing immediately is install as much version control software as you can on each client box. You should be able to login to each system, run the appropriate status/diff tool for the install, and see what's changed. Get that mailed to you regularly too. Again, this works best if combined with something that dumps the schema as a component to what it manages. Not enough people use serious version control approaches on the code that lives in the database.
That's the main set of technical approaches useful here. The rest of what you've got is a classic consulting client management problem that's far more of a people problem than a computer one. Cheer up, it could be worse--FSM help you if you give them ODBC access and they discover they can write their own queries in Access or something simple like that.