Pattern for Google Alerts-style service - email

I'm building an application that is constantly collecting data. I want to provide a customizable alerts system for users where they can specify parameters for the types of information they want to be notified about. On top of that, I'd like the user to be able to specify the frequency of alerts (as they come in, daily digest, weekly digest).
Are there any best practices or guides on this topic?
My instincts tell me queues and workers will be involved, but I'm not exactly sure how.
I'm using Parse.com as my database and will also likely index everything with Lucene-style search, which opens up the possibility of letting a user provide a query string to define the alerts they want.

If you're using Rails and Heroku and Parse, we've done something similar. We created a second Heroku app that has no web dyno -- just a worker dyno. It can still access the same Parse.com account and runs all of its tasks via rake tasks, as described here:
https://devcenter.heroku.com/articles/scheduler#defining-tasks
We have a few classes that can handle the heavy lifting:
class EmailWorker
  def self.send_daily_emails
    # queries Parse for what it needs, loops through, sends emails
  end
end
We also have the scheduler.rake in lib/tasks:
require 'parse-ruby-client'

task :send_daily_emails => :environment do
  EmailWorker.send_daily_emails
end
Our scheduler panel in Heroku is something like this:
rake send_daily_emails
We set it to run every night. Note that the public-facing Heroku web app doesn't do this work; the "scheduler" app does. You just need to make sure you push to both every time you update your code. This way it's free, and if you ever wanted to combine them it's simple, since they share the same code base.
You can also test it by running heroku run rake send_daily_emails from your dev machine.


What are the limitations of the Flask built-in web server?

I'm a newbie in web server administration. I've read multiple times that the Flask built-in web server is not designed for "production" and must be used only for tests and debugging...
But what if my app only touches a thousand users who occasionally send data to the server?
If it works, when will I have to bother with the configuration of a more sophisticated web server? (I am looking for approximate metrics.)
In a nutshell, I would love to know what the built-in web server can do (with approximate thresholds) and what it cannot.
Thanks a lot!
There isn't one right answer to this question, but here are some things to keep in mind:
With the right amount of horizontal scaling, it is quite possible you could keep scaling out use of the debug server forever. When exactly you would need to start scaling (or switch to using a "real" web server) would also depend on the environment you are hosting in, the expectations of the users, etc.
The main issue you would probably run into is that the server is single-threaded. It handles each request one at a time, serially, so if you are trying to serve more than one request (including favicons and static items like images, CSS and JavaScript files), the requests will take longer. If any given request happens to take a long time (say, 20 seconds), your entire application is unresponsive for that time. This is only the default, of course: you could bump the thread count (or have requests handled in other processes), which might alleviate some issues. But even then it can still be slow under a "high" load, and what counts as "high" depends on your application and the maximum response time your users will accept.
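If it helps to see it, here is a minimal sketch of turning that option on; the app and route are placeholders for illustration, not something from the question:
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "hello"

if __name__ == "__main__":
    # threaded=True tells the built-in (Werkzeug) server to handle each
    # request in its own thread, so one slow request no longer blocks the rest
    app.run(threaded=True)
Even with threads enabled it is still the development server, so the security and stability caveats below still apply.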
Another issue is security: if you are concerned at ALL about security (and not just the security of the data in the application itself, but the security of the box that will be running it as well) then you should not use the development server. It is not ready to withstand any sort of attack.
Finally, the development server could just fail outright. It is not designed to be used as a long-running process (days, weeks, months), and so it has not been well tested to work in this capacity.
So, yes, it has limitations. Yes, you could still conceivably use it in production. And yes, I would still recommend using a "real" web server. If you don't like the idea of installing something like Apache or Nginx, you can go with a solution that is still as easy as "run a Python script" by using one of the standalone WSGI servers, which give you a production-grade server with something as simple as running python run_app.py on the command line. You typically just need a 4-5 line Python script to import and create the server object, point it at your Flask app, and run it.
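As a sketch of what such a script can look like (assuming the waitress package and a module named myproject that exposes a Flask object called app; both names are placeholders, not from the question):
# hypothetical names: "myproject" and "app" are assumptions for the example
from waitress import serve
from myproject import app

# serve the Flask (WSGI) app with a production-grade pure-Python server
serve(app, host="0.0.0.0", port=8080)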
gunicorn could be run with only the following on the command line, no extra script needed:
gunicorn myproject:app
...where "myproject" is the Python package that contains the app Flask object. Keep in mind that one of the developers of gunicorn would probably recommend against this approach. See https://serverfault.com/questions/331256/why-do-i-need-nginx-and-something-like-gunicorn.
The OP has long since moved on, but for those who encounter this question in the future I would just add that setting up an Apache server, even on a laptop, is free and pretty easy. It can readily be configured for as few or as many features as you want just by uncommenting or commenting out lines in the config file. There might be an even easier GUI method for doing that nowadays, but just editing the configs is simple.

Is there a way to enable an SQL log to see/optimize my queries using CloudSQL

I started my test of using a Google Cloud SQL instance with a desktop-based application. So far I am impressed with the performance; even though it is a bit laggy, it does the job. My next step is to see what simple modifications I can make to my application, mostly intended to reduce access to the database, and to optimize if there is anything more to do.
How can I log the SQL commands sent to the database in order to check what queries are being sent? My app uses ODBC drivers on Windows.
Regards
What you probably want is to turn on the general log. Unfortunately, that requires SUPER privileges, which were removed some time ago (announcement). We are going to provide a way to tweak parameters like that via the Cloud SQL API. For now, the best solution is to set up a local server and use the logging on that one. If you really need it in production, ping us on the google-cloud-sql-discuss Google group and we'll enable SUPER for your instance.
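For the local-server route, the general log can be switched on with SET GLOBAL statements. Here is a rough Python sketch, assuming the mysql-connector-python package and placeholder credentials for a local MySQL instance (none of which comes from the question):
import mysql.connector

# placeholders: point this at your local test server, not the Cloud SQL instance
conn = mysql.connector.connect(host="127.0.0.1", user="root", password="secret")
cur = conn.cursor()

# send the general log to the mysql.general_log table instead of a file
cur.execute("SET GLOBAL log_output = 'TABLE'")
cur.execute("SET GLOBAL general_log = 'ON'")

# ...exercise the application against this server, then inspect what it sent:
cur.execute("SELECT event_time, argument FROM mysql.general_log ORDER BY event_time")
for event_time, argument in cur.fetchall():
    print(event_time, argument)

conn.close()
Remember to turn the general log off again afterwards; it grows quickly.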

How do you do continuous deployment in an AJAX application with lots of client side interaction and local data?

We have an app that is written in PHP. The front end uses javascript heavily. Generally, for normal applications that require page reloads, continuous deployment is not really an issue, because:
The app can be deployed with build tags: myapp-4-3-2013-b1, myapp-4-3-2013-b2, etc.
When the user loads a page (we are using the front controller pattern), we can inject the buildtag and the files are loaded from the app directory with the correct build tag.
We do not need to keep the older builds around for too long because as the older requests finish, they will move to the newer build tags.
The risk of database and user data being incompatible is not very high, because we move people to the newer builds after their requests finish (more on this later).
Now, the problem with our app is that it uses AJAX heavily for smooth page loads. In addition, because there is no page refresh at all when people navigate through the application, people can keep their unsaved data in their current browser session and revisit it as long as the browser has not been refreshed.
This leads to bigger problems if we want to achieve continuous deployment:
We can keep the user's buildtag in their session (set when they make the first request) and only switch to newer buildtags after they log out and log in again. This is obviously bad, because if something like the database schema or the format of files written to disk changes in a newer build, there is no way to reconcile this.
We force all new requests to a newer build tag, but there is a possibility we change client side javascript and will break a lot of things if we force everyone with a session onto the new build tags immediately.
Obviously, the above won't occur with every build we push and hopefully won't happen a lot, but we want to build a foolproof process so that every build that passes our tests can be deployed. At the same time, we want to make sure that a deployed, test-passing build does not inadvertently break clients with running sessions and cause a whole bunch of problems.
I have done some investigation, and what Google does (at least in Google Groups) is push a message out to the clients to refresh the application (browser window). However, in their case, all unsaved client-side data (like an unsaved message) would be lost.
Given that applications that use AJAX and local data are very common these days, what are some more intelligent ways of handling this that will provide minimal disruption to users/clients?
Let me preface this by saying that I hadn't ever thought of continuous deployment before reading your post, but it does sound like quite a good idea! I've got a few examples where this would be nice.
My thoughts on solving your problem though would be to go for your first suggestion (which is cleaner), and get around the database schema changes like this:
Implement an API service layer in your application that handles the database or file access, and which sits outside of your build tag environment. For example, you'd have myapp-4-3-2013-b1 and db-services folders.
db-services would provide any interaction with the database with a series of versioned services. For example, registerNewUser2() or processOrder3().
When you needed to change the database schema, you'd provide a new version of that service and upgrade your build tag environment to look at the new version. You'd also provide a legacy service that handles the old schema to new schema upgrade.
For example, say you registered new users like this:
function registerNewUser2($username, $password, $fullname) {
    writeToDB($username, $password, $fullname);
}
And you needed to update the schema to add the user's date of birth:
function registerNewUser3($username, $password, $fullname, $dateofbirth) {
    writeToDB($username, $password, $fullname, $dateofbirth);
}

function registerNewUser2($username, $password, $fullname) {
    registerNewUser3($username, $password, $fullname, NULL);
}
The new build tag will be changed to call registerNewUser3(), while the previous build tag is still using registerNewUser2().
So the old build tag will continue to work; it's just that any new users registered through it will have a NULL date of birth. When an updated build tag is used, the date of birth is written to the database correctly.
You would need to update db-services immediately, as soon as you roll out the new build tag - or even before you roll out the build tag I guess.
Once you're sure that everyone is using the new version, you can just delete registerNewUser2() from the next version of db-services.
It will be quite complicated to make sure that you are correctly handling the conversion between old and new API calls, but it might be feasible if you're already handling continuous deployment.

How to handle large amounts of scheduled tasks on a web server?

I'm developing a website (using a LAMP stack) which must handle many user-made scheduled tasks. It works as follows: a user creates an event and sets a date, and other users (as many as 63) may join. A few hours before the set date, the system must email each user subscribed to that event. And that's it.
However, I have never handled scheduling, and the only tools I know (poorly) are cron and at. My plan is to create an at job for each event, which will call a script that gets all the subscribers' emails and mails them.
My question is: is my plan/design good? Is it scalable? Are there better options that I should be aware of?
Why a separate job for each event? I've done something similar for a newsletter with a cron job that just runs once per hour; if there are any newsletters to be sent, it handles them. In your case you'd have a script that runs once every hour and gets a list of users for events that happen within the relevant upcoming time window.
It will work. As far as scalability goes, at the minimum make sure that the script runs in its own process so it doesn't bog down the server unnecessarily.
Create a php-cli script perhaps?
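To make the hourly-job idea concrete, here is a rough sketch. It is in Python purely for illustration (a php-cli script would have the same shape), and the schema, SMTP setup, and two-hour window are all assumptions rather than anything from the question:
# hypothetical schema: events(id, name, starts_at), subscriptions(event_id, email)
import sqlite3                     # stand-in for the real LAMP-stack database
import smtplib
from email.message import EmailMessage
from datetime import datetime, timedelta

WINDOW = timedelta(hours=2)        # notify a couple of hours before the event

def send_reminders(db_path="app.db"):
    now = datetime.now()
    conn = sqlite3.connect(db_path)
    # find every subscriber of every event starting within the window
    rows = conn.execute(
        "SELECT e.id, e.name, s.email FROM events e "
        "JOIN subscriptions s ON s.event_id = e.id "
        "WHERE e.starts_at BETWEEN ? AND ?",
        (now.isoformat(), (now + WINDOW).isoformat()),
    ).fetchall()
    with smtplib.SMTP("localhost") as smtp:
        for event_id, name, email in rows:
            msg = EmailMessage()
            msg["To"] = email
            msg["From"] = "noreply@example.com"
            msg["Subject"] = f"Reminder: {name} is starting soon"
            msg.set_content(f"Your event '{name}' starts within the next two hours.")
            smtp.send_message(msg)
    conn.close()

if __name__ == "__main__":
    # run from cron, e.g.:  0 * * * *  /usr/bin/python3 send_reminders.py
    send_reminders()
In a real setup you would also record which events have already been notified, so that overlapping runs don't double-send.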
I'm doing most of my work in Rails nowadays, and there's a wealth of background processing libraries there. One of them is Resque, which uses a Redis server to keep track of the jobs.
I found a PHP clone: https://github.com/chrisboulton/php-resque
It might be overkill for your use case, but give it a shot perhaps.
If you would consider a proper framework that uses an application server (and not a simple webserver), Spring has a task scheduling layer that's simple to use. Scheduling jobs on the server really requires more than what a simple LAMP install can do, but I haven't used PHP in a while so maybe there's an equivalent.
Here's an article that compares some of your options.

Can Microsoft Windows Workflow route to specific workstations?

I want to write a workflow application that routes a link to a document. The routing is based upon machines not users because I don't know who will ever be at a given post. For example, I have a form. It is initially filled out in location A. I now want it to go to location B and have them fill out the rest. Finally, it goes to location C where a supervisor will approve it.
None of these locations has a known user; that is, I don't know who it will be. I only know that whoever it is is authorized (they are assigned to the workstation and are approved to be there).
Will Microsoft Windows Workflow do this or do I need to build my own workflow based on SQL Server, IP Addresses, and so forth?
Also, how would the user at a workstation be notified that a document had been sent to their machine?
Thanks for any help.
I think if I were approaching this problem, Workflow would work for it. What you want is a state machine with three states:
A Start
B Completing
C Approving
However, Workflow needs to run in one central place (trust me on this: you only want one workflow runtime running at once, otherwise the same bit of work can be done multiple times; see our questions on the MSDN forum). So a central server running the workflow is the answer.
How you present this to the users can be done in multiple ways. Dave suggested using an ASP.NET site to identify the machines that are doing the work, which is probably how I would do it. However, you could also write a Windows Forms client that does the same thing. This would require something like SOAP / WCF to facilitate communication between the client form applications and the central workflow service, but it would have the advantage that you could use a system tray icon to alert the user.
You might also want to look at human workflow engines, as they are designed to do things such as this (and more). I'm most familiar with PNMsoft's Sequence.
You can design a generic "routing" workflow that will cause data to go to a workstation. The easiest way to do this would be to embed the workflow in an ASP.NET application. Each workstation should visit the application with a workstation ID in the querystring:
http://myapp/default.aspx?wid=01
When the form is filled out at workstation A, the workflow running in the web app can enter it into the "work bin" of the next workstation. Anyone sitting at the computer for which the form is destined will see it appear in their list of forms to review. You can use AJAX to make it slick and auto-updating.