How to handle a large number of scheduled tasks on a web server? - scheduled-tasks

I'm developing a website (on a LAMP stack) that must handle many user-created scheduled tasks. It works as follows: a user creates an event and sets a date, and other users (as many as 63) may join. A few hours before the set date, the system must email every user subscribed to that event. And that's it.
However, I have never handled scheduling before, and the only tools I know (poorly) are cron and at. My plan is to create an at job for each event, which will call a script that fetches the subscribers' email addresses and mails them.
My question is: is my plan/design good? Is it scalable? Are there better options that I should be aware of?

Why a separate cron job for each event? I've done something similar for a newsletter with a single cron job that runs once per hour and, if there are any newsletters due, sends them. In your case you'd have a script that runs once every hour and gets the list of users subscribed to events falling within the next time window.
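For illustration, here is a minimal sketch of that hourly script in Ruby (the table and column names are assumptions; on a LAMP stack the same logic would live in a php-cli script):

require 'mysql2'
require 'mail'

# Find events starting within the next hour and mail every subscriber.
# Assumed schema: events(id, title, starts_at, notified),
# subscriptions(event_id, email).
db = Mysql2::Client.new(host: 'localhost', username: 'app', database: 'app')

events = db.query(
  "SELECT id, title FROM events
   WHERE starts_at BETWEEN NOW() AND NOW() + INTERVAL 1 HOUR
     AND notified = 0"
)

events.each do |event|
  db.query("SELECT email FROM subscriptions WHERE event_id = #{event['id'].to_i}").each do |row|
    Mail.deliver do
      to      row['email']
      from    'noreply@example.com'
      subject "Reminder: #{event['title']} starts soon"
      body    'Your event starts within the next hour.'
    end
  end
  # Mark the event so the next hourly run does not re-send.
  db.query("UPDATE events SET notified = 1 WHERE id = #{event['id'].to_i}")
end

A crontab entry along the lines of 0 * * * * /usr/bin/ruby /path/to/notify_subscribers.rb runs it hourly, and the notified flag makes the job safe to re-run.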
It will work. As far as scalability goes, at a minimum make sure the script runs in its own process so it doesn't bog down the web server unnecessarily.
Create a php-cli script perhaps?
I'm doing most of my work in Rails nowadays, and there's a wealth of background-processing libraries there. One of them is Resque, which uses the Redis server to keep track of jobs.
I found a PHP port: https://github.com/chrisboulton/php-resque
Might be overkill for your use case, but give it a shot perhaps
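For reference, a Resque job is just a class with a queue name and a perform method; here is a minimal sketch (the class name, queue name, and arguments are made up):

require 'resque'

# Workers pull from the :notifications queue and call perform with
# whatever arguments were given at enqueue time.
class EventReminderJob
  @queue = :notifications

  def self.perform(event_id)
    # Look up the event's subscribers and send the reminder emails here.
    puts "Sending reminders for event #{event_id}"
  end
end

# Called from the web app; the job is stored in Redis and a separate
# worker process picks it up, so the request returns immediately.
Resque.enqueue(EventReminderJob, 42)

Workers are started out of band (for example, QUEUE=notifications rake resque:work), and php-resque follows the same enqueue/perform conventions.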

If you would consider a proper framework that uses an application server (rather than a simple web server), Spring has a task-scheduling layer that's simple to use. Scheduling jobs on the server really requires more than a simple LAMP install can offer, but I haven't used PHP in a while, so maybe there's an equivalent.
Here's an article that compares some of your options.

Related

Workflow platform for managing the processing of incoming files

In general, I have a single workflow that I want to be able to monitor. The workflow should start whenever new files arrive, or alternatively at certain scheduled times; i.e. I want to be able to insert new "jobs" into the workflow as they come in, and have each file processed through multiple different tasks and steps. I want to be able to monitor each file going through the tasks.
The queues and the load distribution for each task might be managed by Celery, but that's not decided yet either.
I've looked at Apache Airflow, and as far as I understand at the moment it is geared more towards monitoring many different workflows, where each workflow mostly runs from start to end, rather than towards adding new files to the beginning of a flow before the previous run has ended.
Cadence Workflow seems like it can do what I need, but it also seems to be a bit of an overkill.
I'm not expecting a specific final solution here, but I would appreciate suggestions for more solutions like these that I can look into and that fit the above.
Luigi - https://luigi.readthedocs.io/en/stable/
Extremely lightweight and fast compared to Airflow.

Pattern for Google Alerts-style service

I'm building an application that is constantly collecting data. I want to provide a customizable alert system where users can specify parameters for the types of information they want to be notified about. On top of that, I'd like users to be able to specify the frequency of alerts (as they come in, daily digest, weekly digest).
Are there any best practices or guides on this topic?
My instincts tell me queues and workers will be involved, but I'm not exactly sure how.
I'm using Parse.com as my database and will also likely index everything with a Lucene-style search. So that opens up the possibility of letting a user specify a query string to define which alerts they want.
If you're using Rails, Heroku, and Parse, we've done something similar. We actually created a second Heroku app that does not have a web dyno -- it just has a worker dyno. It can still access the same Parse.com account, and it runs all of its tasks in a rake task as they specify here:
https://devcenter.heroku.com/articles/scheduler#defining-tasks
We have a few classes that can handle the heavy lifting:
class EmailWorker
  def self.send_daily_emails
    # Queries Parse for what it needs, loops through the results, sends the emails.
  end
end
We also have scheduler.rake in lib/tasks:
require 'parse-ruby-client'

# Run nightly by the Heroku Scheduler; delegates to the worker class above.
task :send_daily_emails => :environment do
  EmailWorker.send_daily_emails
end
Our scheduler panel in Heroku is something like this:
rake send_daily_emails
We set it to run every night. Note that the public-facing Heroku web app doesn't do this work; the "scheduler" version does. You just need to make sure you push to both every time you update your code. This way it's free, but if you ever want to combine them it's simple, since they share the same code base.
You can also test it by running heroku run rake send_daily_emails from your dev machine.
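If you also want the weekly-digest frequency from the question, the same pattern extends with one more task in scheduler.rake; a small sketch (EmailWorker.send_weekly_emails is an assumed method you would write alongside send_daily_emails):

# Scheduled from the Heroku panel as: rake send_weekly_emails
task :send_weekly_emails => :environment do
  EmailWorker.send_weekly_emails
end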

Is it possible to have an "internal" cron in mysql5?

The other day a friend suggested playing a browser game called OGame. If you don't know it, I'll tell you what it is: an RTS game where you build things like mining facilities, barracks, and so on. The interesting thing is that every building has a build time, and you can log off while it's building because construction keeps going.
Something like this, I would assume, is managed via the DBMS. I have records holding the end time of a construction. How do I check when to update a building? Do I need an external application that checks every second which records need updating? Is it possible with MySQL 5 to have an internal scheduler that launches a procedure on this table? And if so, is it a best practice?
I have built a similar game, and I stored the construction end times (and other events to be fired) in an events table. I wrote a PHP daemon that regularly checks the events table for expired records and acts on them accordingly.
I couldn't find a way to do it in the database itself (and if I later wanted to migrate to another DB it would need rewriting). A cron'd script may overlap; a daemon can keep track of everything all the time, and it can output debug information if events are queuing up faster than they're being processed. I also added a cron job that periodically checks that my daemon is still running and starts it otherwise.
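The polling loop itself is only a few lines in any language; here is a rough Ruby sketch of the idea (the schema and the effect of an expired event are assumptions):

require 'mysql2'

db = Mysql2::Client.new(host: 'localhost', username: 'game', database: 'game')

# Poll the events table forever; act on and remove anything expired.
loop do
  db.query("SELECT id, building_id FROM events WHERE ends_at <= NOW()").each do |event|
    # Finish the construction this event represents.
    db.query("UPDATE buildings SET level = level + 1 WHERE id = #{event['building_id'].to_i}")
    db.query("DELETE FROM events WHERE id = #{event['id'].to_i}")
  end
  sleep 1 # one pass per second is plenty for build timers
end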
Creating a daemon in PHP (if you're using PHP)
Hope that helps.

MS CRM recursive workflow and performance

I’m about to write a workflow in CRM that calls itself every day. This is a recursive workflow.
It will run on half a million entities each day and deactivate each record that has not been updated in the past 3 days.
I’m worried about performance; has anyone else done this?
I haven't personally implemented anything like this, but that's 500,000 records floating around in the DB that the async service has to keep track of, which is going to tax your hardware. In addition, CRM keeps track of recursive workflow instances. I don't have the exact specs in front of me, but if a workflow calls itself a set number of times within a certain timeframe, CRM will kill the workflow.
Could you just write a console app that asks the Crm Service for records that haven't been updated in three days, and then deactivate them? Run it as a scheduled task once a day, and then your CRM system doesn't have the burden of keeping track of all those running workflow instances.
EDIT: Ah, I see now you might have been thinking of one workflow that runs on all the records as opposed to workflows running on each record. benjynito's advice makes sense if you go this route, although I still think a scheduled task would be more appropriate than using workflow.
You'll want to make sure your workflow is running in non-peak hours. Assuming you have an on-premise installation you should be able to get away with that. If you're using a hosted instance, you might be worried about one organization running the workflow while another organization is using the system. Use the timeout and maybe a custom workflow activity, if necessary, to force the start time to a certain period.
I'm assuming you'll be as efficient as possible in figuring out which records to deactivate. (i.e. Query Expression would only bring back the records you'll be deactivating).
The built-in infinite loop-protection offered by CRM shouldn't kill your workflow instances. It stops after a call depth of 8, but it resets to 1 if no calls are made for an hour. So the fact that you're doing this once a day should make you OK on the recursive workflow front.

How can I defer processing during apache / mod_perl page rendering?

I have an apache2 / mod_perl website. On one page, I need to do some server-to-server communication via SOAP.
The results of this communication are not required for the rendering of the page (but user input is required to trigger this communication).
The SOAP communication is very slow.
So what I want to do is process and print the page for the user, then do all the SOAP work behind the scenes.
What's the best way to achieve this? Start a fork? Write the job to a file and have a cron job pick it up?
Thanks
There are two types of solutions. First, you can do what Randal Schwartz suggested here. Second, you could use a message queue like Beanstalk or Gearman. Beanstalk has a Perl client, is now persistent, and is ideal for lightweight jobs. Gearman, on the other hand, has more features and is more actively worked on. There is also TheSchwartz - use it if you can do without much documentation. cron is ideal for systematically repeating tasks; for the kind of application you have, it appears that Schedule::At might be more appropriate, unless you prefer a more generic message queue.
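To illustrate the queue option, here is a rough sketch against Beanstalk using Ruby's beaneater client (the tube name and payload are made up; the Perl client follows the same put/reserve protocol):

require 'beaneater'
require 'json'

beanstalk = Beaneater.new('localhost:11300')
tube = beanstalk.tubes['soap-calls']

# In the page handler: enqueue the slow SOAP work and return at once.
tube.put({ user_id: 42, action: 'sync' }.to_json)

# In a separate worker process: block until a job arrives, do the
# SOAP call, then delete the job so it is not re-delivered.
loop do
  job = tube.reserve
  payload = JSON.parse(job.body)
  # ... perform the slow SOAP call with payload here ...
  job.delete
end

The page renders immediately; the worker does the SOAP communication on its own schedule.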
Also see an old StackOverflow Thread here