I have several Perl scripts for data download, validation, database upload, etc. I need to write a job controller that can run these scripts in a specified manner.
Is there any job controller module in Perl?
There are a bunch of options and elements to what you're looking for.
Here, for instance, is a "job persistence engine":
http://metacpan.org/pod/Garivini
What you want is probably something more comprehensive. You could go big with something like Bamboo, which is a continuous integration/build system. There are several of those if you want to go down that route:
http://en.wikipedia.org/wiki/Continuous_integration
Or you could start with something like RabbitMQ, which bills itself as a message queuing system but has the ability to restart failed jobs and execute things in order, so it has some resilience built in. However, the actual job control software (whatever watches the queue and executes the jobs) might need to be written by you, using the Net::RabbitMQ module. I'm not sure.
http://metacpan.org/pod/Net::RabbitMQ
Here is a (Ruby) example of using RabbitMQ to manage job queuing.
How do I trigger a job when another completes?
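For illustration only, here is a rough sketch of that kind of watcher, shown in Python with the pika client rather than Perl's Net::RabbitMQ; the queue name and the idea of treating each message body as a command to run are my own assumptions:

    import subprocess
    import pika

    # Assumes a RabbitMQ broker on localhost with default credentials.
    connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="jobs", durable=True)

    def run_job(ch, method, properties, body):
        # Treat each message body as a command line; ack only on success.
        result = subprocess.run(body.decode().split(), check=False)
        if result.returncode == 0:
            ch.basic_ack(delivery_tag=method.delivery_tag)
        else:
            # Put the message back so the failed job can be retried.
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

    channel.basic_qos(prefetch_count=1)  # hand out one job at a time, in order
    channel.basic_consume(queue="jobs", on_message_callback=run_job)
    channel.start_consuming()

The publisher side would simply basic_publish the script names onto the same queue; the Perl equivalent would follow the same shape.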
In a typical web application, there are some things that I would prefer to run as delayed jobs/tasks. They tend to have some or all of the following properties:
Take a long time (anywhere from multiple seconds to multiple minutes to multiple hours).
Occupy some resource heavily (CPU, network, disk, external API limits, etc.)
Result not immediately necessary. Can complete HTTP response without it. OK (and possibly even preferable) to delay until later.
Can be (and possibly should be) run on different machine(s) than the web server(s). The machine(s) are potentially dedicated job/task runners.
Should be run in response to other event(s), or started periodically.
What would be the preferred way(s) to set up, enqueue, schedule, and run delayed jobs/tasks in a Scala + Play Framework 2.x app?
For more details...
The pattern I have used in the past, and which I would like to replicate if applicable, is:
In the handler of a web request, or in a cron-like call, enqueue job(s)
In job runner(s), repeatedly dequeue and run one job at a time
Possibly handle recording job results
This seems to be a relatively simple yet still relatively flexible pattern (sketched below).
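To pin down the shape of what I mean, here is a rough sketch in Python against a Redis list; this is purely illustrative and not the stack in question (the handlers table is a placeholder):

    import json
    import redis  # assumption: a Redis list standing in for the queue

    r = redis.Redis()

    def enqueue(job_name, **params):
        # Called from a web request handler or a cron-like trigger.
        r.rpush("jobs", json.dumps({"name": job_name, "params": params}))

    def run_worker(handlers):
        # Dedicated job-runner process: dequeue and run one job at a time.
        while True:
            _, raw = r.blpop("jobs")
            job = json.loads(raw)
            result = handlers[job["name"]](**job["params"])  # placeholder handler table
            # Possibly handle recording job results somewhere durable.
            r.hset("job_results", job["name"], json.dumps(result))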
Examples I have encountered in the past include:
Updating derived data in DB
Analytics/tracking API calls for a web request
Deleting expired sessions or other stale/outdated DB records
Periodic batch ETLs
In other languages/frameworks, I would typically use a job/task framework. Examples include:
Resque in a Ruby + Rails app
Celery in a Python + Django app
I have found the following existing materials, but unfortunately, I don't think they fit my use case directly.
Play 1.x asynchronous jobs API (+ various SO questions referencing it). Appears to have been removed in the 2.x line, with no reference to what replaced it.
Play 2.x Akka integration. Seems very general-purpose. I'd imagine it's possible to use Akka for the above, but I'd prefer not to write a jobs/tasks framework if one already exists. Also, no info on how to separate the job runner machine(s) from your web server(s).
This SO answer. Seems potentially promising for the "short to medium duration IO bound" case, e.g. analytics calls, but not necessarily for the "CPU bound" case (probably shouldn't tie up CPU on web server, prefer to ship off to different node), the "lots of network" case, or the "multiple hour" case (probably shouldn't leave that in the background on the web server, even if it isn't eating up too many resources).
This SO question, and related questions. Similar to above, it seems to me that this covers only the cases where it would be appropriate to run on the same web server.
Some further clarification on use-cases (as per commenters' request). There are two main use-cases that I have experienced with something like resque or celery that I am trying to replicate here:
Some event on the site triggers a task. (Most often, an incoming web request causes a task to be enqueued.)
A task should run periodically. (Most often, this is implemented as: periodically, enqueue a task to be run as above.)
In the case of resque or celery, the tasks enqueued by both use-cases enter queues the same way and are treated the same way by the runner/worker process. Barring other Scala or Play-specific considerations, that would be my initial guess for how to approach this.
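In Celery terms, that approach looks roughly like the following sketch, where the same task is enqueued both from a request handler and on a schedule (the broker URL and task body are placeholders):

    from celery import Celery
    from celery.schedules import crontab

    app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder broker

    @app.task
    def update_derived_data(record_id=None):
        # Placeholder body: e.g. recompute derived data in the DB.
        pass

    # Use case 1: some event on the site (a web request handler) enqueues the task:
    #   update_derived_data.delay(record_id=123)

    # Use case 2: the same task is enqueued periodically by the beat scheduler
    # (Celery 4+ style configuration).
    app.conf.beat_schedule = {
        "update-derived-data-hourly": {
            "task": "tasks.update_derived_data",
            "schedule": crontab(minute=0),
        },
    }

Both paths put the task on the same queue, and the same worker processes pick it up; that is the behaviour I would like to reproduce in a Scala/Play setting.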
Some further clarification on why I do not believe the Akka scheduler fits my use case out-of-the-box (as per commenters' request):
While it is no doubt possible to construct a fitting solution using some combination of the Akka scheduler (for periodic jobs), akka-remote and akka-cluster (for communicating between the job caller and the job runner), that approach requires a certain amount of glue code which is almost a delayed job framework in and of itself. If it exists, I would prefer to use an existing out-of-the-box solution rather than reinvent the wheel.
I am using celery to process and deploy some data and I would like to be able to send this data in batches.
I found this:
http://docs.celeryproject.org/en/latest/reference/celery.contrib.batches.html
But I am having problems with it, such as:
It does not play nicely with eventlet, putting exceptions in the log files stating that the timer is null after the queue is empty.
It seems to leave additional hanging threads after calling celery multi stop.
It does not appear to adhere to the standard logging of a typical Task.
It does not appear to retry the task when raise mytask.retry() is called.
I am wondering if others have experienced these problems, and whether there is a solution.
I am fine with implementing batch processing on my own, but I do not know a good strategy to make sure that all items are deployed (i.e. even those at the end of the thread).
Also, if the batch fails, I would want to retry the entire batch. I am not sure of any elegant way to do that.
Basically, I am looking for any viable solution for doing real batch processing with celery.
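For what it's worth, a rough sketch of the kind of thing I would otherwise implement myself is to chunk the items up front and retry a whole chunk as one task (deploy() is a placeholder for my per-item work):

    from celery import Celery

    app = Celery("batches", broker="amqp://localhost")  # placeholder broker

    @app.task(bind=True, max_retries=3, default_retry_delay=60)
    def deploy_batch(self, items):
        # Process one batch; if anything fails, retry the entire batch.
        try:
            for item in items:
                deploy(item)  # placeholder per-item deploy function
        except Exception as exc:
            raise self.retry(exc=exc)

    def enqueue_in_batches(items, batch_size=100):
        # Chunk the full list and enqueue one task per chunk, so the items
        # at the end of the list are never left behind.
        for i in range(0, len(items), batch_size):
            deploy_batch.delay(items[i:i + batch_size])

I am not sure whether this is a sound strategy, or whether celery.contrib.batches can be made to behave this way.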
I am using Celery v3.0.21
Thanks!
I have looked at the documentation for both Celery and IPython Parallel, but I am not sure which is the best choice for a given application. I have looked more closely at Celery, so the example will be given in those terms.
My use case is similar to this question, with each worker loading a large file remotely (one file per machine); however, I also need workers to contain persistent objects. So, if a worker completes a task and returns a result, and is then called again, I need to use a previously created variable for the new task.
Repeating the object creation at each task call is far too wasteful. I haven't seen a Celery example that leads me to believe this is possible; I was hoping to use the worker_init signal to accomplish this.
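Something along these lines is what I had in mind, using Celery's per-process signal so the object is built once per worker and then reused across task calls (the loader and the lookup() call are placeholders):

    from celery import Celery
    from celery.signals import worker_process_init

    app = Celery("tasks", broker="amqp://localhost")  # placeholder broker

    big_object = None  # module-level slot for the per-worker persistent object

    @worker_process_init.connect
    def load_big_object(**kwargs):
        # worker_process_init fires in each pool child process as it starts
        # (unlike worker_init, which fires in the parent), so each worker
        # builds its object exactly once, before it starts taking tasks.
        global big_object
        big_object = load_large_file("/data/big_file.bin")  # placeholder loader

    @app.task
    def process(record):
        # Reuse the object created at worker start-up instead of
        # recreating it for every task call.
        return big_object.lookup(record)  # placeholder method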
Finally, I need a central hub to keep track of what all the workers are doing. This seems to imply a client-server architecture rather than the one provided by Celery; is this correct? If so, would IPython Parallel be a good choice given the requirements?
I'm currently evaluating Celery vs IPython parallel as well. Regarding a central hub to keep track of what the workers are doing, have you checked out the Celery Flower project here? It provides a webpage that allows you to view the status of all tasks in the queue.
I'd like to move my scheduled tasks into workflows so I can better monitor their execution. Currently I'm using a Windows scheduled task to call a web service that starts the process. Is there a facility that you use to schedule execution of a sequence so that it occurs every N minutes?
My optimal solution would:
Be easy to configure
Provide useful feedback on errors
Be 'fire and forget'
PS - I'm trying out AppFabric for Windows Server, if that adds any options.
The most straightforward way I know of would be to make an executable for each workflow (it could be a console or Windows app), and have it host the workflow through code.
This way you can continue to use scheduled tasks to manage the jobs; the main issue is feedback/monitoring of the process. For this you could output to the console, write to the event log, or even build a more advanced visualisation with a Windows app - although you'd have to write this yourself (or google for something!). This MS Workflow Monitoring sample might be of interest, though I haven't used it myself.
Similar deal with errors, although writing to the event log would be the normal course of action in this case.
I'm not aware of any other hosts for WF, aside from things like Dynamics CRM, but that won't help you with what you're trying to do.
You need to use a scheduler. Either roll your own, use AppFabric as mentioned, or use Quartz.NET:
http://quartznet.sourceforge.net/
If you use Quartz, you can either roll your own service host or use the ready-made one and configure it using XML. I rolled my own and it worked fine.
Autorun is another free option... http://autorun.codeplex.com/
I am writing a bulk mail scheduler controlled from a Perl/CGI application and would like to learn about "good" ways to fork a CGI program to run a separate task. Should one do it at all? Or is it better to suffer the overhead of running a separate job-queue engine like Gearman or TheSchwartz, as has been suggested recently? Does the answer/perspective change when using a near-MVC framework like CGI::Application over vanilla CGI.pm? The last question comes from a possible project I have in mind for a CGI::Application plugin that would make "forking" a process relatively simple to call.
Look at Proc::Daemon - it's the simplest thing that works. From your CGI script, do the CGI business (getting input, returning a response to the browser), then call Proc::Daemon::init() which does the fork, daemonizes your process and makes the parent exit. Then your script (now a daemon) does its long-running tasks and exits when they're done.
You'll want to update something (file, database record) while running as a daemon, so subsequent CGI invocations can check what it did (or how it's progressing).
Would something like POE be useful? It's more event-driven than forked, but it may meet your needs.