Batch processing with Celery?

I am using Celery to process and deploy some data, and I would like to be able to send this data in batches.
I found this:
http://docs.celeryproject.org/en/latest/reference/celery.contrib.batches.html
But I am having problems with it, such as:
- It does not play nicely with eventlet, putting exceptions in the log files stating that the timer is null after the queue is empty.
- It seems to leave additional threads hanging after calling celery multi stop.
- It does not appear to adhere to the standard logging of a typical Task.
- It does not appear to retry the task when raise mytask.retry() is called.
I am wondering if others have experienced these problems, and whether there is a solution.
I am fine with implementing batch processing on my own, but I do not know a good strategy to make sure that all items are deployed (i.e. even the items left over at the end).
Also, if the batch fails, I would want to retry the entire batch; I am not sure of an elegant way to do that.
Basically, I am looking for any viable solution for doing real batch processing with celery.
I am using Celery v3.0.21
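To make the last two points concrete, here is an untested sketch of the do-it-yourself approach I have in mind; the broker URL, the deploy() function, and the batch size are placeholders, and self.retry via bind=True is Celery 3.1+ style (on 3.0 one would call process_batch.retry(exc=exc) instead):

    from celery import Celery

    app = Celery('batcher', broker='redis://localhost:6379/0')  # placeholder broker URL

    @app.task(bind=True, max_retries=3, default_retry_delay=30)
    def process_batch(self, items):
        # Deploy the whole batch in one task, so a failure retries the entire batch
        try:
            deploy(items)  # placeholder for the real deployment logic
        except Exception as exc:
            raise self.retry(exc=exc)

    def send_in_batches(source, batch_size=100):
        # Chunk the incoming items and enqueue one task per chunk; the final
        # partial chunk is flushed too, so trailing items are never dropped
        batch = []
        for item in source:
            batch.append(item)
            if len(batch) >= batch_size:
                process_batch.delay(batch)
                batch = []
        if batch:
            process_batch.delay(batch)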
Thanks!

Related

How do I continue where I left off in Spring Batch?

So I wrote an ItemReader. When this app is run from the command line again, I want it to continue reading from where it left off. How do I do that?
I've added spring-task. It seems to track certain things. Does it help with this?
Everything I have read online seems to be about restarting a job after a failure, which I don't think I have any use for. I've added all of my state to the ExecutionContext. Should I use the JobRepository and look for the last successful execution?

Intercepting and stopping a celery beat task before publishing to message bus

I am using signals to intercept celery beat tasks before they are published. This works fine, but in addition I want to execute some logic and, based on the result, possibly cancel the task.
I cannot find a way to cancel the task from the signal handler, aside from raising an exception, and that seems very inelegant.
The background is that I am implementing distributed task processing using cache locks and I am performing CAS operations on the lock before publishing.
Is there any way to implement this using current celery/celerybeat functionality?
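For concreteness, here is an untested sketch of roughly what such a handler looks like, assuming Celery 3.1+'s before_task_publish signal (older versions expose task_sent instead) and a placeholder cas_acquire() helper for the cache-lock CAS:

    from celery.signals import before_task_publish

    @before_task_publish.connect
    def gate_on_lock(sender=None, headers=None, body=None, **kwargs):
        # sender is the task name; cas_acquire is a placeholder for the
        # compare-and-swap on the cache lock described above
        if not cas_acquire(sender):
            # Raising is, as noted, the only way to abort publishing from here
            raise RuntimeError('suppressed %s: lock held elsewhere' % sender)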
Thanks

Perl Job controller

I have several Perl scripts for data download, validation, database upload, etc. I need to write a job controller that can run these scripts in a specified manner.
Is there a job controller module in Perl?
There are a bunch of options and elements to what you're looking for.
Here for instance is a "job persistence engine"
http://metacpan.org/pod/Garivini
What I think you want might be more comprehensive. You could go big with something like Bamboo, which is a continuous integration/build system; there are several of those if you want to go down that route:
http://en.wikipedia.org/wiki/Continuous_integration
Or you could start with something like RabbitMQ, which bills itself as a message-queuing system but has the ability to restart failed jobs and execute things in order, so it has some resilience built in. The actual job-control software (whatever watches the queue and executes jobs) might need to be written by you, though, using the Net::RabbitMQ module; I'm not sure.
http://metacpan.org/pod/Net::RabbitMQ
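To illustrate the watcher idea, here is a minimal sketch (in Python with pika, purely for illustration; Net::RabbitMQ offers the same primitives in Perl, and the queue name and the script-name-per-message convention are assumptions):

    import subprocess

    import pika

    def handle_job(ch, method, properties, body):
        # Each message names a Perl script to run; re-queue on failure
        result = subprocess.run(['perl', body.decode()])
        if result.returncode == 0:
            ch.basic_ack(delivery_tag=method.delivery_tag)
        else:
            ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

    connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
    channel = connection.channel()
    channel.queue_declare(queue='jobs', durable=True)  # assumed queue name
    channel.basic_consume(queue='jobs', on_message_callback=handle_job)
    channel.start_consuming()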
Here is a (Ruby) example of using RabbitMQ to manage job queuing: How do I trigger a job when another completes?

Celery vs Ipython parallel

I have looked at the documentation for both, but I am not sure which is the better choice for a given application. I have looked more closely at Celery, so the example will be given in those terms.
My use case is similar to this question, with each worker loading a large file remotely (one file per machine); however, I also need the workers to hold persistent objects. So, if a worker completes a task and returns a result, then is called again, I need to reuse a previously created variable for the new task.
Repeating the object creation at each task call is far too wasteful. I haven't seen a Celery example that leads me to believe this is possible; I was hoping to use the worker_init signal to accomplish it (a sketch of what I mean follows below).
Finally, I need a central hub to keep track of what all the workers are doing. This seems to imply a client-server architecture rather than the one provided by Celery; is this correct? If so, would IPython Parallel be a good choice given these requirements?
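To make the persistent-object requirement concrete, here is an untested sketch of what I was hoping the signal could do; it uses worker_process_init (which fires once in each prefork child), and the broker URL, load_large_file(), and lookup() are placeholders:

    from celery import Celery
    from celery.signals import worker_process_init

    app = Celery('workers', broker='redis://localhost:6379/0')  # placeholder broker

    _big_object = None  # one copy per worker process, created once

    @worker_process_init.connect
    def create_big_object(**kwargs):
        # Runs once when each worker process starts, so every later task
        # call in that process reuses the same object
        global _big_object
        _big_object = load_large_file('/data/big.file')  # placeholder loader

    @app.task
    def compute(x):
        return _big_object.lookup(x)  # placeholder method on the persistent object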
I'm currently evaluating Celery vs IPython parallel as well. Regarding a central hub to keep track of what the workers are doing, have you checked out the Celery Flower project? It provides a web page that lets you view the status of all tasks in the queue.
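If it helps, getting it running is typically just the following ('proj' is a placeholder for your app module):

    pip install flower
    celery -A proj flower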

Quartz job fires multiple times

I have a building block which sets up a Quartz job to send out emails every morning. The job fires three times each morning instead of once. We have a hosted instance of Blackboard, which I am told runs on three virtual servers; I am guessing this is what is causing the problem, as the building block previously worked fine on a single-server installation.
Does anyone have Quartz experience, or could anyone suggest how to prevent the job from firing multiple times?
Thanks,
You didn't describe in detail how your Quartz instance(s) are instantiated and started, but be aware that undefined behavior results if you run multiple Quartz instances against the same job store database at the same time unless you enable clustering (see http://www.quartz-scheduler.org/docs/configuration/ConfigJDBCJobStoreClustering.html).
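For reference, clustering is switched on in quartz.properties; a minimal sketch, with the instance name as a placeholder (every node must point at the same job store database):

    # Coordinate the three servers through the shared JDBC job store
    org.quartz.jobStore.class = org.quartz.impl.jdbcjobstore.JobStoreTX
    org.quartz.jobStore.isClustered = true
    org.quartz.jobStore.clusterCheckinInterval = 20000

    # All nodes share the instance name; AUTO gives each node a unique id
    org.quartz.scheduler.instanceName = MyClusteredScheduler
    org.quartz.scheduler.instanceId = AUTO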
I guess I'm a little late responding to this, but we have a similar scenario in our application. We have 4 servers running jobs, some of which can run on multiple servers concurrently and some of which should run only once. As Will's response said, you can look into the clustering features of Quartz.
Our approach was a bit different, as we had a home-grown solution in place before we switched to Quartz. Our jobs use a database table that stores the cron triggers and other job information, and then "lock" the entry for a job so that none of the other servers can execute it. This keeps jobs from running multiple times across the servers and has been fairly effective so far.
Hope that helps.
I had the same issue, but I discovered that I was calling scheduler.scheduleJob(job, trigger) to update the job data while the job was running, which randomly triggered the job 5-6 times each run. I had to use scheduler.addJob(job, true) to update the job data without updating the trigger.