I have a task that takes approximately 3 minutes to run. It pulls data from a remote server and performs CPU-intensive analysis on it. The task will be invoked by an API call. On that call, I plan to give the client a unique task ID and hand the task to a Celery worker. The client will then poll the server with that task ID to check whether the worker has finished the task and saved its result to a result backend. I am thinking of using nginx, gunicorn, and Flask, and dockerizing them for easy deployment in case I need to distribute this architecture across multiple machines.
The problem is that, because of the load balancer, the client may poll a different server each time. If this is not handled well, the Celery result backend on the polled server might not have the task’s result while another server’s result backend does.
Is it possible to use a single result backend across multiple Celery instances, so that different Celery instances query the same result backend? What other ways are there to solve this, other than using cloud storage like S3?
Would I have this problem only if I have multiple machines or would it happen even if I have multiple gunicorn instances in a single machine where nginx acts as a load balancer on them?
Not only is it possible for all Celery workers to use a single result backend, it is the only setup that makes sense! The same goes for the broker in most cases, unless you have a complicated Celery infrastructure with exchanges and complicated routes...
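As a rough sketch of what "one backend for everyone" can look like, the configuration module below would be shipped unchanged in every container. The hostnames `broker.internal` and `results.internal`, the environment variable names, and the choice of RabbitMQ plus Redis are assumptions made for illustration. `broker_url` and `result_backend` are Celery's lowercase setting names (older releases spell them `BROKER_URL` and `CELERY_RESULT_BACKEND`).

```python
# celeryconfig.py: loaded identically by every worker and every web process.
import os

# One shared broker. The default hostname is a placeholder, not a real service.
broker_url = os.environ.get(
    "APP_BROKER_URL", "amqp://guest:guest@broker.internal:5672//"
)

# One shared result backend. Every instance that stores or looks up a task
# result talks to this same store, so any server can answer a poll.
result_backend = os.environ.get(
    "APP_RESULT_BACKEND", "redis://results.internal:6379/0"
)
```

Since the web process that answers the poll and the worker that ran the task read the same `result_backend`, it should not matter which server the load balancer picks.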
What is the algorithm used to distribute the task load between workers in celery?
I checked the documentation, could not find the info.
This depends on the broker that is used. With Redis, for example, each worker process uses kombu's Redis transport, which in turn calls BRPOP to get the next available task. When several clients are blocked on the same key, Redis serves the one that has been waiting longest first, and that is how a given task is allocated to a given client (a waiting Celery worker process).
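To make the "longest-waiting client first" rule concrete, here is a toy, in-memory model of it. It does not talk to Redis or kombu; the class `BlockingListSim` and its method names are made up for this illustration, and the producer side is assumed to push with an LPUSH-like operation.

```python
from collections import deque


class BlockingListSim:
    """Toy model of one Redis list with clients blocked in BRPOP."""

    def __init__(self):
        self.items = deque()    # queued tasks, head on the left
        self.waiting = deque()  # blocked clients, longest-waiting on the left
        self.delivered = []     # (client, task) pairs in delivery order

    def brpop(self, client):
        # If a task is already queued, the client gets it immediately;
        # otherwise the client joins the end of the waiting line.
        if self.items:
            self.delivered.append((client, self.items.pop()))
        else:
            self.waiting.append(client)

    def lpush(self, task):
        # A new task goes to the client that has been blocked the longest,
        # or is queued if nobody is waiting.
        if self.waiting:
            self.delivered.append((self.waiting.popleft(), task))
        else:
            self.items.appendleft(task)
```

In this model a worker that has just finished a task and calls `brpop` again goes to the back of the line, so idle workers tend to be served in turn.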
Newbie alert.
I'm trying to write a simple module in Vert.x that polls the database (PostgreSQL) every 10 seconds and pushes the results to the clients. I'm thinking of confining the blocking code (which queries the database via JDBC) to a worker verticle and keeping the rest of the layers completely non-blocking and async.
This module will be packaged as a jar and distributed to different apps (typically webapps), which can subscribe to the event bus via the JavaScript bridge.
My question is: in a clustered environment where I have 5 processes of the webapp running with the Vert.x modules, how can I ensure that only one verticle queries the database? I don't want all the verticles querying the database and adding more load. Or is there a different way to approach this problem? I'm using Vert.x version 3.4.1.
So there are 2 ways your verticle can be multiplied:
If you deploy your verticle with multiple instances
If you cluster your Vert.x instances across different JVMs or different hosts
You could try to control the number of instances of the verticle that executes the query. That means ensuring that the verticle exists in only one of your Vert.x instances and is deployed there with a single instance.
But this has several drawbacks:
your deployment is not transparent, meaning your cluster nodes differ in their deployment structure.
if the cluster node running the query verticle dies, you have no fallback.
So the best thing is to deploy the verticle on all instances and synchronize it.
I see 3 possibilities:
Use Hazelcast (the cluster manager of Vert.x) to synchronize:
http://vertx.io/docs/apidocs/io/vertx/spi/cluster/hazelcast/HazelcastClusterManager.html#getLockWithTimeout-java.lang.String-long-io.vertx.core.Handler-
There are also data structures available that are synchronized over the cluster:
http://vertx.io/docs/apidocs/io/vertx/spi/cluster/hazelcast/HazelcastClusterManager.html#getSyncMap-java.lang.String-
Use your database as the synchronization point. You could add a simple table that stores the last execution time in milliseconds. Each polling module first checks whether it is time to execute the next poll. If it executes the poll, it also updates the time. The check and the update have to happen in one transaction with an explicit lock on the time table.
Use Redis with its GETSET command (https://redis.io/commands/getset). You can store the time in milliseconds in a key and use GETSET to make sure that the update of the time is atomic. Only the polling module that managed to set the key in Redis then executes the poll.
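The database variant can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, and the table name `poll_state`, the function names, and the 10-second default are assumptions. In SQLite, `BEGIN IMMEDIATE` takes the write lock up front; in PostgreSQL the counterpart would be a `SELECT ... FOR UPDATE` on the row inside the transaction.

```python
import sqlite3


def init_schema(conn):
    # Single-row table holding the time of the last poll, in milliseconds.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS poll_state ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "last_run_ms INTEGER NOT NULL)"
    )
    conn.execute("INSERT OR IGNORE INTO poll_state (id, last_run_ms) VALUES (1, 0)")


def try_claim_poll(conn, now_ms, interval_ms=10_000):
    """Return True if this caller should run the poll now.

    The connection must be opened with isolation_level=None so that the
    transaction is controlled explicitly here.
    """
    conn.execute("BEGIN IMMEDIATE")  # lock before reading, so check+update is atomic
    try:
        (last_run_ms,) = conn.execute(
            "SELECT last_run_ms FROM poll_state WHERE id = 1"
        ).fetchone()
        due = now_ms - last_run_ms >= interval_ms
        if due:
            conn.execute(
                "UPDATE poll_state SET last_run_ms = ? WHERE id = 1", (now_ms,)
            )
        conn.execute("COMMIT")
        return due
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Every node would call `try_claim_poll` on its timer tick and run the query only when it returns True, so the nodes stay identical and any of them can take over if another one dies.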
I'm giving my naive solution here. I don't know whether it would completely solve your problem, but here is my thought process.
1) The polling bit: yes, you can indeed use a worker verticle for the blocking calls (or you could go async here too, IMHO, because you already have an async Postgres JDBC client) for the every-10-seconds part. A code snippet like this can help you:
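vertx.setPeriodic(10000, id -> {
    // This handler will get called every 10 seconds.
    // fetchFromJdbc() is a blocking call, so this belongs in a worker verticle.
    JsonObject jdbcObject = fetchFromJdbc();
    // eventBus is assumed to be the instance returned by vertx.eventBus().
    eventBus.publish("INTRESTED_PARTIES", jdbcObject);
});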
2) For the listening part, all the other verticles can subscribe to the event bus, listen on that address, and get the message whenever something is published.
3) As for ensuring that not all running instances of your jar start polling the database, I think the best way to handle it is not to deploy the verticle in any jar, but to run it standalone using the vertx command at runtime, like:
vertx run DatabasePoller.java -cluster
And if you really want to be fancy, you could throw in Service Discovery for the ensuring part, so that if the verticle's service is already registered, no other deployment triggers a registration.
But I want to give you a thumbs up for considering events for getting that information; that is a much better way of handling inter-system communication.
I am looking for a way to distribute jobs over SOAP-based Web-Services that can be randomly switched on and off on the Cloud, and can exist in one or several instances.
I went through the tutorials of Celery, and it seems a very interesting tool to distribute tasks.
However, in my case I don't have access to the hosts of the SOAP web services, so I can't add any extra services on them, and I can't turn them into "worker nodes" for Celery.
I thought I could maybe create "mirror" worker nodes (one per SOAP web service) on a machine that would act as an intermediary between the Celery client and the SOAP services.
My knowledge of Celery being limited, I wonder whether this is a good solution and what its limits would be.
I have read in the documentation that it is possible to tune the number of processes executed on a machine with:
CELERYD_CONCURRENCY
The default value being CELERYD_CONCURRENCY = number of CPUs
It seems to me that I could use this option on the "mirror workers", which would all sit on the same machine, with each mirror worker having a CELERYD_CONCURRENCY value corresponding to how many concurrent executions I would allow on its SOAP service.
Does this seem achievable with Celery, or is it very "hacky"?
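To illustrate the idea being asked about (one local "mirror" per remote service, each with its own concurrency cap), here is a small stand-in that uses only the Python standard library rather than Celery. The class `MirrorWorker`, the function `fake_soap_call`, and the service name are hypothetical; the `concurrency` argument plays the role that CELERYD_CONCURRENCY would play for a real worker.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class MirrorWorker:
    """Local stand-in for one remote SOAP service, with capped concurrency."""

    def __init__(self, service_name, concurrency):
        self.service_name = service_name
        # The pool size is the cap on simultaneous calls to this service.
        self._pool = ThreadPoolExecutor(max_workers=concurrency)
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0  # highest number of calls seen in flight

    def _run(self, call, *args):
        with self._lock:
            self._active += 1
            self.peak_active = max(self.peak_active, self._active)
        try:
            return call(*args)
        finally:
            with self._lock:
                self._active -= 1

    def submit(self, call, *args):
        # Returns a Future; extra jobs queue up until a slot is free.
        return self._pool.submit(self._run, call, *args)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def fake_soap_call(x):
    # Placeholder for the real remote call.
    time.sleep(0.01)
    return x * 2
```

With a cap of 2, submitting six jobs to `MirrorWorker("svc-a", 2)` never has more than two in flight at once; the rest simply wait their turn.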
I'm new to dotcloud, and am confused about how multiple services work together.
My YAML build file is:
www:
  type: python
db:
  type: postgresql
worker:
  type: python-worker
broker:
  type: rabbitmq
And my supervisord file contains commands to start django-celery & celerycam.
When I push my code out to my app, I can see that both the www & worker services start up their own instances of celery & celerycam, and, for example, their log files are different. This makes sense (although it isn't made very clear in the dotCloud documentation, IMO: the documentation talks about setting up a worker service, but not how to combine that with other services), but it does raise the question of how to configure an application where the python service mainly serves the web page, whilst the python-worker service handles background tasks, e.g. celery.
The dotCloud documentation on daemons mentions this:
"However, you should be aware that when you scale your application,
the cron tasks will be scheduled in all scaled instances – which is
probably not what you need! So in many cases, it will still be better
to use a separate service.
Similarly, a lot of (non-worker) services already run Supervisor, so
you can run additional background jobs in those services. Then again,
remember that those background jobs will run in multiple instances if
you scale your application. Moreover, if you add background jobs to
your web service, it will get less resources to serve pages, and your
performance will take a significant hit."
How do you configure dotcloud & your application to run just the webserver on one service, and background tasks on the worker service? Would you scale workers by increasing the concurrency setting in celery (and scaling the one service vertically), by adding extra worker services, or both?
Would you do this so that firstly the webserver service doesn't have to use resources in processing background tasks, and secondly so that you could scale the worker services independently of the webserver service?
There are two tricks.
First you could use different approots for your www and worker services to separate the code they will run:
www:
  type: python
  approot: frontend
  # ...
worker:
  type: python-worker
  approot: backend
  # ...
Second, since your postinstall script is different for each approot, you can copy a file out to become the correct supervisord.conf for that particular service.
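The second trick could look roughly like the helper below, called from each approot's postinstall script. The template file name `supervisord.worker.conf`, the function name, and the assumption that the final file belongs directly in the approot are all illustrative; check the dotCloud documentation for where your service actually reads supervisord.conf from.

```python
import shutil
from pathlib import Path


def install_supervisord_conf(approot, template_name):
    """Copy the service-specific template to supervisord.conf in the approot."""
    approot = Path(approot)
    source = approot / template_name
    target = approot / "supervisord.conf"
    if not source.is_file():
        raise FileNotFoundError(f"missing template: {source}")
    # Overwrites any existing supervisord.conf with this service's version.
    shutil.copyfile(source, target)
    return target
```

The worker's postinstall would call it with its own template, for example `install_supervisord_conf("backend", "supervisord.worker.conf")`, while the frontend approot either ships no celery entries or uses a different template.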
You may also want to look at the dotCloud tutorial and sample code for django-celery.
/Andy
I'm interested in using Celery for an app I'm working on. It all seems pretty straightforward, but I'm a little confused about what I need to do if I have multiple load-balanced application servers. All of the documentation assumes that the broker will be on the same server as the application. Currently, all of my application servers sit behind an Amazon ELB, and tasks need to be able to come from any one of them.
This is what I assume I need to do:
Run a broker server on a separate instance
Configure each application instance to connect to that broker server
Each application instance will also be a Celery worker (running celeryd)?
My only beef with that is: what happens if my broker instance dies? Can I run 2 broker instances somehow so I'm safe if one goes under?
Any tips or information on what to do in a setup like mine would be greatly appreciated. I'm sure I'm missing something or not understanding something.
For future reference, for those who do prefer to stick with RabbitMQ...
You can create a RabbitMQ cluster from 2 or more instances. Add those instances to your ELB and point your celeryd workers at the ELB. Just make sure you connect the right ports and you should be all set. Don't forget to allow your RabbitMQ machines to talk among themselves to run the cluster. This works very well for me in production.
One exception here: if you need to schedule tasks, you need a celerybeat process. For some reason, I wasn't able to connect the celerybeat to the ELB and had to connect it to one of the instances directly. I opened an issue about it and it is supposed to be resolved (didn't test it yet). Keep in mind that celerybeat by itself can only exist once, so that's already a single point of failure.
You are correct in all points.
To make the broker reliable, set up a clustered RabbitMQ installation, as described here:
http://www.rabbitmq.com/clustering.html
Celery beat also doesn't have to be a single point of failure if you run it on every worker node with:
https://github.com/ybrs/single-beat