I have a limited number of workers and an unlimited number of queues named by the mask "q.*" (e.g. q.1, q.2). I need to process them
in turn, one task per worker. When a worker finishes its task, it receives a new one from the next existing queue.
E.g. I have queues:
q.1: task11, task12, task13
q.2: task21, task22, task23
And three workers. I expect the following order of execution:
worker1: task11
worker2: task21
worker3: task12
worker1: task22
worker2: task13
worker3: task23
I tried to use a topic and subscribed to the mask q.*, but this results in each worker receiving tasks from all queues. What is the correct approach?
Think of each queue as its own bucket of work. q.1 has no relation to q.2 at all and in fact doesn't even know it exists. It may process things at different rates from q.2 and should have different consumers. A worker on q.1 should only be concerned about q.1, it shouldn't bounce back and forth between q.1 and q.2.
Are you trying to chain 2 queues together? If so you could have something like this:
Message gets put into q.1
Message is processed by a worker (call it worker1) of q.1
After worker1 acks the message it then inserts a new message into q.2
Message is processed by a worker (call it worker2) of q.2
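The chaining steps above can be sketched roughly like this in Python. This is only an in-memory illustration using the standard library's `queue.Queue` in place of a real broker; `run_chain`, the `STOP` sentinel and the `stage1`/`stage2` suffixes are hypothetical names made up for the example.

```python
import queue
import threading


def run_chain(messages):
    """Push messages through q.1 and then q.2, one worker per queue."""
    q1, q2 = queue.Queue(), queue.Queue()
    results = []
    STOP = object()  # hypothetical sentinel telling a worker to shut down

    def worker1():
        # Consumes only q.1; it never reads from q.2.
        while True:
            msg = q1.get()
            if msg is STOP:
                q2.put(STOP)  # propagate shutdown down the chain
                return
            processed = f"{msg}:stage1"
            q1.task_done()      # stands in for acking the message
            q2.put(processed)   # after the ack, insert a new message into q.2

    def worker2():
        # Consumes only q.2.
        while True:
            msg = q2.get()
            if msg is STOP:
                return
            results.append(f"{msg}:stage2")

    threads = [threading.Thread(target=worker1), threading.Thread(target=worker2)]
    for t in threads:
        t.start()
    for m in messages:
        q1.put(m)
    q1.put(STOP)
    for t in threads:
        t.join()
    return results
```

With a single worker per queue, messages leave q.2 in the same order they entered q.1.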
I want to configure a cluster with the following expected behavior:
The cluster must be HA (at least 3 nodes).
I have queues in which it is important to maintain processing order. The consumer always reads such a queue in a single thread. Once it has taken a message, we consider the task complete.
I don't need load balancing - it is important for me to maintain the order of messages.
I want to avoid split-brain.
If we have 3 nodes, then if 1 of the nodes fails, the cluster should continue to work.
I tried following configurations:
master + slave + slave with replication.
It works, but it does not solve the split-brain problem.
master + slave + slave + Pinger
As far as I understand, this does not give a 100% guarantee of detecting network problems, so we can still get split-brain.
3 pairs of live/backup nodes.
This solves the split-brain problem, but how can we avoid the following situation:
The producer sends a message to group A, to a queue where processing order matters.
Group A crashes (1/3 of all nodes, i.e. 2 of 6).
The message is stored in the journal of group A.
The cluster continues to work.
The producer sends a message to group B, to the same order-sensitive queue.
The consumer gets this message first, so the required message order is not preserved.
How should I build a cluster to solve these problems?
You can't achieve the behavior you want using replication. You need to use a shared store between the nodes. If you must use 3 nodes then I would recommend master + slave + slave. Otherwise I'd recommend master + slave.
Also, for what it's worth, replication is not synchronous within the broker. It is asynchronous and non-blocking. However, it is still reliable. For example, when a broker is configured for HA with replication and it receives a durable message from a client it will persist that message to disk and send it to the replicated backup concurrently without blocking. However, it will wait for both operations to finish before responding to the client that it has received the message. This allows much greater message throughput than using a synchronous architecture internally although the whole process will appear to be synchronous to external clients.
Also, it's worth noting that work is underway to change how replication works to make it more robust against split brain and to enable a single master + slave pair that is suitable for production use.
I'm trying to use Celery with Kubernetes. I'm using Redis as the message broker in a separate pod, and I have a pod for each Celery queue.
If I have 3 queues, I have 3 different pods (i.e. workers) that can accept and handle requests.
Everything is working fine so far, but my question is: what would happen if I cloned the pod of one of the queues, so that two pods serve a single queue?
I think the client (i.e. Django) creates a new message in Redis to send to the worker and start the job, but it's not clear to me what happens when two pods are listening to the same queue. Does the first pod accept the request, start the job, and prevent the other pod from accepting it?
(I tried to search a bit on the documentation of Celery to see if I can find any clues but I couldn't. That's why I'm asking this question)
I assume you are using the basic task type, which uses a 'direct' queue rather than a 'fanout' or 'topic' queue. The latter two behave quite differently and are not discussed here.
When Redis is the broker transport, celery/kombu uses a Redis list as the queue's storage, publishing messages with LPUSH and consuming them with BRPOP.
In short, BRPOP blocks the connection when there are no elements to pop from the given lists. If a list is not empty, an element is popped from its tail. The operation is guaranteed to be atomic, so no two connections can get the same element.
Celery leverages this feature to guarantee at-least-once message delivery. Use of acknowledgment doesn't affect this guarantee.
In your case, there are multiple Celery workers across multiple pods, but all of them are connected to the same Redis server and blocked on the same key, trying to pop an element from the same list. When a new message arrives, one and only one worker gets it.
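To illustrate the "one and only one worker" behaviour, here is a rough in-memory simulation in Python. It does not talk to Redis and is not how kombu is implemented: `FakeRedisList` is a hypothetical stand-in whose `lpush`/`brpop` methods only mimic the push-to-head, atomic-blocking-pop-from-tail semantics described above.

```python
import threading
from collections import deque


class FakeRedisList:
    """Hypothetical in-memory stand-in for a Redis list (not a Redis client)."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def lpush(self, value):
        # Publish: push to the head of the list and wake one blocked consumer.
        with self._cond:
            self._items.appendleft(value)
            self._cond.notify()

    def brpop(self, timeout):
        # Consume: block until an element exists, then pop from the tail.
        # The lock makes check-and-pop atomic, so no two callers get one element.
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout=timeout):
                return None  # timed out with an empty list
            return self._items.pop()


def run_workers(num_workers, messages):
    """Let several workers compete for the same list; return what each one got."""
    q = FakeRedisList()
    for m in messages:
        q.lpush(m)
    received = {i: [] for i in range(num_workers)}

    def worker(idx):
        while True:
            msg = q.brpop(timeout=0.2)
            if msg is None:
                return  # nothing left, stop this worker
            received[idx].append(msg)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Which worker gets which message varies from run to run, but every message ends up with exactly one worker.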
A task message is not removed from the queue until that message has been acknowledged by a worker. A worker can reserve many messages in advance and even if the worker is killed – by power failure or some other reason – the message will be redelivered to another worker.
More: http://docs.celeryproject.org/en/latest/userguide/tasks.html
The two workers (pods) will receive tasks and complete them independently. It's like having a single pod that processes tasks at twice the speed.
I'm working on the design of a distributed system. The system consists of multiple producers, a distributed queue, and multiple consumers (workers).
Worker instances reside in datacentres in different locations. Sometimes one location is manually disconnected.
In that case, the problem is that a worker in the disconnected location has taken a task from the queue and then shuts down before completing it. I want:
workers from a live location to be able to pick up such a task and complete it
a disconnected worker, when it finally comes back up, to determine whether the task was already completed by another worker and decide what to do with it
What is a convenient way to solve such an issue?
This design might help you. Every time a worker consumes a task, move the task from the queue to a separate distributed list of consumed tasks. In this list, keep a timestamp with every task.
Then the worker that consumed the task should send some kind of still-alive message every second or so (similar to Hadoop's heartbeat message) that updates the task's timestamp in the consumed-tasks list. This indicates that the worker that consumed the task is still alive and was heard from recently.
Now, implement a daemon that monitors the consumed-tasks list and moves tasks whose timestamp is older than a threshold number of seconds (chosen to allow for message loss) back to the queue.
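A minimal single-process sketch of this design in Python might look like the following. Everything here is an assumption for illustration: `TaskTracker` and its method names are hypothetical, the "distributed" queue and list are plain in-memory structures, and recording the owning worker is an addition of mine so that a worker coming back late can tell its task was handed to someone else.

```python
import time
from collections import deque


class TaskTracker:
    """Hypothetical in-memory model of the queue plus consumed-tasks list."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.pending = deque()   # the task queue
        self.in_flight = {}      # task_id -> (worker_id, last heartbeat time)
        self.timeout = timeout_seconds
        self.clock = clock       # injectable so the logic can be tested

    def submit(self, task_id):
        self.pending.append(task_id)

    def consume(self, worker_id):
        # Move a task from the queue to the consumed list with a timestamp.
        if not self.pending:
            return None
        task_id = self.pending.popleft()
        self.in_flight[task_id] = (worker_id, self.clock())
        return task_id

    def heartbeat(self, task_id, worker_id):
        # Refresh the timestamp; False means this worker no longer owns the task.
        owner = self.in_flight.get(task_id)
        if owner is None or owner[0] != worker_id:
            return False
        self.in_flight[task_id] = (worker_id, self.clock())
        return True

    def complete(self, task_id, worker_id):
        # Only the current owner may complete the task.
        owner = self.in_flight.get(task_id)
        if owner is None or owner[0] != worker_id:
            return False
        del self.in_flight[task_id]
        return True

    def requeue_stale(self):
        # What the monitoring daemon would run periodically.
        now = self.clock()
        stale = [t for t, (_, ts) in self.in_flight.items() if now - ts > self.timeout]
        for task_id in stale:
            del self.in_flight[task_id]
            self.pending.append(task_id)
        return stale
```

A worker that comes back after a disconnect can call `heartbeat` first; a `False` result tells it the task was requeued or finished elsewhere, and it can then decide whether to drop its partial work.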
I'm developing a distributed system that consists of master and worker servers. There should be 2 kinds of messages:
Heartbeat
The master gets the state of a worker and responds immediately with an appropriate command. For instance:
Message from Worker to Master: "Hey there! I have data a,b,c"
Response from Master to Worker: "All ok, but throw away c - we don't need this anymore"
The participants exchange these messages at interval T.
Direct master command
Let's say a client asks the master to kill job #123. Here is the conversation:
Message from Master to Worker: "Alarm! We need to kill job #123"
Message from Worker to Master: "No problem! Done."
Obviously, we can't predict when this message will appear.
The simplest solution is for the master to initiate all communication for both message types (for heartbeats, we would add one more message from the master to start the exchange). But let's assume that it is expensive to do all the heartbeat housekeeping on the master side for N workers. We also don't want to waste resources keeping several TCP connections to each worker server, so we have just one.
Is there a solution that satisfies these constraints?
First off, you have to do some bookkeeping somewhere. Otherwise, who's going to realize that a worker has died? The natural place to put that data is on the master, if you're building a master/worker system. Otherwise, the workers could be asked to keep track of each other in a long circle, or a randomized graph. If a worker notices that their accountabilibuddy is not responding anymore, it can alert the master.
Same thing applies to the list of jobs currently running; who keeps track of that? It also scales O(n), so presumably the master doesn't have space for that either. Sharding that data out among the workers (e.g. by keeping track of what things their accountabilibuddy is supposed to be doing) only works so far; if a and b both crash, and a is the only one looking after b, you just lost the list of jobs running on b (and possibly the alert that was supposed to notify you that b crashed).
I'd recommend a distributed consensus algorithm for this kind of task. For production, use something someone else has already written; they probably know what they're doing. If it's for learning purposes, which I presume, have a look at the Raft consensus algorithm. It's not too hard to understand, but still highlights a lot of the complexity in distributed systems. The simulator is gold for proper understanding.
A master/worker system will never properly work with less than O(n) resources for n workers in the face of crashing workers. By definition, the master needs to control the workers, which is an O(n) job, even if some workers manage other workers. Also, what happens if the master crashes?
As Filip Haglund said, read the Raft paper; you should also implement it yourself. In a nutshell, here is what you need to take from it with regard to membership management.
You need to keep the membership list and the master's identity on all nodes.
Raft sends heartbeats from the master's end. This is not very expensive network-wise, and you don't need to keep the connections open. The master sends a heartbeat every 200 ms to 1 second; if a member doesn't reply, the master tells the slaves to remove member x from the list.
What if the master dies? Basically, you need to preset candidate nodes. If a candidate hasn't received a heartbeat within the timeout, it requests votes from the rest of the cluster. If it gets even the slightest majority, it becomes the new leader.
Joining an existing cluster works basically the same way as above: a node that is not the leader responds "not leader" along with the leader's address.
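The two checks a candidate makes in the description above (has the heartbeat timed out, and did I get a majority) can be written down in a few lines of Python. This is a deliberately tiny illustration and nowhere near a Raft implementation; the function names and the convention that the candidate's own vote is included in `votes_received` are assumptions for the example.

```python
def heartbeat_timed_out(last_heartbeat, now, timeout):
    """True if no heartbeat from the master arrived within the timeout."""
    return now - last_heartbeat > timeout


def has_majority(votes_received, cluster_size):
    """True if the votes (including the candidate's own) are a strict majority."""
    return votes_received > cluster_size // 2


def next_role(last_heartbeat, now, timeout, votes_received, cluster_size):
    """Decide what a preset candidate node should do next."""
    if not heartbeat_timed_out(last_heartbeat, now, timeout):
        return "follower"    # master is alive, nothing to do
    if has_majority(votes_received, cluster_size):
        return "leader"      # even the slightest majority is enough
    return "candidate"       # keep requesting votes
```

For a 5-node cluster, 3 votes are enough; for a 4-node cluster, 2 votes are not, which is why odd cluster sizes are the usual choice.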
I am planning to write an application which will have distributed Worker processes. One of them will be the Leader, which will assign tasks to the other processes. Designing the Leader election process is quite simple: each process tries to create an ephemeral node at the same path. Whoever is successful becomes the leader.
Now, my question is how to design the process of distributing the tasks evenly? Any recipe for this?
I'll elaborate a little on the environment setup:
Suppose there are 10 worker machines, each of which runs a process, and one of them becomes the leader. Tasks are submitted to the queue; the Leader takes them and assigns them to a worker. The worker processes get notified whenever a task is submitted.
I am not sure I understand your algorithm for Leader election, but the recommended way of implementing this is to use sequential ephemeral nodes and use the algorithm at http://zookeeper.apache.org/doc/r3.3.3/recipes.html#sc_leaderElection which explains how to avoid the "herd" effect.
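As I understand that recipe, each process creates a sequential ephemeral znode, the one with the lowest sequence number is the leader, and every other process watches only the znode just below its own, so a leader failure wakes a single process instead of the whole herd. The Python sketch below only models that decision logic on a list of names; it makes no ZooKeeper calls, and `election_roles` and the `n_<sequence>` naming are assumptions for illustration.

```python
def sequence_number(znode_name):
    """Extract the numeric suffix ZooKeeper appends to a sequential znode."""
    return int(znode_name.rsplit("_", 1)[1])


def election_roles(znodes):
    """Return (leader, watches) for the children of the election path.

    watches maps each non-leader znode to the single znode it should watch:
    the one with the next lower sequence number.
    """
    ordered = sorted(znodes, key=sequence_number)
    if not ordered:
        return None, {}
    leader = ordered[0]
    watches = {ordered[i]: ordered[i - 1] for i in range(1, len(ordered))}
    return leader, watches
```

When a watched znode disappears, its watcher re-reads the children and runs the same logic again; it becomes leader only if its own znode is now the lowest.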
Distribution of tasks can be done with a simple distributed queue and does not strictly need a Leader. The producer enqueues tasks and consumers keep a watch on the tasks node - a triggered watch will lead the consumer to take a task and delete the associated znode. There are certain edge conditions to consider with requeuing tasks from failed consumers. http://zookeeper.apache.org/doc/r3.3.3/recipes.html#sc_recipes_Queues
I would recommend the section Example: Master-Worker Application of this book ZooKeeper Distributed Process Coordination http://shop.oreilly.com/product/0636920028901.do
The example demonstrates how to distribute tasks to workers using znodes and common ZooKeeper commands.
Consider using an actor singleton service pattern. For example, in Scala there is Akka which solves this class of problem with less code.