Using Celery with multiple workers in different pods - kubernetes

What I'm trying to do is use Celery with Kubernetes. I'm using Redis as the message broker in a separate pod, and I have one pod per Celery queue.
For example, if I have 3 queues, I would have 3 different pods (i.e. workers) that can accept and handle the requests.
Everything is working fine so far, but my question is: what would happen if I cloned the pod of one of the queues, so that a single queue had two pods?
I understand that the client (i.e. Django) publishes a new message to Redis to send to a worker and start the job, but it's not clear to me what happens when two pods are listening to the same queue. Does the first pod accept the request and start the job, preventing the other pod from accepting it?
(I searched the Celery documentation for clues but couldn't find any, which is why I'm asking this question.)

I guess you are using the basic task type, which employs a 'direct' queue, not a 'fanout' or 'topic' queue; the latter two behave quite differently and won't be discussed here.
When Redis is used as the broker transport, celery/kombu use a Redis list object as the storage for a queue (source): messages are published with the LPUSH command and consumed with BRPOP.
In short, BRPOP (doc) blocks the connection while there are no elements to pop from the given lists; once a list is non-empty, an element is popped from its tail. The operation is guaranteed to be atomic, so no two connections can receive the same element.
Celery leverages this feature to guarantee at-least-once message delivery; the use of acknowledgments does not affect this guarantee.
In your case there are multiple Celery workers across multiple pods, but all of them connect to the same Redis server and all of them block on the same key, trying to pop an element from the same list object. When a new message arrives, one and only one worker will get it.
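To make that concrete, here is a minimal sketch using the redis-py client (the queue name and worker names are made up) showing that two consumers blocked on the same list never receive the same element:

```python
import threading
import redis

r = redis.Redis()   # assumes a local Redis instance; adjust host/port for your setup
QUEUE = "celery"    # hypothetical queue (list) name

def consumer(name):
    # BRPOP blocks until an element is available, then pops it atomically,
    # so two blocked consumers can never receive the same element.
    key, message = r.brpop(QUEUE)
    print(f"{name} got {message!r}")

# Two "workers" blocked on the same list, like two pods serving one queue.
threads = [threading.Thread(target=consumer, args=(f"worker-{i}",)) for i in (1, 2)]
for t in threads:
    t.start()

# The "client" publishes two messages; each is delivered to exactly one consumer.
r.lpush(QUEUE, b"task-1")
r.lpush(QUEUE, b"task-2")
for t in threads:
    t.join()
```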

A task message is not removed from the queue until that message has been acknowledged by a worker. A worker can reserve many messages in advance and even if the worker is killed – by power failure or some other reason – the message will be redelivered to another worker.
More: http://docs.celeryproject.org/en/latest/userguide/tasks.html
The two workers (pods) will receive tasks and complete them independently. It's like having a single pod, but processing tasks at twice the speed.
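As a hedged illustration of the acknowledgment behaviour quoted above (the app name and broker URL are placeholders), enabling late acknowledgment keeps the message on the queue until the task has actually run, so a killed pod's task can be redelivered:

```python
from celery import Celery

# Hypothetical app name and Redis broker URL; adjust to your deployment.
app = Celery("proj", broker="redis://redis-service:6379/0")

# Acknowledge the message *after* the task finishes instead of when it is
# received, so the broker can redeliver it if the worker pod dies mid-task.
app.conf.task_acks_late = True

@app.task
def add(x, y):
    return x + y
```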

Related

What's the point of having a single celery worker with multiple queues?

Continuing from "How does a Celery worker consuming from multiple queues decide which to consume from first?":
I've set up a single worker and have it listen to two queues. I understand from the question linked above that the worker consumes messages from those two queues in round-robin order or in the order they arrived (depending on the Celery version).
So what's the purpose of this setting? Why is it different from a single queue? Is it helpful only for monitoring, or is there an operational benefit I'm missing here?
In most scenarios you will have your worker subscribed to only a single queue; however, there are scenarios where the ability to subscribe to multiple queues makes sense.
Here is one. Imagine you have a Celery cluster of 10 machines. They perform various tasks, and among them is a task that downloads files from a remote file server. However, the owner of the file server has whitelisted only two of your 10 machine IPs, so only those two can download files from that particular server. Typically you would have the Celery workers on those two machines subscribe to an additional queue, called "download" for example, and schedule download tasks by sending them to the "download" queue.
This is a very common scenario where a subset of your nodes can do a particular thing (access remote servers - file servers, database servers, etc.).
One could argue, "why not have just the 'download' queue on these two machines?" - but that could waste resources, since those machines would sit idle whenever there are no download tasks to run.
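A hedged sketch of that setup, assuming a hypothetical project called proj with a download_file task; the task_routes setting and the worker's -Q option are standard Celery features:

```python
from celery import Celery

# Hypothetical app name and broker URL.
app = Celery("proj", broker="redis://localhost:6379/0")

# Route only the download task to the extra "download" queue;
# everything else goes to the default "celery" queue.
app.conf.task_routes = {
    "proj.tasks.download_file": {"queue": "download"},
}

@app.task(name="proj.tasks.download_file")
def download_file(url):
    ...  # fetch the file from the remote file server

# On the two whitelisted machines, start workers listening to both queues:
#   celery -A proj worker -Q celery,download
# On the other eight machines, listen only to the default queue:
#   celery -A proj worker -Q celery
```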

FlowFiles stuck in the queue in NiFi Cluster

I am currently running NiFi 1.9.2 in a clustered environment with 3 nodes. Recently what I have noticed is that the flow seems to get stuck. The queue shows that there are items in the queue, but nothing is going to the downstream processor. When I list the items in the queue, I get "The queue has no FlowFiles".
The queue in this case is set to load balance with round robin. If I stop the downstream processor, change the configuration on the queue to not load balance, and then switch it back to round robin again, the queue items distribute to the other two nodes, and I can see the flow files when I list the items in the queue. However, it only shows items as being on two of the nodes. When I restart the downstream processor, 2/3 of the items get processed, leaving the 1/3 that would be on the node whose queue items I cannot see. This behavior persists even after restarting the cluster service.
If I change the queue to not load balance, then everything seems to get put on a good node and the queue gets emptied. So it looks like there might be something wrong on my first node.
Any suggestions on what to try?
Thanks,
-tj
You should check disk usage. If the usage of the disk NiFi is located on is equal to or higher than the nifi.content.repository.archive.max.usage.percentage setting in the nifi.properties file, you may see this kind of strange NiFi behavior. If that is your situation, you can try deleting old NiFi log files.
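For reference, a hedged example of the property in question (the value shown is only illustrative; check what your own nifi.properties contains and compare it against the actual usage of that disk):

```
# nifi.properties -- the 50% shown is only an example value
nifi.content.repository.archive.max.usage.percentage=50%
```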

Communication protocol

I'm developing a distributed system that consists of master and worker servers. There should be 2 kinds of messages:
Heartbeat
The master gets the state of a worker and responds immediately with an appropriate command. For instance:
Message from Worker to Master: "Hey there! I have data a,b,c"
Response from Master to Worker: "All ok, but throw away c - we don't need it anymore"
The participants exchange these messages at an interval T.
Direct master command
Let's say a client asks the master to kill job #123. Here is the conversation:
Message from Master to Worker: "Alarm! We need to kill job #123"
Message from Worker to Master: "No problem! Done."
Obviously, we can't predict when this message will appear.
The simplest solution is for the master to initiate all communication for both message types (in the case of the heartbeat we would add one more message from the master to start the exchange). But let's assume it is expensive to do all the heartbeat housekeeping on the master side for N workers, and we don't want to waste resources keeping several TCP connections open to the worker servers, so we have just one.
Is there any solution for these constraints?
First off, you have to do some bookkeeping somewhere. Otherwise, who's going to realize that a worker has died? The natural place to put that data is on the master, if you're building a master/worker system. Otherwise, the workers could be asked to keep track of each other in a long circle, or a randomized graph. If a worker notices that their accountabilibuddy is not responding anymore, it can alert the master.
The same thing applies to the list of jobs currently running; who keeps track of that? It also scales O(n), so presumably the master doesn't have space for that either. Sharding that data out among the workers (e.g. by keeping track of what their accountabilibuddy is supposed to be doing) only works so far; if a and b crash, and a is the only one looking after b, you just lost the list of jobs running on b (and possibly the alert that was supposed to notify you that b crashed).
I'd recommend a distributed consensus algorithm for this kind of task. For production, use something someone else has already written; they probably know what they're doing. If it's for learning purposes, which I presume, have a look at the Raft consensus algorithm. It's not too hard to understand, but it still highlights a lot of the complexity in distributed systems. The simulator is gold for proper understanding.
A master/worker system will never work properly with less than O(n) resources for n workers in the face of crashing workers. By definition, the master needs to control the workers, which is an O(n) job, even if some workers manage other workers. Also, what happens if the master crashes?
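To make the bookkeeping point concrete, here is a minimal sketch (class name, message shape and timeout are all made up) of the O(n) state a master keeps in order to notice dead workers from their heartbeats:

```python
import time

HEARTBEAT_TIMEOUT = 3.0  # seconds; purely illustrative

class Master:
    def __init__(self):
        # O(n) bookkeeping: last heartbeat time and last reported data per worker.
        self.last_seen = {}
        self.worker_data = {}

    def on_heartbeat(self, worker_id, data):
        self.last_seen[worker_id] = time.monotonic()
        self.worker_data[worker_id] = data
        # Respond immediately with a command, e.g. "throw away c".
        return {"drop": [item for item in data if item == "c"]}

    def dead_workers(self):
        now = time.monotonic()
        return [w for w, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]
```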
As Filip Haglund said, read the Raft paper; you should also implement it yourself. However, in a nutshell, here is what you would need to extract from it with regard to membership management.
You need to keep the membership list and the master's identity on all nodes.
Raft sends its heartbeats from the master's side; this is not very expensive network-wise, and you don't need to keep the connections open. Every 200 ms to a second you send a heartbeat; if a node doesn't reply, the master tells the other nodes to remove member x from the list.
But what if the master dies? Basically, you need preset candidate nodes. If a candidate hasn't received a heartbeat within the timeout, it requests votes from the rest of the cluster. If it gets even the slightest majority, it becomes the new leader.
If you want to join an existing cluster, it's basically the same as above: if the node you contact is not the leader, it responds "not leader" along with the leader's address.
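A rough sketch of the follower-side timeout-and-vote logic described above (heavily simplified compared to real Raft; all names and timeouts are illustrative):

```python
import random
import time

class Node:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers              # other nodes we can request votes from
        self.leader_id = None
        self.last_heartbeat = time.monotonic()
        # Randomized timeout so candidates don't all start elections at once.
        self.election_timeout = random.uniform(1.0, 2.0)

    def on_heartbeat(self, leader_id):
        self.leader_id = leader_id
        self.last_heartbeat = time.monotonic()

    def maybe_start_election(self):
        if time.monotonic() - self.last_heartbeat < self.election_timeout:
            return False                # leader is still alive
        # Timed out: request votes; a simple majority makes us the new leader.
        votes = 1 + sum(1 for peer in self.peers if peer.grant_vote(self.node_id))
        if votes > (len(self.peers) + 1) // 2:
            self.leader_id = self.node_id
            return True
        return False

    def grant_vote(self, candidate_id):
        # Real Raft also checks terms and log freshness; this sketch just agrees.
        return True
```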

Queue processing one by one using RabbitMQ

I have a limited number of workers and an unlimited number of queues named by the mask "q.*" (e.g. q.1, q.2). I need to process them in turn, one task per worker. When a worker finishes its task, it should receive a new one from the next existing queue.
E.g. I have queues:
q.1: task11, task12, task13
q.2: task21, task22, task23
And three workers. I expect next order of executing:
worker1: task11
worker2: task21
worker3: task12
worker1: task22
worker2: task13
worker3: task23
I tried using a topic exchange and subscribing to the mask q.*, but that led to each worker receiving tasks from all queues. What is the correct approach?
Think of each queue as its own bucket of work. q.1 has no relation to q.2 at all and, in fact, doesn't even know it exists. It may process things at a different rate from q.2 and should have different consumers. A worker on q.1 should only be concerned with q.1; it shouldn't bounce back and forth between q.1 and q.2.
Are you trying to chain two queues together? If so, you could have something like this (see the sketch after these steps):
Message gets put into q.1
Message is processed by a worker (call it worker1) of q.1
After worker1 acks the message it then inserts a new message into q.2
Message is processed by a worker (call it worker2) of q.2
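A hedged sketch of that chaining with the pika client (the queue names q.1 and q.2 come from the question; the connection details and the work done are illustrative):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="q.1")
channel.queue_declare(queue="q.2")

# Deliver at most one unacknowledged message per worker, so a free worker
# picks up the next task instead of one worker hoarding them.
channel.basic_qos(prefetch_count=1)

def handle_q1(ch, method, properties, body):
    result = body.upper()               # placeholder for the real work on q.1
    ch.basic_ack(delivery_tag=method.delivery_tag)
    # Chain: after acking, insert a follow-up message into q.2.
    ch.basic_publish(exchange="", routing_key="q.2", body=result)

channel.basic_consume(queue="q.1", on_message_callback=handle_q1)
channel.start_consuming()               # this is worker1; worker2 would consume q.2
```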

How to design task distribution with ZooKeeper

I am planning to write an application that will have distributed worker processes. One of them will be the Leader, which will assign tasks to the other processes. Designing the leader election process is quite simple: each process tries to create an ephemeral node at the same path, and whoever succeeds becomes the leader.
Now, my question is: how should I design the process of distributing the tasks evenly? Any recipe for this?
I'll elaborate a little on the environment setup:
Suppose there are 10 worker machines, each running a process, and one of them becomes the leader. Tasks are submitted to a queue; the Leader takes them and assigns them to a worker. The worker processes get notified whenever a task is submitted.
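For reference, a hedged sketch of the election approach described above using the kazoo Python client (the path and connection string are illustrative); the answer below points to a more robust sequential-node recipe:

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="127.0.0.1:2181")   # illustrative connection string
zk.start()

def try_become_leader(my_id: str) -> bool:
    # Every process races to create the same ephemeral node; exactly one wins.
    # The node disappears automatically if the winner's session dies.
    try:
        zk.create("/election/leader", my_id.encode(), ephemeral=True, makepath=True)
        return True
    except NodeExistsError:
        return False
```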
I am not sure I understand your algorithm for leader election, but the recommended way of implementing it is to use sequential ephemeral nodes and the algorithm at http://zookeeper.apache.org/doc/r3.3.3/recipes.html#sc_leaderElection, which explains how to avoid the "herd" effect.
Distribution of tasks can be done with a simple distributed queue and does not strictly need a Leader. The producer enqueues tasks, and consumers keep a watch on the tasks node: a triggered watch leads a consumer to take a task and delete the associated znode. There are certain edge conditions to consider around requeuing tasks from failed consumers. http://zookeeper.apache.org/doc/r3.3.3/recipes.html#sc_recipes_Queues
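A hedged sketch of that queue recipe using the kazoo Python client (paths and connection string are illustrative; kazoo also ships ready-made queue recipes you may prefer):

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NoNodeError

zk = KazooClient(hosts="127.0.0.1:2181")   # illustrative connection string
zk.start()
zk.ensure_path("/tasks")

def enqueue(payload: bytes):
    # Producer: sequential znodes give the queue its ordering.
    zk.create("/tasks/task-", payload, sequence=True)

def take_one():
    # Consumer: claim the lowest-numbered task by deleting its znode;
    # only one consumer can delete it, the others move on to the next task.
    for name in sorted(zk.get_children("/tasks")):
        path = f"/tasks/{name}"
        try:
            data, _ = zk.get(path)
            zk.delete(path)
            return data
        except NoNodeError:
            continue                    # another consumer claimed it first
    return None
```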
I would recommend the section "Example: Master-Worker Application" of the book ZooKeeper: Distributed Process Coordination: http://shop.oreilly.com/product/0636920028901.do
The example demonstrates how to distribute tasks to workers using znodes and common ZooKeeper commands.
Consider using an actor singleton service pattern. For example, in Scala there is Akka, which solves this class of problem with less code.