Does Celery task code need to exist on multiple machines?

Trying to wrap my head around Celery. So far I understand that there is the Client, Broker, and Worker(s). For simplicity, I'm looking at a Dockerized version of this setup.
If I understand correctly, the client enqueues a task on the broker, then the worker continuously attempts to pop from the broker and process the task.
In the example, it seems like both the Client (in this case a Flask app) and the Worker reference the exact same code. Is that correct?
Does this mean that if each of the components were broken out onto its own machine, the code would need to be deployed to both the Client and Worker machines at the same time? It seems strange that these pieces would need access to the same task/function to do their respective work.

This is one thing I was initially confused by as well. The tasks' code doesn't have to be present on both sides; only the worker needs it to do the actual work. When you say the client enqueues the task on the broker and the worker then executes it, it's crucial to understand what actually travels. The client only sends a message to the broker, not the task itself. The message contains the task name, its arguments and some metadata. So the client only needs to know those parameters to enqueue the task, and it can use send_task to do so without ever importing the task's code.
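For illustration, here is a minimal client-side sketch using send_task; the broker URL and the task name "tasks.resize_image" are made-up placeholders, and only the name has to match whatever the worker has registered:

from celery import Celery

# The client only needs the broker address and the task *name*;
# the task's code does not have to be importable here.
app = Celery(broker="redis://localhost:6379/0")  # placeholder broker URL

# "tasks.resize_image" is a hypothetical task name registered on some worker.
result = app.send_task("tasks.resize_image",
                       args=["/images/photo.png"],
                       kwargs={"width": 640})
print(result.id)  # the message is on the broker; a worker that has the code will run it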
This is how I employ Celery in a simple jobber application where I want to decouple the pieces as much as possible. I have a Flask app that serves as a UI for the jobs, which users can manage (create, see the state/progress, etc.). The Flask app uses APScheduler to actually run the jobs (where a job is nothing else than a Celery task). Now, I want the client part (Flask app + scheduler) to know as little as possible about the tasks in order to run them as jobs: that means their names and the arguments they take. To make it really independent of the tasks' code, I get this information from the workers via the broker. You can see a little bit more background in the issue I initially created.
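If it helps, that kind of lookup can be done with Celery's remote inspection API; a rough sketch (the broker URL is again a placeholder):

from celery import Celery

app = Celery(broker="redis://localhost:6379/0")  # placeholder broker URL

# Ask the running workers, via the broker, which task names they have registered.
# Returns {worker_name: [task names]}, or None if no worker replies in time.
registered = app.control.inspect().registered()
print(registered)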

Related

Distributed systems with a large number of different types of jobs

I want to create a distributed system that can support around 10,000 different types of jobs. A single machine can host only 500 such jobs, as each job needs some data pre-loaded into memory that can't be kept in a cache. Each job must have redundancy for availability.
I have explored open-source libraries like ZooKeeper and Hadoop, but none of them solves my problem.
The easiest solution I can think of is to maintain a map from each job type to the machine hosting it. But how can I support dynamic allocation of job types across my fleet? And how do I handle machine failures so that each job type remains available on at least one machine at any point in time?
Based on the answers you gave in the comments, I propose going for an MQ-based (Message Queue) architecture. What I propose in this answer is to:
Get the input from users and push it into a distributed message queue. That means setting up a message queue (such as ActiveMQ or RabbitMQ) on several servers. The MQ technology helps you replicate the input requests for fault tolerance. It also gives you a fully asynchronous end-to-end system.
After preparing this MQ layer, you can set up your computing server layer. This means some computing servers (~20 servers in your case) read requests from the message queue and start a job based on each request. Because the MQ is distributed, you get a good level of load balancing across your computing servers. In addition, each server can run as many jobs as you want (~500 in your case) based on the requests it reads from the MQ.
Regarding failures, a computing server should only acknowledge (remove) a message from the MQ once its job is completed. If a server crashes, the job is still in the MQ and another server can pick it up. If the job saves state somewhere or updates something, you will have to handle duplicate runs yourself.
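As a rough sketch of that acknowledge-only-after-completion idea with RabbitMQ and the pika client (the host, queue name and run_job body are placeholders):

import json
import pika

def run_job(job):
    # Placeholder for the actual work triggered by the request.
    print("processing", job)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="mq-host"))
channel = connection.channel()
channel.queue_declare(queue="jobs", durable=True)  # survive broker restarts

def handle(ch, method, properties, body):
    run_job(json.loads(body))
    # Acknowledge only after the job finished; if this server crashes first,
    # the un-acked message goes back to the queue for another server.
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)  # take one job at a time per consumer
channel.basic_consume(queue="jobs", on_message_callback=handle)
channel.start_consuming()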
The good point about this approach is that it is very scalable: if in the future you have more jobs to handle, you can add a computing server, connect it to the MQ, and process more requests without any change to the rest of the system. In addition, MQ features like priority-based queuing help you prioritize requests and process them based on job type.
P.S. Your question does not provide many details about the type and parameters of the system, so this is only a draft solution. If you provide more details, the community may be able to help you more.

The right way to call a fire-and-forget method on a Service Fabric service

I have a method on ServiceA that I need to call from ServiceB. The method takes upwards of 5 minutes to execute and I don't care about its return value. (Output from the method is handled another way)
I have setup my method in IServiceA like this:
[OneWay]
Task LongRunningMethod(int param1);
However, that doesn't appear to work; I am getting System.TimeoutException: "This can happen if message is dropped when service is busy or its long running operation and taking more time than configured Operation Timeout."
One choice is to increase the timeout, but it seems that there should be a better way.
Is there?
For fire-and-forget or long-running operations, the best solution is to use a message bus as middleware that handles the dependency between both processes.
To do what you want without middleware, your caller would have to worry about many things: timeouts (as in your case), delivery guarantees (confirmation), service availability, exceptions, and so on.
With the middleware, the only thing your application logic needs to worry about is the delivery guarantee; the rest is handled by the middleware and the receiver (there is a rough sketch of the pattern after the list of options below).
There are many options, like:
Azure Service Bus
Azure Storage Queue
MSMQ
Event Hub
and so on.
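Your services are .NET, but just to sketch the send/receive/complete shape of the pattern, here is a rough Python example with the azure-servicebus SDK (the .NET SDK follows the same shape); the connection string and queue name are placeholders, and for a 5-minute job you would also need a long enough lock duration or periodic lock renewal:

from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN_STR = "<service-bus-connection-string>"  # placeholder
QUEUE = "long-running-jobs"                   # placeholder queue name

# Caller (ServiceB): fire and forget -- drop a message and return immediately.
with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_queue_sender(QUEUE) as sender:
        sender.send_messages(ServiceBusMessage('{"param1": 42}'))

# Worker (ServiceA): pull messages and run the long job at its own pace.
with ServiceBusClient.from_connection_string(CONN_STR) as client:
    with client.get_queue_receiver(QUEUE) as receiver:
        for msg in receiver:
            print("running job with", str(msg))  # do the 5-minute work here
            receiver.complete_message(msg)       # remove from the queue once done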
I would not recommend the SF communication, Task.Run(), or thread workarounds that many places suggest, because they will just bring you extra work and won't run as smoothly as the middleware approach.

How to submit "tasks" in parallel on a server

First, happy new year to everybody and happy coding for 2017.
I have 1M "tasks" to run using Python. Each task takes around 2 minutes and processes some local images. I would like to run as many as possible in parallel, in an automatic way. My server has 40 cores, so I started looking at multiprocessing, but I see the following issues:
Keeping a log of each task is not easy (I am working on it, but so far I haven't succeeded even though I found many examples on Stack Overflow).
How do I know how many CPUs I should use, and how many should be left to the server for basic server tasks?
When we have multiple users on the server, how can we see how many CPUs are already in use?
In my previous life as a physicist at CERN, we used a job submission system to submit tasks to many clusters. Tasks were put in a queue and processed when a slot was available. Is there such a tool for a Linux server as well? I don't know the correct English name for such a tool (job dispatcher?).
Ideally it would be a tool that we can configure to use N of our CPUs as "vehicles" to process tasks in parallel (while keeping enough CPUs free so that the server can still run basic tasks), put the jobs of all users in a priority queue, and process them as "vehicles" become available. A bonus would be a way to monitor task processing.
I hope I am using the correct word to describe what I want.
Thanks
Fabien
What you are talking about is generally referred to as a "pool of workers". It can be implemented using threads or processes; the implementation choice depends on your workflow.
A pool of workers allows you to choose the number of workers to use. Furthermore, the pool usually has a queue in front of the workers to decouple them from your main logic.
If you want to run tasks within a single server, then you can either use multiprocessing.Pool or concurrent.futures.Executor.
If you want to distribute tasks over a cluster, there are several solutions. Celery and Luigi are good examples.
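For the single-server case, here is a minimal sketch with concurrent.futures that also covers the "leave some cores free" and per-task logging points; the image list and the process_image body are stand-ins:

import logging
import os
from concurrent.futures import ProcessPoolExecutor, as_completed

logging.basicConfig(filename="tasks.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def process_image(path):
    # Stand-in for the ~2 minute image-processing task.
    return "done: " + path

if __name__ == "__main__":
    images = ["img_%d.png" % i for i in range(1000)]    # stand-in for the real task list
    workers = max(1, (os.cpu_count() or 1) - 4)         # leave a few cores for the server
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process_image, p): p for p in images}
        for fut in as_completed(futures):
            logging.info("%s -> %s", futures[fut], fut.result())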
EDIT:
This is not your concern as a user. Modern operating systems do a pretty good job of sharing resources between multiple users. If overcommitting resources becomes a concern, the sysadmin should make sure this does not happen by assigning quotas per user. This can be done in plenty of ways; an example tool sysadmins should be familiar with is ulimit.
To put it in other words: your software should not do what operating systems are for, namely abstracting the underlying machine to offer your software a "limitless" set of resources. Whoever manages the server should be the person telling you: "use at most X CPUs".
Probably, what you were using at CERN was a system like Mesos. These solutions aggregate large clusters into a single set of resources against which you can schedule tasks. This only works if all the users access the cluster through it, though.
If you are sharing a server with other people, either you agree together on the quotas or you all adopt a common scheduling framework such as Celery.

Behaviour when reducing instances of a Bluemix application

I have an orchestrator service which keeps track of the instances that are running and what request they are currently dealing with. If a new instance is required, I make a REST call to increase the instances and wait for the new instance to connect to the orchestrator. It's one request per instance.
The orchestrator tracks whether an instance is doing anything and knows which instances can be stopped; however, there is nothing in the API that allows me to reduce the number of instances by stopping a particular instance, which is what I am trying to achieve.
Is there anything I can do to manipulate the platform into deterministically stopping the instances that I want to stop? Perhaps by holding long-running HTTP requests to the instances I want to keep, killing each request when it's no longer required, and then making the API call to reduce the number of instances?
Part of the issue here is that I don't know the specifics of the current behavior...
Assuming you're talking about Cloud Foundry/Instant Runtime applications, all of the instances of an application run behind a load balancer that uses round-robin to distribute requests across the instances (unless you have a session affinity cookie set up). Differentiating between individual instances for incoming requests or manual scaling is not recommended and is an anti-pattern. You cannot control which instance the scale-down task will choose.
If you really want that level of control over each instance, maybe you should deploy them as separate applications: MyApp1, MyApp2, MyApp3, etc. All of these applications can have the same route (myapp.mybluemix.net). Each application can then distinguish itself by its name (via VCAP_APPLICATION), allowing you to terminate a specific one.
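For instance, each application can read its own name from the VCAP_APPLICATION environment variable that Cloud Foundry injects; a rough sketch (the fallback value is only for running locally):

import json
import os

# Cloud Foundry sets VCAP_APPLICATION to a JSON blob describing the app instance.
vcap = json.loads(os.environ.get("VCAP_APPLICATION", "{}"))
app_name = vcap.get("application_name", "local-dev")  # e.g. "MyApp1", "MyApp2", ...
print("I am", app_name)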

How do message/work queues work, and what's the justification for a dedicated/hosted queue service?

I'm getting into utilising work queues for my app and can easily understand their benefit: computationally or time-intensive tasks (like sending an email) can be put in a queue and run asynchronously from the request that initiated them, so that the request can get on with responding to the user.
For example, in my Laravel app I am using Redis queueing. I push a job onto the queue, and a separate artisan process running on the server (artisan queue:listen) listens on the queue for new jobs to execute.
My confusion comes with the actual queuing system. Surely the queue itself is just a list of jobs, with any pertinent data serialised and passed through. The job itself is still computed by the application.
So this leads me to wonder about the benefit and cost of large-scale hosted queue services like Iron.io or Amazon SQS. These services cost a fair amount for what seems like a fairly straightforward and computationally minimal task. Even with thousands of jobs a minute, a simple Redis or beanstalkd queue on the same server will surely handle the queuing far more easily than the server handles the actual jobs themselves. A hosted system also seems like it would slow the process down due to the latency between servers.
Can anyone explain the workings of a work queue and how these hosted services improve the performance of such an application? I imagine the benefit is only felt once an app has scaled sufficiently, but more insight would be helpful.
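To make the "a queue is just a list of serialised jobs" picture concrete, here is a rough redis-py sketch of what the producer and worker sides do; the queue name and payload are made up:

import json
import redis

r = redis.Redis()  # assumes a local Redis, as in the Laravel setup above

# Producer (the web request): serialise the job and push it -- that's all.
r.lpush("emails", json.dumps({"to": "user@example.com", "template": "welcome"}))

# Worker (the queue:listen process): block until a job arrives, then do the work.
_queue, raw = r.brpop("emails")
job = json.loads(raw)
print("sending email to", job["to"])  # the heavy lifting still happens in the app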