Celery Group task for use in a map/reduce workflow - celery

Can I use a Celery Group primitive as the umbrella task in a map/reduce workflow?
Or more specific: Can the subtasks in a Group be run on multiple workers on multiple servers?
From the docs:
However, if you call apply_async on the group it will send a special
grouping task, so that the action of calling the tasks happens in a worker
instead of the current process
That seems to imply the tasks are all sent to one worker...
Before 3.0 (and still) one could fire off the subtasks in a TaskSet which would run on multiple servers. The problem is determining whether all tasks have finished executing. That is normally done by polling all subtasks which is not really elegant.
I am wondering if the Group primitive can be used to mitigate this problem.

I found out it is possible to use chords for such a map/reduce-like problem.
import time

from celery import Celery, chord, current_task

# Assumption: broker and result backend settings are supplied elsewhere;
# chords need a result backend to collect the map results.
celery = Celery('ic')

@celery.task(name='ic.mapper')
def mapper():
    # Split your problem into embarrassingly parallel maps
    maps = [map.s() for _ in range(8)]
    # and put them in a chord that executes them in parallel and calls 'reduce' after they finish
    mapreduce = chord(maps)(reduce.s())
    return "{0} mapper ran on {1}".format(current_task.request.id, current_task.request.hostname)

@celery.task(name='ic.map')
def map():
    # Do something useful here
    time.sleep(10.0)
    return "{0} map ran on {1}".format(current_task.request.id, current_task.request.hostname)

@celery.task(name='ic.reduce')
def reduce(results):
    # Put the maps together and do something with the results
    return "{0} reduce ran on {1}".format(current_task.request.id, current_task.request.hostname)
When the mapper is executed on a cluster of three workers/servers, the mapper task runs first: it splits your problem and then creates new subtasks that are again submitted to the broker. These run in parallel because the queue is consumed by all workers. A chord task is also created that polls all maps to see whether they have finished. When they are done, the reduce task is executed, where you can glue your results back together.
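The chord pattern described here (run all maps in parallel, then call reduce once with the list of their results) can be illustrated without a broker. The following is only an analogy built on the standard library, not Celery's implementation; map_part, reduce_parts and run_chord are hypothetical names chosen for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def map_part(part):
    # Stand-in for the 'map' task: process one independent slice.
    return sum(part)


def reduce_parts(results):
    # Stand-in for the 'reduce' task: receives the list of all map results.
    return sum(results)


def run_chord(parts, max_workers=4):
    # Header: run every map in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(map_part, parts))  # keeps input order
    # Body: called exactly once, after all maps have finished.
    return reduce_parts(results)


parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
total = run_chord(parts)
print(total)
```

In Celery the waiting is done by the chord machinery on the worker side, so the caller does not have to poll the subtasks itself.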
In all: yes it is possible. Thanks for the vegetable guys!

Related

Wait for all LSF jobs with given name, overriding JOB_DEP_LAST_SUB = 1

I've got a large computational task, consisting of several steps, that I run on a PC cluster, managed by LSF.
Part of this task includes launching several parallel jobs with identical names. Jobs are somewhat different, therefore it is hard to transform them to a job array.
The next step of this computation, following these jobs, summarizes their results, therefore it must wait until all of them are finished.
I'm trying to use -w ended(job-name) command line switch for bsub, as usual, to specify job dependencies.
However, admins of the cluster have set JOB_DEP_LAST_SUB = 1 in lsb.params.
According to the LSF manual, this makes LSF wait only for the most recently submitted job with the given name, instead of for all jobs with that name.
Is it possible to override this behavior for my task only without asking admins to reconfigure the whole cluster (this cluster is used by many people, it is very unlikely that they agree)?
I cannot find any clues in the manual.
Looks like it cannot be overridden.
I've changed the job names to make them unique by appending a random value, and then changed the condition to -w ended(job-name-*)
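A minimal sketch of that workaround, assuming the jobs are submitted from a Python driver script. It only builds the bsub argument lists and does not run them; the helper names and the ./step.sh and ./summarize.sh commands are hypothetical, and the dependency expression follows the form quoted above (your LSF version may want the job name quoted inside ended(...)).

```python
import uuid


def unique_job_name(base):
    # Append a random suffix so every submitted job gets a distinct name.
    return "{0}-{1}".format(base, uuid.uuid4().hex[:8])


def bsub_parallel_job(base, command):
    # Argument list for one of the parallel jobs (-J sets the job name).
    return ["bsub", "-J", unique_job_name(base), command]


def bsub_summary_job(base, command):
    # Depend on every job whose name matches 'base-*'.
    dependency = "ended({0}-*)".format(base)
    return ["bsub", "-w", dependency, command]


jobs = [bsub_parallel_job("job-name", "./step.sh") for _ in range(3)]
summary = bsub_summary_job("job-name", "./summarize.sh")
print(summary)
```

Passing each list to subprocess.run avoids having to quote the wildcard for the shell.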

Give an entire Celery chain priority over new tasks

I want to launch a chain of Celery tasks, and have them all execute before any newly arriving tasks do. I'll have a single worker process handling all tasks.
I guess the easiest thing to do would be to not make them a chain at all, but instead launch a single task that synchronously calls a sequence of functions. But I'd like to take advantage of Celery retries, allowing each task to be retried a different number of times.
What's the best way to do this?
If you have a single worker running a single process then as far as I can tell from working with celery (this is not explicitly documented) you should get the behavior you want.
If you want to use multiple worker processes then you may need to set CELERYD_PREFETCH_MULTIPLIER to 1.
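As a configuration fragment, the setting mentioned above would look roughly like this. The module name celeryconfig.py is an assumption; the uppercase name is the old-style setting used in this answer, and Celery 4 and later spell it worker_prefetch_multiplier.

```python
# celeryconfig.py (hypothetical module name)

# Each worker process reserves only one task at a time instead of
# prefetching several, so queued tasks are not claimed ahead of time.
CELERYD_PREFETCH_MULTIPLIER = 1
```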

Apache-Spark Internal Job Scheduling

I came across the feature in Spark where it allows you to schedule different tasks within a spark context.
I want to implement this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K,V], then build a composite-key RDD [(K1,K2),V] and a filtered RDD containing some specific values.
Further pipeline involves calling some statistical methods from MLlib on both the RDDs and a join operation followed by externalizing the result to disk.
I am trying to understand how Spark's internal fair scheduler will handle these operations. I tried reading the job scheduling documentation but got more confused by the concepts of pools, users and tasks.
What exactly are the pools? Are they certain 'tasks' which can be grouped together, or are they Linux users pooled into a group?
What are users in this context? Do they refer to threads, or is it something like SQL context queries?
I guess it relates to how are tasks scheduled within a spark context. But reading the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
All the pipelined procedure you described in Paragraph 2:
map -> map -> map -> filter
will be handled in a single stage, just like a map() in MapReduce, if that is familiar to you. This is because there is no need to repartition or shuffle your data, since you make no requirements on the correlation between records. Spark chains as many transformations as possible into the same stage before creating a new one, because that is much more lightweight. More information on stage separation can be found in the paper: Resilient Distributed Datasets, Section 5.1 Job Scheduling.
When the stage is executed, it is one task set (the same task running in different threads), and from Spark's perspective its tasks are scheduled simultaneously.
The fair scheduler is meant for scheduling unrelated task sets, so it is not relevant here.
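The idea of chaining map -> map -> map -> filter into one stage can be pictured with plain Python generators. This is an analogy only, not Spark code: each partition is handled by one "task" that pushes every record through all steps in a single pass, with no shuffle in between. All function names and the sample data are hypothetical.

```python
def parse(lines):
    # map 1: text line -> (key, value)
    for line in lines:
        key, value = line.split(",")
        yield key, int(value)


def to_composite(pairs):
    # map 2: (K, V) -> ((K1, K2), V)
    for key, value in pairs:
        yield (key, len(key)), value


def scale(pairs):
    # map 3: transform the value only
    for key, value in pairs:
        yield key, value * 2


def keep_positive(pairs):
    # filter: keep records with a positive value
    for key, value in pairs:
        if value > 0:
            yield key, value


def run_stage(partition):
    # The steps are chained lazily, so each record flows through all of
    # them before the next record is read.
    return list(keep_positive(scale(to_composite(parse(partition)))))


partitions = [["a,1", "bb,-2"], ["ccc,3"]]
# One task per partition, all running the same fused pipeline.
results = [run_stage(p) for p in partitions]
print(results)
```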

Spring Batch - Executing multiple instances of a job at same time

I have a clarification.
Is it possible for us to run multiple instances of a job at the same time?
Currently, we have a single instance of a job at any given time.
If it is possible, please let me know how to do it.
Yes you can. Spring Batch distinguishes jobs based on the JobParameters. So if you always pass different JobParameters to the same job, you will have multiple instances of the same job running.
A simple way is just to add a UUID parameter to each request to start a job.
Example:
final JobParametersBuilder jobParametersBuilder = new JobParametersBuilder();
jobParametersBuilder.addString("instance_id", UUID.randomUUID().toString(), true);
jobLauncher.run(job,jobParametersBuilder.toJobParameters());
The boolean 'true' at the end signals to Spring Batch that the parameter is part of the 'identity' of the job instance, so you will always get a new instance with each 'run' of the job.
Yes you can very much run tasks in parallel as also documented here
But there are certain things to be considered
Does your application logic need parallel execution? If you are going to run steps in parallel, you have to build the application logic so that the work done by the parallel steps does not overlap (unless that is the intention of your application).
Yes, it's completely possible to have multiple instances (or executions) of a job run concurrently.

Running two instances of the ScheduledThreadPoolExecutor

I have a number of asynchronous tasks to run in parallel. All the tasks can be divided into two types; let's call the time-consuming ones type A and everything else (the ones that are quick to execute) type B.
With a single ScheduledThreadPoolExecutor with pool size x, at some point all threads end up busy executing type A tasks; as a result, type B tasks get blocked and delayed.
What I'm trying to accomplish is to run type A tasks in parallel with type B tasks, and I want the tasks of each type to run in parallel within their group for performance.
Do you think it's prudent to have two instances of ScheduledThreadPoolExecutor, one for type A and one for type B, each with its own thread pool? Do you see any issues with this approach?
No, that seems reasonable.
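The effect of giving each task type its own pool can be sketched as follows. This is a Python analogue using concurrent.futures rather than ScheduledThreadPoolExecutor, with hypothetical task functions; an Event keeps the type A tasks "busy" so the example does not depend on timing.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

release_slow = threading.Event()


def slow_task(n):
    # Type A: blocks until released, standing in for long-running work.
    release_slow.wait(timeout=5)
    return "A{0}".format(n)


def fast_task(n):
    # Type B: finishes immediately.
    return "B{0}".format(n)


slow_pool = ThreadPoolExecutor(max_workers=2)  # dedicated to type A
fast_pool = ThreadPoolExecutor(max_workers=2)  # dedicated to type B

slow_futures = [slow_pool.submit(slow_task, i) for i in range(4)]
fast_futures = [fast_pool.submit(fast_task, i) for i in range(4)]

# Type B completes even though every type A thread is still busy.
fast_results = [f.result(timeout=5) for f in fast_futures]
slow_still_running = not any(f.done() for f in slow_futures)

release_slow.set()
slow_results = [f.result(timeout=5) for f in slow_futures]

slow_pool.shutdown()
fast_pool.shutdown()
```

With a single shared pool of two threads, the type B tasks would sit in the queue behind the blocked type A tasks instead.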
I am doing something similar, i.e. I need to execute tasks serially depending on some id: all the tasks for the component with id="1" need to be executed one after another, and in parallel with all other tasks, which are for components with different ids.
so basically I need a separate queue of tasks for each different component, the tasks are pulled one after another from each specific queue.
In order to achieve that I use
Executors.newSingleThreadExecutor(new JobThreadFactory(componentId));
for each component.
Additionally I need ExecutorService for a different type of tasks which are not bound to componentIds, for that I create additional ExecutorService instance
Executors.newFixedThreadPool(DEFAULT_THREAD_POOL_SIZE, new JobThreadFactory());
This works fine for my case at least.
The only problem I can think of arises if the tasks need ordered execution, i.e.
task2 NEEDS to be executed after task1 and so on... But I doubt this is the case here...