Retrying celery failed tasks that are part of a chain - celery

I have a celery chain that runs some tasks. Each of the tasks can fail and be retried. Please see below for a quick example:
from celery import task

@task(ignore_result=True)
def add(x, y, fail=True):
    try:
        if fail:
            raise Exception('Ugly exception.')
        print '%d + %d = %d' % (x, y, x+y)
    except Exception as e:
        raise add.retry(args=(x, y, False), exc=e, countdown=10)

@task(ignore_result=True)
def mul(x, y):
    print '%d * %d = %d' % (x, y, x*y)
and the chain:
from celery.canvas import chain
chain(add.si(1, 2), mul.si(3, 4)).apply_async()
Running the two tasks (and assuming that nothing fails), you would see printed:
1 + 2 = 3
3 * 4 = 12
However, when the add task fails the first time and succeeds on a subsequent retry, the rest of the tasks in the chain do not run. That is, the add task fails, none of the other tasks in the chain run, and after a few seconds the add task runs again and succeeds, but the rest of the chain (in this case mul.si(3, 4)) still does not run.
Does celery provide a way to continue failed chains from the task that failed, onwards? If not, what would be the best approach to accomplishing this and making sure that a chain's tasks run in the order specified and only after the previous task has executed successfully even if the task is retried a few times?
Note 1: The issue can be solved by doing
add.delay(1, 2).get()
mul.delay(3, 4).get()
but I am interested in understanding why chains do not work with failed tasks.

You've found a bug :)
Fixed in https://github.com/celery/celery/commit/b2b9d922fdaed5571cf685249bdc46f28acacde3
will be part of 3.0.4.

I'm also interested in understanding why chains do not work with failed tasks.
I dug into the celery code, and this is what I've found so far:
The implementation lives in celery/app/builtins.py:
@shared_task
def add_chain_task(app):
    from celery.canvas import chord, group, maybe_subtask
    _app = app

    class Chain(app.Task):
        app = _app
        name = 'celery.chain'
        accept_magic_kwargs = False

        def prepare_steps(self, args, tasks):
            steps = deque(tasks)
            next_step = prev_task = prev_res = None
            tasks, results = [], []
            i = 0
            while steps:
                # First task get partial args from chain.
                task = maybe_subtask(steps.popleft())
                task = task.clone() if i else task.clone(args)
                i += 1
                tid = task.options.get('task_id')
                if tid is None:
                    tid = task.options['task_id'] = uuid()
                res = task.type.AsyncResult(tid)
                # automatically upgrade group(..) | s to chord(group, s)
                if isinstance(task, group):
                    try:
                        next_step = steps.popleft()
                    except IndexError:
                        next_step = None
                if next_step is not None:
                    task = chord(task, body=next_step, task_id=tid)
                if prev_task:
                    # link previous task to this task.
                    prev_task.link(task)
                    # set the results parent attribute.
                    res.parent = prev_res
                results.append(res)
                tasks.append(task)
                prev_task, prev_res = task, res
            return tasks, results

        def apply_async(self, args=(), kwargs={}, group_id=None, chord=None,
                task_id=None, **options):
            if self.app.conf.CELERY_ALWAYS_EAGER:
                return self.apply(args, kwargs, **options)
            options.pop('publisher', None)
            tasks, results = self.prepare_steps(args, kwargs['tasks'])
            result = results[-1]
            if group_id:
                tasks[-1].set(group_id=group_id)
            if chord:
                tasks[-1].set(chord=chord)
            if task_id:
                tasks[-1].set(task_id=task_id)
                result = tasks[-1].type.AsyncResult(task_id)
            tasks[0].apply_async()
            return result

        def apply(self, args=(), kwargs={}, **options):
            tasks = [maybe_subtask(task).clone() for task in kwargs['tasks']]
            res = prev = None
            for task in tasks:
                res = task.apply((prev.get(), ) if prev else ())
                res.parent, prev = prev, res
            return res
    return Chain
You can see that, at the end of each iteration in prepare_steps, prev_task is linked to the next task.
When prev_task fails, the next task is not called.
I'm experimenting with also adding a link_error from the previous task to the next:
if prev_task:
    # link and link_error previous task to this task.
    prev_task.link(task)
    prev_task.link_error(task)
    # set the results parent attribute.
    res.parent = prev_res
But then the next task must handle both cases (except, perhaps, when it is configured to be immutable, i.e. it does not accept additional arguments).
I think chain could support that by allowing syntax like this:
c = chain(t1, (t2, t1e), (t3, t2e))
which means:
t1 links to t2 and link_errors to t1e
t2 links to t3 and link_errors to t2e
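The proposed semantics can be sketched without celery at all. The following is a toy model in plain Python, not celery's API: run_chain and the t1/t2/t3 functions are hypothetical names, and tasks are ordinary callables. It only illustrates the idea that each tuple carries the next task together with the error handler for the previous one, and that later tasks are skipped after a failure (as with link, whose callbacks run only on success).

```python
def run_chain(*steps):
    """Toy model of chain(t1, (t2, t1e), (t3, t2e)); not celery's API.

    Each step is either a task or a (task, errback_for_previous_task) pair.
    Returns the names of the callables that ran, in order.
    """
    # Normalise every step to a (task, errback_for_previous_task) pair.
    pairs = [s if isinstance(s, tuple) else (s, None) for s in steps]
    trace = []
    for i, (task, _) in enumerate(pairs):
        try:
            task()
            trace.append(task.__name__)
        except Exception as exc:
            # The error handler for task i is stored in the *next* pair.
            errback = pairs[i + 1][1] if i + 1 < len(pairs) else None
            if errback is not None:
                errback(exc)
                trace.append(errback.__name__)
            else:
                trace.append(task.__name__ + ' failed')
            break  # remaining tasks are skipped after a failure
    return trace


# Hypothetical tasks and error handlers for illustration.
def t1(): pass
def t2(): raise ValueError('t2 failed')
def t3(): pass
def t1e(exc): pass
def t2e(exc): pass

# t1 succeeds, t2 fails, so t2e runs and t3 is skipped.
print(run_chain(t1, (t2, t1e), (t3, t2e)))  # ['t1', 't2e']
```

In a real chain the error handlers would be attached with link_error on each signature; this sketch only shows the control flow the tuple syntax would imply.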

Limit the number of single processes in Nextflow workflows

I have the following simple workflow:
workflow {
    Channel.fromPath(params.file_list)
        .splitText(){it.trim()}
        .set { file_list }
    data = GetFromHPSS(file_list)
    data_pairs = CoupleDETXToFile(data, file(params.detx_path))
    SingleDUTimeResFit(data_pairs)
}
Here file_list is a list of paths on a tape-drive system. GetFromHPSS is the process that retrieves files from the tape system, and I need to limit its number of parallel instances to a fairly low number.
Currently, I am using
executor {
    queueSize = 100
}
in the configuration file, but there are two problems:
it limits the overall maximum number of parallel jobs, whereas I could run thousands of SingleDUTimeResFit processes in parallel
it always waits until everything from GetFromHPSS has been processed instead of continuing with the subsequent processes
Here is an example:
N E X T F L O W ~ version 21.04.3
Launching `workflows/singledu_timeresfit.nf` [wise_galileo] - revision: 8084ac1482
executor > sge (502)
[13/ca3e8a] process > GetFromHPSS (426) [ 18%] 402 of 22840
[- ] process > CoupleDETXToFile [ 0%] 0 of 402
[- ] process > SingleDUTimeResFit -
Is there a way to limit GetFromHPSS to a specific number of parallel executions and let the remaining processes run with another queue-limit set?
EDIT: This is my best attempt so far, but Nextflow does not accept the configuration:
process {
    executor {
        queueSize = 100
        submitRateLimit = "10sec"
    }
    withName: GetFromHPSS {
        executor.queueSize = 10
    }
}
With this process top-level configuration, I get:
N E X T F L O W ~ version 21.04.3
Launching `workflows/singledu_timeresfit.nf` [confident_pasteur] - revision: 8084ac1482
Unknown config attribute `process.withName:GetFromHPSS` -- check config file: /sps/km3net/users/tgal/dev/PhD/workflows/nextflow.config
I think what you're looking for here is the maxForks directive, which can be applied to just the 'GetFromHPSS' process without the need to change the executor's queueSize:
process 'GetFromHPSS' {
    maxForks 1

    """
    <your script here>
    """
}
You could even parameterize it, if you think it makes sense:
params.hpss_forks = 5
process 'GetFromHPSS' {
    maxForks params.hpss_forks

    """
    <your script here>
    """
}
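To make the difference between a per-process cap and a global queue limit concrete, here is a conceptual sketch in Python rather than Nextflow. It is only an analogy: run_stage, the task counts and the limits are made up for illustration, and a semaphore per stage stands in for maxForks. It shows that capping one stage leaves the other stage free to run at full width.

```python
import asyncio


async def run_stage(name, n_tasks, max_forks, peaks, duration=0.01):
    """Run n_tasks jobs with at most max_forks at once; record peak concurrency."""
    sem = asyncio.Semaphore(max_forks)  # plays the role of a per-process cap
    running = 0

    async def job():
        nonlocal running
        async with sem:
            running += 1
            peaks[name] = max(peaks.get(name, 0), running)
            await asyncio.sleep(duration)  # stand-in for the real work
            running -= 1

    await asyncio.gather(*(job() for _ in range(n_tasks)))


async def main():
    peaks = {}
    # Hypothetical numbers: the tape stage is capped at 2, the fit stage is not.
    await asyncio.gather(
        run_stage('GetFromHPSS', 20, 2, peaks),
        run_stage('SingleDUTimeResFit', 20, 20, peaks),
    )
    return peaks


peaks = asyncio.run(main())
print(peaks)
```

The capped stage never exceeds its own limit, while the other stage is unaffected, which is the behaviour a single global queueSize cannot express.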

How do I make Simpy simulation to depict a markovian M/M/1 process?

I am trying to implement an M/M/1 Markovian process with exponential inter-arrival and exponential service times using SimPy. The code runs fine, but I don't quite get the expected results. Also, after the code has run, the list of arrival times has fewer items than the list of service times.
# make a markovian queue
# make a server as a resource
# make customers at random times
# record the customer arrival time
# customer gets the resource
# record when the customer got the resource
# serve the customers for a random time using resource
# save this random time as service time
# customer yields the resource and next is served
import statistics
import simpy
import random

arrival_time = []
service_time = []
mean_service = 2.0
mean_arrival = 1.0
num_servers = 1

class Markovian(object):
    def __init__(self, env, num_servers):
        self.env = env
        self.servers = simpy.Resource(env, num_servers)
        #self.action = env.process(self.run())

    def server(self,packet ):
        #timeout after random service time
        t = random.expovariate(1.0/mean_service)
        #service_time.append(t)
        yield self.env.timeout(t)

    def getting_service(env, packet, markovian):
        # new packet arrives in the system
        arrival_time = env.now
        with markovian.servers.request() as req:
            yield req
            yield env.process(markovian.server(packet))
            service_time.append(env.now - arrival_time)

    def run_markovian(env,num_servers):
        markovian = Markovian(env,num_servers)
        packet = 0
        #generate new packets
        while True:
            t = random.expovariate(1.0/mean_arrival)
            arrival_time.append(t)
            yield env.timeout(t)
            packet +=1
            env.process(Markovian.getting_service(env,packet,markovian))

    def get_average_service_time(service_time):
        average_service_time = statistics.mean(service_time)
        return average_service_time

def main():
    random.seed(42)
    env= simpy.Environment()
    env.process(Markovian.run_markovian(env,num_servers))
    env.run(until = 50)
    print(Markovian.get_average_service_time(service_time))
    print (arrival_time)
    print (service_time)

if __name__ == "__main__":
    main()
There was basically one bug in your code and two queueing theory misconceptions:
Bug 1) The servers were defined inside the class, which makes the model behave as M/M/inf, not M/M/1.
Answer: I moved the definition of the resource out of the class and now pass servers instead of num_servers.
Misconception 1: with the times as you defined them:
mean_service = 2.0
mean_arrival = 1.0
The system will generate many more packets than it is able to serve. That's why the sizes of the lists were so different.
Answer:
mean_service = 1.0
mean_arrival = 2.0
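As a quick sanity check on these two parameter choices, the standard steady-state M/M/1 formulas can be evaluated directly. This is a small sketch, not part of the simulation: the helper name mm1_metrics is made up here, and it takes the mean inter-arrival time and mean service time in the same sense as mean_arrival and mean_service above.

```python
def mm1_metrics(mean_arrival, mean_service):
    """Steady-state M/M/1 figures from mean inter-arrival and mean service time."""
    lam = 1.0 / mean_arrival   # arrival rate
    mu = 1.0 / mean_service    # service rate
    rho = lam / mu             # utilisation of the single server
    if rho >= 1.0:
        # The queue grows without bound; no steady state exists.
        return {'rho': rho, 'stable': False}
    return {
        'rho': rho,
        'stable': True,
        'W': 1.0 / (mu - lam),    # mean time in system (wait + service)
        'Wq': rho / (mu - lam),   # mean wait in queue only
        'L': rho / (1.0 - rho),   # mean number of packets in system
    }


print(mm1_metrics(mean_arrival=1.0, mean_service=2.0))  # original values: rho = 2.0, unstable
print(mm1_metrics(mean_arrival=2.0, mean_service=1.0))  # corrected values: rho = 0.5
```

With the corrected values the theory gives a mean time in system of 2.0 and a mean queue wait of 1.0. A single run until time 50 is short, so simulated averages can differ noticeably from these figures.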
Misconception 2:
What you call service time in your code is actually system time (waiting time plus service time).
I also put some prints in your code so we can see what it is doing. Feel free to comment them out. There is no need for the statistics library, so I commented it out too.
I hope this answer is useful to you.
# make a markovian queue
# make a server as a resource
# make customers at random times
# record the customer arrival time
# customer gets the resource
# record when the customer got the resource
# serve the customers for a random time using resource
# save this random time as service time
# customer yields the resource and next is served
#import statistics
import simpy
import random

arrivals_time = []
service_time = []
waiting_time = []
mean_service = 1.0
mean_arrival = 2.0
num_servers = 1

class Markovian(object):
    def __init__(self, env, servers):
        self.env = env
        #self.action = env.process(self.run())

    #def server(self,packet ):
    #    #timeout after random service time
    #    t = random.expovariate(1.0/mean_service)
    #    #service_time.append(t)
    #    yield self.env.timeout(t)

    def getting_service(env, packet, servers):
        # new packet arrives in the system
        begin_wait = env.now
        req = servers.request()
        yield req
        begin_service = env.now
        waiting_time.append(begin_service - begin_wait)
        print('%.1f Begin Service of packet %d' % (begin_service, packet))
        yield env.timeout(random.expovariate(1.0/mean_service))
        service_time.append(env.now - begin_service)
        yield servers.release(req)
        print('%.1f End Service of packet %d' % (env.now, packet))

    def run_markovian(env,servers):
        markovian = Markovian(env,servers)
        packet = 0
        #generate new packets
        while True:
            t = random.expovariate(1.0/mean_arrival)
            yield env.timeout(t)
            arrivals_time.append(t)
            packet +=1
            print('%.1f Arrival of packet %d' % (env.now, packet))
            env.process(Markovian.getting_service(env,packet,servers))

    def get_average_service_time(service_time):
        # plain mean, so the statistics module is not needed
        average_service_time = sum(service_time) / len(service_time)
        return average_service_time

def main():
    random.seed(42)
    env= simpy.Environment()
    servers = simpy.Resource(env, num_servers)
    env.process(Markovian.run_markovian(env,servers))
    env.run(until = 50)
    print(Markovian.get_average_service_time(service_time))
    print ("Time between consecutive arrivals \n", arrivals_time)
    print("Size: ", len(arrivals_time))
    print ("Service Times \n", service_time)
    print("Size: ", len(service_time))
    print ("Waiting Times \n", waiting_time)
    print("Size: ",len(waiting_time))

if __name__ == "__main__":
    main()

Simpy: How can I represent failures in a train subway simulation?

New python user here and first post on this great website. I haven't been able to find an answer to my question so hopefully it is unique.
Using SimPy, I am trying to create a train subway/metro simulation with failures and repairs periodically built into the system. These failures happen to the train but also to signals on sections of track and on platforms. I have read and applied the official Machine Shop example (which the attached code resembles) and have thus managed to model random failures and repairs of the train by interrupting its 'journey time'.
However I have not figured out how to model failures of signals on the routes which the trains follow. I am currently just specifying a time for a trip from A to B, which does get interrupted but only due to train failure.
Is it possible to define each trip as its own process, i.e. a separate process for sections A_to_B and B_to_C and for separate platforms pA, pB and pC, each with a single resource (to allow only one train on it at a time), and to incorporate random failures and repairs for these section and platform processes? I would also perhaps need several sections between two platforms, any of which could experience a failure.
Any help would be greatly appreciated.
Here's my code so far:
import random
import simpy
import numpy

RANDOM_SEED = 1234
T_MEAN_A = 240.0 # mean journey time
T_MEAN_EXPO_A = 1/T_MEAN_A # for exponential distribution
T_MEAN_B = 240.0 # mean journey time
T_MEAN_EXPO_B = 1/T_MEAN_B # for exponential distribution
DWELL_TIME = 30.0 # amount of time train sits at platform for passengers
DWELL_TIME_EXPO = 1/DWELL_TIME
MTTF = 3600.0 # mean time to failure (seconds)
TTF_MEAN = 1/MTTF # for exponential distribution
REPAIR_TIME = 240.0
REPAIR_TIME_EXPO = 1/REPAIR_TIME
NUM_TRAINS = 1
SIM_TIME_DAYS = 100
SIM_TIME = 3600 * 18 * SIM_TIME_DAYS
SIM_TIME_HOURS = SIM_TIME/3600

# Defining the times for processes
def A_B(): # returns processing time for journey A to B
    return random.expovariate(T_MEAN_EXPO_A) + random.expovariate(DWELL_TIME_EXPO)

def B_C(): # returns processing time for journey B to C
    return random.expovariate(T_MEAN_EXPO_B) + random.expovariate(DWELL_TIME_EXPO)

def time_to_failure(): # returns time until next failure
    return random.expovariate(TTF_MEAN)

# Defining the train
class Train(object):
    def __init__(self, env, name, repair):
        self.env = env
        self.name = name
        self.trips_complete = 0
        self.broken = False
        # Start "travelling" and "break_train" processes for the train
        self.process = env.process(self.running(repair))
        env.process(self.break_train())

    def running(self, repair):
        while True:
            # start trip A_B
            done_in = A_B()
            while done_in:
                try:
                    # going on the trip
                    start = self.env.now
                    yield self.env.timeout(done_in)
                    done_in = 0 # Set to 0 to exit while loop
                except simpy.Interrupt:
                    self.broken = True
                    done_in -= self.env.now - start # How much time left?
                    with repair.request(priority = 1) as req:
                        yield req
                        yield self.env.timeout(random.expovariate(REPAIR_TIME_EXPO))
                    self.broken = False
            # Trip is finished
            self.trips_complete += 1
            # start trip B_C
            done_in = B_C()
            while done_in:
                try:
                    # going on the trip
                    start = self.env.now
                    yield self.env.timeout(done_in)
                    done_in = 0 # Set to 0 to exit while loop
                except simpy.Interrupt:
                    self.broken = True
                    done_in -= self.env.now - start # How much time left?
                    with repair.request(priority = 1) as req:
                        yield req
                        yield self.env.timeout(random.expovariate(REPAIR_TIME_EXPO))
                    self.broken = False
            # Trip is finished
            self.trips_complete += 1

    # Defining the failure
    def break_train(self):
        while True:
            yield self.env.timeout(time_to_failure())
            if not self.broken:
                # Only break the train if it is currently working
                self.process.interrupt()

# Setup and start the simulation
print('Train trip simulator')
random.seed(RANDOM_SEED) # Helps with reproduction
# Create an environment and start setup process
env = simpy.Environment()
repair = simpy.PreemptiveResource(env, capacity = 1)
trains = [Train(env, 'Train %d' % i, repair)
          for i in range(NUM_TRAINS)]
# Execute
env.run(until = SIM_TIME)
# Analysis
trips = []
print('Train trips after %s hours of simulation' % SIM_TIME_HOURS)
for train in trains:
    print('%s completed %d trips.' % (train.name, train.trips_complete))
    trips.append(train.trips_complete)
mean_trips = numpy.mean(trips)
std_trips = numpy.std(trips)
print "mean trips: %d" % mean_trips
print "standard deviation trips: %d" % std_trips
It looks like you are using Python 2, which is a bit unfortunate, because
Python 3.3 and above give you some more flexibility with Python generators. But
your problem should be solvable in Python 2 nonetheless.
You can use sub-processes within a process:
def sub(env):
    print('I am a sub process')
    yield env.timeout(1)
    # return 23 # Only works in py3.3 and above
    env.exit(23) # Workaround for older python versions

def main(env):
    print('I am the main process')
    retval = yield env.process(sub(env))
    print('Sub returned', retval)
As you can see, you can use Process instances returned by Environment.process()
like normal events. You can even use return values in your sub-processes.
If you use Python 3.3 or newer, you don’t have to explicitly start a new
sub-process but can use sub() as a subroutine instead and just forward the
events it yields:
def sub(env):
    print('I am a sub routine')
    yield env.timeout(1)
    return 23

def main(env):
    print('I am the main process')
    retval = yield from sub(env)
    print('Sub returned', retval)
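The forwarding behaviour of yield from is a feature of Python generators and does not depend on SimPy. A minimal sketch with plain generators (Python 3.3 or newer; the string "events" are placeholders, not SimPy events) shows that everything the subroutine yields passes through the caller, and that its return value becomes the value of the yield from expression:

```python
def sub():
    # Placeholder "events"; in SimPy these would be env.timeout(...) etc.
    yield 'event-1'
    yield 'event-2'
    return 23


def main():
    # Everything sub() yields is forwarded to whoever drives main().
    retval = yield from sub()
    yield 'sub returned %d' % retval


print(list(main()))  # ['event-1', 'event-2', 'sub returned 23']
```

In a simulation, the environment plays the role of list() here: it drives the outer generator and receives the forwarded events.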
You may also be able to model signals as resources that may be used either
by a failure process or by a train. If the failure process requests the signal
first, the train has to wait in front of the signal until the failure
process releases the signal resource. If the train is already passing the
signal (and thus has the resource), the signal cannot break. I don’t think
that’s a problem because the train can’t stop anyway. If it should be
a problem, just use a PreemptiveResource.
I hope this helps. Please feel welcome to join our mailing list for more
discussions.

Simpy subway simulation: how to fix interrupt failure of class train while queueing for a resource?

I am working on a train simulation in simpy and have had success so far with a single train entity following the code below.
The train's process consists of sections followed by platforms. Each section and platform has a resource with a capacity of 1 to ensure that only one train uses it at a time.
However, I can't find a way around the error below:
When I add a second train to the simulation, occasionally one train waits for an unavailable resource and a failure then occurs on that train while it is waiting.
I end up with an Interrupt: Interrupt() error.
Is there a way around these failures while queueing for resources?
Any help is much appreciated.
import random
import simpy
import numpy

# Configure parameters for the model
RANDOM_SEED = random.seed() # makes results repeatable
T_MEAN_SECTION = 200.0 # journey time (seconds)
DWELL_TIME = 30.0 # dwell time mean (seconds)
DWELL_TIME_EXPO = 1/DWELL_TIME # for exponential distribution
MTTF = 600.0 # mean time to failure (seconds)
TTF_MEAN = 1/MTTF # for exponential distribution
REPAIR_TIME = 120.0 # mean repair time for when failure occurs (seconds)
REPAIR_TIME_EXPO = 1/REPAIR_TIME # for exponential distribution
NUM_TRAINS = 2 # number of trains to simulate
SIM_TIME_HOURS = 1 # sim time in hours
SIM_TIME_DAYS = SIM_TIME_HOURS/18.0 # number of days to simulate
SIM_TIME = 3600 * 18 * SIM_TIME_DAYS # sim time in seconds (this is used in the code below)

# Defining the times for processes
def Section(): # returns processing time for platform 7 Waterloo to 26 Bank
    return T_MEAN_SECTION

def Dwell(): # returns processing time for platform 25 Bank to platform 7 Waterloo
    return random.expovariate(DWELL_TIME_EXPO)

def time_to_failure(): # returns time until next failure
    return random.expovariate(TTF_MEAN)

# Defining the train
class Train(object):
    def __init__(self, env, name, repair):
        self.env = env
        self.name = name
        self.trips_complete = 0
        self.num_saf = 0
        self.sum_saf = 0
        self.broken = False
        # Start "running" and "downtime_train" processes for the train
        self.process = env.process(self.running(repair))
        env.process(self.downtime_train())

    def running(self, repair):
        while True:
            # request section A
            request_SA = sectionA.request()
            ########## SIM ERROR IF FAILURE OCCURS HERE ###########
            yield request_SA
            done_in_SA = Section()
            while done_in_SA:
                try:
                    # going on the trip
                    start = self.env.now
                    print('%s leaving platform at time %d') % (self.name, env.now)
                    # processing time
                    yield self.env.timeout(done_in_SA)
                    # releasing the section resource
                    sectionA.release(request_SA)
                    done_in_SA = 0 # Set to 0 to exit while loop
                except simpy.Interrupt:
                    self.broken = True
                    delay = random.expovariate(REPAIR_TIME_EXPO)
                    print('Oh no! Something has caused a delay of %d seconds to %s at time %d') % (delay, self.name, env.now)
                    done_in_SA -= self.env.now - start # How much time left?
                    with repair.request(priority = 1) as request_D_SA:
                        yield request_D_SA
                        yield self.env.timeout(delay)
                    self.broken = False
                    print('Okay all good now, failure fixed on %s at time %d') % (self.name, env.now)
                    self.num_saf += 1
                    self.sum_saf += delay
            # request platform A
            request_PA = platformA.request()
            ########## SIM ERROR IF FAILURE OCCURS HERE ###########
            yield request_PA
            done_in_PA = Dwell()
            while done_in_PA:
                try:
                    # platform process
                    start = self.env.now
                    print('%s arriving to platform A and opening doors at time %d') % (self.name, env.now)
                    yield self.env.timeout(done_in_PA)
                    print('%s closing doors, ready to depart platform A at %d\n') % (self.name, env.now)
                    # releasing the platform resource
                    platformA.release(request_PA)
                    done_in_PA = 0 # Set to 0 to exit while loop
                except simpy.Interrupt:
                    self.broken = True
                    delay = random.expovariate(REPAIR_TIME_EXPO)
                    print('Oh no! Something has caused a delay of %d seconds to %s at time %d') % (delay, self.name, env.now)
                    done_in_PA -= self.env.now - start # How much time left?
                    with repair.request(priority = 1) as request_D_PA:
                        yield request_D_PA
                        yield self.env.timeout(delay)
                    self.broken = False
                    print('Okay all good now, failure fixed on %s at time %d') % (self.name, env.now)
                    self.num_saf += 1
                    self.sum_saf += delay
            # Round trip is finished
            self.trips_complete += 1

    # Defining the failure event
    def downtime_train(self):
        while True:
            yield self.env.timeout(time_to_failure())
            if not self.broken:
                # Only break the train if it is currently working
                self.process.interrupt()

# Setup and start the simulation
print('Train trip simulator')
random.seed(RANDOM_SEED) # Helps with reproduction
# Create an environment and start setup process
env = simpy.Environment()
# Defining resources
platformA = simpy.Resource(env, capacity = 1)
sectionA = simpy.Resource(env, capacity = 1)
repair = simpy.PreemptiveResource(env, capacity = 10)
trains = [Train(env, 'Train %d' % i, repair)
          for i in range(NUM_TRAINS)]
# Execute
env.run(until = SIM_TIME)
Your processes request a resource and never release it. That’s why the second train waits forever for its request to succeed. While it is waiting, the failure process seems to interrupt the process. That’s why you get an error. Please read the guide to resources to understand how SimPy’s resources work and, especially, how to release a resource when you are done.

Playframework parallel rendering

I want to process several tasks in parallel inside an Action and push each task's result back as soon as it completes, in first-completed order.
For example, if task A completes in 5 secs, task B completes in 3 secs and task C completes in 1 sec, the output should be "C", "B", "A".
The following code seems to output the results in the wrong order and waits for all the tasks to complete before outputting anything.
def lookup = Action { implicit req =>
  val a = Enumerator( Await.result(Promise.timeout("A", 5 seconds), 1 minute))
  val b = Enumerator( Await.result(Promise.timeout("B", 3 seconds), 1 minute))
  val c = Enumerator( Await.result(Promise.timeout("C", 1 second), 1 minute))
  val d = a >- b >- c
  Ok.chunked(d &> Comet(callback = "console.log"))
}
Your code is broken because of how you are using Await.result. The line that defines a doesn't complete until Await.result returns, and so the promise for b never starts until after the one for a has finished. If you use something like:
val a = Enumerator.flatten(Future.firstCompletedOf(List(
  Promise.timeout("A", 5 seconds),
  Promise.timeout(throw new Exception("A timed out"), 1 minute)
)).map(Enumerator(_)))
You will get correct behavior.
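The underlying idea, starting all tasks first and then consuming results as each one finishes instead of blocking on them one at a time, can be illustrated outside Play. The following is an analogy in Python's asyncio, not Play or Scala code: the names task and lookup and the scaled-down delays are chosen here for illustration only.

```python
import asyncio


async def task(name, delay):
    # Stand-in for a unit of work that finishes after `delay` seconds.
    await asyncio.sleep(delay)
    return name


async def lookup():
    # Start all three tasks up front so they run concurrently
    # (delays scaled down from 5 s / 3 s / 1 s).
    jobs = [
        asyncio.create_task(task('A', 0.15)),
        asyncio.create_task(task('B', 0.10)),
        asyncio.create_task(task('C', 0.05)),
    ]
    order = []
    # as_completed yields results in completion order, not start order.
    for finished in asyncio.as_completed(jobs):
        order.append(await finished)
    return order


order = asyncio.run(lookup())
print(order)  # ['C', 'B', 'A']
```

Blocking on each task in turn, as Await.result does in the question, would instead produce the results in declaration order and only after the slowest one has been waited for.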