What is the way (or ways) of implementing parallel execution with a limit on the number of concurrent processes, in terms of Aff? I believe there is no such method in the standard libraries, and I didn't find a good, complete answer on this.
parSequenceWithLimit :: Array (Aff X) -> Int -> Aff (Array X)
The Aff X computations should run in parallel, but with no more than the given N concurrent computations at a time. So it starts N of them, and when one completes, the next one (of those remaining) is started.
For this sort of thing a good mechanism is AVar, which is a blocking mutable cell. It can be conceptually thought of as a one-element blocking queue.
First, an AVar may be either empty or full. You can create an empty one with empty, and then you can "fill" it with a value using put. The useful bit here is that, when you call put and the AVar is already "full", put will block until it's empty again.
Second, you can read the value using take, which will return you the value, but leave the AVar empty at the same time. Similarly to put, if the AVar is empty, take will block until it's full.
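If it helps to see this blocking behaviour in a more familiar setting, here is a rough analogy in Python (not PureScript, and not part of the answer itself): a bounded queue of size one behaves much like an AVar, with put blocking while the cell is full and get emptying it again.
import time
from queue import Queue
from threading import Thread

cell = Queue(maxsize=1)   # a one-element blocking "cell", roughly like an AVar

def consumer():
    for _ in range(3):
        time.sleep(1)                 # simulate slow work before each take
        print("took", cell.get())     # get() empties the cell (and blocks if empty)

Thread(target=consumer).start()
for item in ["a", "b", "c"]:
    cell.put(item)                    # put() blocks while the cell is still full
    print("put", item)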
So what you can do with it is the following:
Create a single AVar.
Fork off N processes, each of which will take a value from that AVar and process it, then loop. Forever.
Have an orchestrator process, which will iterate over the whole sequence of work and put work items into the AVar.
When all work processes are busy, the orchestrator process will push another value into the AVar, and then will try to push the next one, but will become blocked at this point, because the AVar is already full. It will remain blocked until one of the work processes finishes its work and calls take to get the next work item, leaving the AVar empty. This will unblock the orchestrator process, which will immediately push the next work item into the AVar, and so on.
The missing bit here is how to stop. If the work processes just do an infinite loop, they will never quit. When the orchestrator process eventually runs out of work and stops filling the AVar, the work processes will just block forever on the take calls. Not good.
So to fight this, have two kinds of work items: (1) actual work and (2) a command to stop processing. Then have the orchestrator process push all the work items first, and once that is done, push N commands to stop. Optionally you can push N+1 commands to stop: this will guarantee that the orchestrator process blocks until the last worker has finished.
Putting all of this together, here's a demo program:
module Main where
import Prelude
import Data.Array ((..))
import Data.Foldable (for_)
import Data.Int (toNumber)
import Effect (Effect)
import Effect.AVar (AVar)
import Effect.Aff (Aff, Milliseconds(..), delay, forkAff, launchAff_)
import Effect.Aff.AVar as AVar
import Effect.Class (liftEffect)
import Effect.Console (log)
data Work a = Work a | Done
-- a worker: take the next item, process it, then loop; stop on Done
process :: Int -> AVar (Work Int) -> Aff Unit
process myIndex v = do
  w <- AVar.take v
  case w of
    Done ->
      pure unit
    Work i -> do
      liftEffect $ log $ "Worker " <> show myIndex <> ": Processing " <> show i
      delay $ Milliseconds $ toNumber i
      liftEffect $ log $ "Worker " <> show myIndex <> ": Processed " <> show i
      process myIndex v

main :: Effect Unit
main = launchAff_ do
  var <- AVar.empty
  -- fork N = 5 workers
  for_ (1..5) \idx -> forkAff $ process idx var
  let inputs = [100,200,300,300,400,1000,2000,101,102,103,104]
  for_ inputs \i -> AVar.put (Work i) var
  -- N+1 = 6 stop commands: the extra one makes the last put block
  -- until the final worker has taken its Done
  for_ (1..6) \_ -> AVar.put Done var
In this program my work items are just numbers, which signify the number of milliseconds to sleep. I'm using this as a model of how "expensive" each work item is to process. The program output will be something like this:
Worker 1: Processing 100
Worker 2: Processing 200
Worker 3: Processing 300
Worker 4: Processing 300
Worker 5: Processing 400
Worker 1: Processed 100
Worker 1: Processing 1000
Worker 2: Processed 200
Worker 2: Processing 2000
Worker 3: Processed 300
Worker 3: Processing 101
Worker 4: Processed 300
Worker 4: Processing 102
Worker 5: Processed 400
Worker 5: Processing 103
Worker 3: Processed 101
Worker 3: Processing 104
Worker 4: Processed 102
Worker 5: Processed 103
Worker 3: Processed 104
Worker 1: Processed 1000
Worker 2: Processed 2000
From the documentation for blockLast():
Subscribe to this Flux and block indefinitely until the upstream signals its last value or completes. Returns that value, or null if the Flux completes empty. In case the Flux errors, the original exception is thrown (wrapped in a RuntimeException if it was a checked exception).
Take, for example, this code sample:
Flux
    .range(0, 1000)
    .doOnNext(i -> System.out.println("i = " + i + "Thread: " + Thread.currentThread().getName()))
    .flatMap(i -> {
        System.out.println("end" + i + " Thread: " + Thread.currentThread().getName());
        return Mono.just(i);
    })
    .blockLast();
If I were to understand this based on the documentation's own description, I'd think blockLast means to block the publisher (in this case until all 1000 integers are emitted successfully, the last one included).
After which .flatMap(..) would be called, one element at a time (since we don't specifically force parallel processing).
However I see the following in the console when run:
i = 0Thread: main
end0 Thread: main
i = 1Thread: main
end1 Thread: main
i = 2Thread: main
end2 Thread: main
i = 3Thread: main
end3 Thread: main
i = 4Thread: main
end4 Thread: main
i = 5Thread: main
Isn't i = 0Thread: main supposed to run through i = 1000Thread: main first, and only then .flatMap gets executed?
i.e.
i = 0Thread: main
i = 1Thread: main
i = 2Thread: main
i = 3Thread: main
i = 4Thread: main
.
.
end1 Thread: main
end2 Thread: main
end3 Thread: main
The behavior is exactly the same if .subscribe() is used. I'm kinda confused here.
The observed behaviour is fine. A Flux describes a sequence of operations that are executed as elements are emitted.
So, in your example, each integer generated by range is immediately processed by the next operation in chain, i.e. flatMap here.
It is the same behaviour as with standard java.util.stream.Stream API.
The reason for this behaviour is twofold: (1) to avoid buffering all elements between each processing step, and (2) because a data source can emit an infinite number of messages, and can emit them at varying frequency (with a constant delay or not, very fast or very slow, etc.). So a stream API is designed to process and return each element as soon as it is received, independently of the messages before or after it.
And about blockLast specifically: internally, it subscribes to the Flux and waits for the completion or error signal, then returns the last value or throws the error to the user.
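To illustrate the "per element, not per stage" flow outside of Reactor, here is a small, purely illustrative Python generator pipeline (an analogy only, not Reactor or java.util.stream code): consuming it interleaves the "produced" and "transformed" lines per element, just like the interleaved i = ... / end... lines above.
def source():
    for i in range(5):
        print("produced %d" % i)
        yield i

def transform(items):
    for i in items:
        print("transformed %d" % i)
        yield i

# Each element flows through the whole pipeline before the next one is produced.
for _ in transform(source()):
    pass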
I'm currently performing computation of the factorial of 10 random numbers using dispy, which "distributes" the tasks to various nodes.
However, if one of the computations is the factorial of a large number, say factorial(100), that task takes a very long time, yet dispy runs it only on a single node.
How do I make sure that dispy breaks down and distributes this task to other nodes, so that it doesn't take so much time?
Here's the code that I have come up with so far, where the factorial of 10 random numbers is calculated and the 5th computation is always factorial(100):
# 'compute' is distributed to each node running 'dispynode'
def compute(n):
    import time, socket
    ans = 1
    for i in range(1, n+1):
        ans = ans * i
    time.sleep(n)
    host = socket.gethostname()
    return (host, n, ans)

if __name__ == '__main__':
    import dispy, random
    cluster = dispy.JobCluster(compute)
    jobs = []
    for i in range(10):
        # schedule execution of 'compute' on a node (running 'dispynode')
        # with a parameter (random number in this case)
        if i == 5:
            job = cluster.submit(100)
        else:
            job = cluster.submit(random.randint(5, 20))
        job.id = i  # optionally associate an ID to job (if needed later)
        jobs.append(job)
    # cluster.wait()  # waits for all scheduled jobs to finish
    for job in jobs:
        host, n, ans = job()  # waits for job to finish and returns results
        print('%s executed job %s at %s with %s as input and %s as output' % (host, job.id, job.start_time, n, ans))
        # other fields of 'job' that may be useful:
        # print(job.stdout, job.stderr, job.exception, job.ip_addr, job.start_time, job.end_time)
    cluster.print_status()
Dispy distributes the tasks as you define them - it doesn't make the tasks more granular for you.
You could create your own logic for granulating the tasks first. That's probably pretty easy to do for a factorial. However, I wonder whether in your case the performance problem is due to this line:
time.sleep(n)
For factorial(100), why do you want to sleep 100 seconds?
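For completeness, here is a rough, untested sketch of granulating the factorial yourself: split the range 1..n into chunks, submit each partial product as its own dispy job, and multiply the partial results locally. The function name partial_product and the chunk size are made up for illustration; only JobCluster, submit, and job() are the dispy calls already used in the question.
def partial_product(lo, hi):
    # product of lo * (lo+1) * ... * hi, computed on a node
    ans = 1
    for i in range(lo, hi + 1):
        ans *= i
    return ans

if __name__ == '__main__':
    import dispy
    n, chunk = 100, 25
    cluster = dispy.JobCluster(partial_product)
    # one job per chunk of the range 1..n
    jobs = [cluster.submit(lo, min(lo + chunk - 1, n)) for lo in range(1, n + 1, chunk)]
    result = 1
    for job in jobs:
        result *= job()   # job() waits for the job to finish and returns its result
    print(result)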
import java.util.concurrent.Executors

// 1 fixed thread
implicit val waitingCtx = scala.concurrent.ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1))

// "map" will use waitingCtx
val ss = (1 to 1000).map { n => // if I change it to 10 000 the program will stop at some point, like locking forever
  service1.doServiceStuff(s"service ${n}").map { s =>
    service1.doServiceStuff(s"service2 ${n}")
  }
}
Each doServiceStuff(name: String) takes 5 seconds. doServiceStuff does not take an implicit ExecutionContext as a parameter; it uses its own execution context internally and does Future { blocking { .. } } on it.
In the end the program prints:
took: 5.775849753 seconds for 1000 x 2 stuffs
If I change 1000 to 10000, adding even more tasks (val ss = (1 to 10000)), then the program stops:
~17 027 lines will be printed (out of 20 000). No "ERROR" message will be printed. No "took" message will be printed.
And it will not process any further.
But if I change the execution context to ExecutionContext.fromExecutor(null: Executor) (the global one), then it ends in about 10 seconds (but not normally):
~17249 lines printed
ERROR: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
took: 10.646309398 seconds
That's the question: why, with the fixed execution-context pool, does it stop without any message, but with the global execution context it terminates, albeit with an error and the messages? And sometimes it is not reproducible.
UPDATE: I do see "ERROR" and "took" if I increase the pool size from 1 to N. It does not matter how high N is; it will still be the ERROR.
The code is here: https://github.com/Sergey80/scala-samples/tree/master/src/main/scala/concurrency/apptmpl
and here, doManagerStuff2()
I think I have an idea of what's going on. If you squint enough, you'll see that the map's duty is extremely lightweight: it just fires off a new future (because doServiceStuff returns a Future). I bet the behavior will change if you switch to flatMap, which will actually flatten the nested future and thus will wait for the second doServiceStuff call to complete.
Since you're not flattening out these futures, all your awaits downstream are awaiting the wrong thing, and you are not catching it because you're discarding whatever the service returns.
Update
Ok, I misinterpreted your question, although I still think that the nested Future is a bug.
When I try your code with both executors and 10000 tasks, I do get an OutOfMemoryError when creating threads in the ForkJoin execution context (i.e., for the service tasks), which I'd expect. Did you use any specific memory settings?
With 1000 tasks they both complete successfully.
If I write a celery task that calls other celery tasks, can I release the parent task/worker without waiting for the downstream tasks to finish?
The situation:
I am working with an API that returns some data and the arguments for the next API call. I want to put all the data behind the API into a database. My current method is to query the API for the batch to work on, start some downstream processors, then recursively re-call the API+processing chain. I fear this will lock up workers waiting for all the recursive API calls to finish, when the workers do not care about the results of their children.
Pseudocode:
@task
def apiPing(start=None):
    """ Returns a dict of 5 elements, starting at the *start* element, or the
    beginning of the list if start is not specified. Also present in the dict is 'remaining',
    indicating how many elements are left in the API's list"""
    return json.loads(api(start))

@task
def processList(data):
    """ Takes a result from API ping, starts a task to store each element and a
    chain to recall the API and process that."""
    for element in data:
        store.delay(element)
    if data['remaining'] != 0:
        api_chain = chain(apiPing.s(data['last']), processList.s())
        api_chain.delay()
I understand from here that the above is very close to bad; I do not want workers handling processList() to be locked up until all of the data in the API is handled. Is there a way to start the downstream tasks and release the parent worker, or refactor the above to not lock up workers?
Testing reveals that workers are in fact locked this way:
from celery import task
from time import sleep

@task
def parent():
    print "In parent"
    child.apply_async()
    print "Out of parent"

@task
def child():
    print "In child"
    sleep(10)
    print "Out of child"
[2013-08-05 18:37:29,264: WARNING/PoolWorker-4] In parent
[2013-08-05 18:37:31,278: WARNING/PoolWorker-2] In child
[2013-08-05 18:37:41,285: WARNING/PoolWorker-2] Out of child
[2013-08-05 18:37:41,298: WARNING/PoolWorker-4] Out of parent
from random import randrange
from time import sleep
#import thread
from threading import Thread
from Queue import Queue

'''The idea is that there is a Seeker method that would search a location
for tasks. I have no idea how many tasks there will be, could be 1 could be 100.
Each task needs to be put into a thread, does its thing and finishes. I have
stripped down a lot of what this is really supposed to do just to focus on the
correct queuing and threading aspect of the program. The locking was just
me experimenting with locking'''

class Runner(Thread):
    current_queue_size = 0

    def __init__(self, queue):
        self.queue = queue
        data = queue.get()
        self.ID = data[0]
        self.timer = data[1]
        #self.lock = data[2]
        Runner.current_queue_size += 1
        Thread.__init__(self)

    def run(self):
        #self.lock.acquire()
        print "running {ID}, will run for: {t} seconds.".format(ID=self.ID, t=self.timer)
        print "Queue size: {s}".format(s=Runner.current_queue_size)
        sleep(self.timer)
        Runner.current_queue_size -= 1
        print "{ID} done, terminating, ran for {t}".format(ID=self.ID, t=self.timer)
        print "Queue size: {s}".format(s=Runner.current_queue_size)
        #self.lock.release()
        sleep(1)
        self.queue.task_done()

def seeker():
    '''Gathers data that would need to enter its own thread.
    For now it just uses a count and random numbers to assign
    both a task ID and a time for each task'''
    queue = Queue()
    queue_item = {}
    count = 1
    #lock = thread.allocate_lock()
    while (count <= 40):
        random_number = randrange(1, 350)
        queue_item[count] = random_number
        print "{count} dict ID {key}: value {val}".format(count=count, key=random_number, val=random_number)
        count += 1

    for n in queue_item:
        #queue.put((n,queue_item[n],lock))
        queue.put((n, queue_item[n]))
        '''I assume it is OK to put a tuple in and pull it out later'''
        worker = Runner(queue)
        worker.setDaemon(True)
        worker.start()
        worker.join()
        '''Which one of these is necessary and why? The queue object
        joining or the thread object'''
        #queue.join()

if __name__ == '__main__':
    seeker()
I have put most of my questions in the code itself, but to go over the main points (Python2.7):
I want to make sure I am not creating some massive memory leak for myself later.
I have noticed that when I run it at a count of 40 in PuTTY or VNC on my Linux box, I don't always get all of the output, but when I use IDLE and Aptana on Windows, I do.
Yes, I understand that the point of Queue is to stagger out your threads so you are not flooding your system's memory, but the tasks at hand are time-sensitive, so they need to be processed as soon as they are detected regardless of how many or how few there are; I have found that when I have a Queue I can clearly dictate when a task has finished, as opposed to letting the garbage collector guess.
I still don't know why I am able to get away with using either the .join() on the thread or on the queue object.
Tips, tricks, general help.
Thanks for reading.
If I understand you correctly, you need a thread to monitor something to see whether there are tasks that need to be done. If a task is found, you want it to run in parallel with the seeker and other currently running tasks.
If this is the case, then I think you might be going about this the wrong way. Take a look at how the GIL works in Python. I think what you might really want here is multiprocessing.
Take a look at this from the pydocs:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.
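As a minimal sketch of that suggestion (assuming the tasks are CPU-bound; the name run_task, the fake durations, and the pool size are illustrative only), a multiprocessing.Pool lets several tasks run truly in parallel across processes instead of threads:
from multiprocessing import Pool
from random import randrange
from time import sleep

def run_task(item):
    task_id, duration = item
    print("running %s for %s seconds" % (task_id, duration))
    sleep(duration)              # stand-in for the real, time-sensitive work
    return task_id

if __name__ == '__main__':
    tasks = [(n, randrange(1, 5)) for n in range(1, 11)]
    pool = Pool(processes=4)     # at most 4 tasks run at once
    for finished in pool.imap_unordered(run_task, tasks):
        print("%s done" % finished)
    pool.close()
    pool.join()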