Solutions to Python's multiprocessing Queue buffer deadlock? How to "get" from a multiprocessing Queue when it's full and continue multiprocessing?

This question is about multiprocessing in Python and the buffer limit that my computer's OS pipe imposes on Python's multiprocessing Queue. Basically, I hit the limit of the multiprocessing Queue's buffer.
Here is my simple implementation of what I have so far:
import os
from multiprocessing import Process, Queue, Lock, Manager

def threaded_results(q, *args):
    """do something"""
    q.put(*args)

def main():
    manager = Manager()
    return_dict = manager.dict()
    cpu = os.cpu_count()
    q = Queue()
    processes = []
    for i in range(cpu):
        # args stands for the work arguments, which are not shown here
        p = Process(target=threaded_results, args=(q, *args))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    results = [q.get() for proc in processes]
I read that I have to empty the queue before adding to it again, coordinated by something called a semaphore. I'm considering using my own data structure or refactoring the design of my code. The question is: are there any conventional solutions for getting around the OS-level Queue buffer limit when storing things in cache memory using Python? How do I "get" from a multiprocessing Queue when it's full and continue multiprocessing?

After working with the multiprocessing library for a while, I've found that the simplest way to implement a robust multiprocessing queue is to use multiprocessing.Manager objects. From the docs:
Create a shared queue.Queue object and return a proxy for it.
Rather than allocating a separate thread for flushing data through a pipe, a Manager object creates and manages a standard multithreading queue, which doesn't have to have data flushed through a Pipe (haven't looked through the source code, so I can't say for sure). This means your code can keep chugging away practically indefinitely.
None of this is free, and I've found that the managed queue operates much (almost 20x) slower than a multiprocessing queue in a simple test, though the difference isn't nearly as noticeable when the queue is integrated into a full system, due to other bottlenecks.
Using managed queues can make your IPC far more robust, and it's likely a good idea to take the performance trade-off unless you can find a way to live with the unreliability of a normal multiprocessing queue.
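As a minimal sketch of the two options discussed here: run_managed uses a Manager queue and joins the workers before reading anything, while run_drain_first keeps a plain multiprocessing queue and reads every result before joining, which is the ordering the Python documentation recommends for processes that use queues. The function names, the payload size and the optional ctx parameter are assumptions made for this illustration, not part of the question's code.

```python
import multiprocessing as mp

PAYLOAD_SIZE = 1_000_000  # assumed size, large enough to exceed a typical pipe buffer


def worker(q, worker_id):
    # Put one large result on the queue, tagged with the worker's id.
    q.put((worker_id, "x" * PAYLOAD_SIZE))


def run_managed(num_workers=4, ctx=None):
    """Managed queue: items are stored in the manager's server process,
    so the workers can exit and be joined before anything is read."""
    ctx = ctx or mp.get_context()
    with ctx.Manager() as manager:
        q = manager.Queue()
        procs = [ctx.Process(target=worker, args=(q, i)) for i in range(num_workers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        return sorted(q.get()[0] for _ in procs)


def run_drain_first(num_workers=4, ctx=None):
    """Plain multiprocessing queue: read every result first, then join."""
    ctx = ctx or mp.get_context()
    q = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(q, i)) for i in range(num_workers)]
    for p in procs:
        p.start()
    results = [q.get() for _ in procs]  # drain while the workers are still alive
    for p in procs:
        p.join()
    return sorted(worker_id for worker_id, _ in results)


if __name__ == "__main__":
    print(run_managed())
    print(run_drain_first())
```

Either ordering avoids the hang in the question's code, where join() is called while the workers are still trying to flush their data into the pipe.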

Related

Using libevent together with GCD (libdispatch) in Swift

I'm creating a server-side app in Swift 3. I've chosen libevent for implementing the networking code because it's cross-platform and doesn't suffer from the C10k problem. Libevent implements its own event loop, but I want to keep CFRunLoop and GCD (DispatchQueue.main.after etc.) functional as well, so I need to glue them together somehow.
This is what I've come up with:
var terminated = false
DispatchQueue.main.after(when: DispatchTime.now() + 3) {
    print("Dispatch works!")
    terminated = true
}
while !terminated {
    switch event_base_loop(eventBase, EVLOOP_NONBLOCK) { // libevent
    case 1:
        break // No events were processed
    case 0:
        print("DEBUG: Libevent processed one or more events")
    default: // -1
        print("Unhandled error in network backend")
        exit(1)
    }
    RunLoop.current().run(mode: RunLoopMode.defaultRunLoopMode,
                          before: Date(timeIntervalSinceNow: 0.01))
}
This works, but introduces a latency of 0.01 sec. While RunLoop is sleeping, libevent can't process events. Lowering this timeout increases CPU usage significantly when the app is idle.
I was also considering using only libevent, but third-party libs in the project can use dispatch_async internally, so this can be problematic.
Running libevent's loop in a different thread makes synchronization more complex. Is this the only way of solving this latency issue?
LINUX UPDATE. The above code does not work on Linux (2016-07-25-a Swift snapshot): RunLoop.current().run exits with an error. Below is a working Linux version reimplemented with a timer and dispatch_main. It suffers from the same latency issue:
let queue = dispatch_get_main_queue()
let timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, queue)
let interval = 0.01
let block: () -> () = {
    guard !terminated else {
        print("Quitting")
        exit(0)
    }
    switch server.loop() {
    case 1: break // Just idling
    case 0: break //print("Libevent: processed event(s)")
    default: // -1
        print("Unhandled error in network backend")
        exit(1)
    }
}
block()
let fireTime = dispatch_time(DISPATCH_TIME_NOW, Int64(interval * Double(NSEC_PER_SEC)))
dispatch_source_set_timer(timer, fireTime, UInt64(interval * Double(NSEC_PER_SEC)), UInt64(NSEC_PER_SEC) / 10)
dispatch_source_set_event_handler(timer, block)
dispatch_resume(timer)
dispatch_main()
A quick search of the Open Source Swift Foundation libraries on GitHub reveals that the support in CFRunLoop is (perhaps obviously) implemented differently on different platforms. This means, in essence, that RunLoop and libevent, with respect to cross-platform-ness, are just different ways to achieve the same thing. I can see the thinking behind the thought that libevent is probably better suited to server implementations, since CFRunLoop didn't grow up with that specific goal, but as far as being cross-platform goes, they're both barking up the same tree.
That said, the underlying synchronization primitives used by RunLoop and libevent are inherently private implementation details and, perhaps more importantly, different between platforms. From the source, it looks like RunLoop uses epoll on Linux, as does libevent, but on macOS/iOS/etc. RunLoop is going to use Mach ports as its fundamental primitive, while libevent looks like it's going to use kqueue. You might, with enough effort, be able to make a hybrid RunLoopSource that ties to a libevent source for a given platform, but this would likely be very fragile, and generally ill-advised, for a couple of reasons. First, it would be based on private implementation details of RunLoop that are not part of the public API, and therefore subject to change at any time without notice. Second, assuming you didn't go through and do this for every platform supported by both Swift and libevent, you would have broken the cross-platform-ness of it, which was one of your stated reasons for going with libevent in the first place.
One additional option you might not have considered would be to use GCD by itself, without RunLoops. Look at the docs for dispatch_main. In a server application, there's (typically) nothing special about a "main thread," so dispatching to the "main queue" should be good enough (if needed at all). You can use dispatch "sources" to manage your connections, etc. I can't personally speak to how dispatch sources scale up to the C10K/C100K/etc. level, but they've seemed pretty lightweight and low-overhead in my experience. I also suspect that using GCD like this would be the most idiomatic way to write a server application in Swift. I've written up a small example of a GCD-based TCP echo server as part of another answer here.
If you were bound and determined to use both RunLoop and libevent in the same application, it would, as you guessed, be best to give libevent its own separate thread, but I don't think it's as complex as you might think. You should be able to dispatch_async from libevent callbacks freely, and similarly marshal replies from GCD-managed threads to libevent using libevent's multi-threading mechanisms fairly easily (i.e. either by running with locking on, or by marshaling your calls into libevent as events themselves). Similarly, third-party libraries using GCD should not be an issue even if you chose to use libevent's loop structure. GCD manages its own thread pools and would have no way of stepping on libevent's main loop, etc.
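The shape of that arrangement can be sketched in a language-neutral way. The following is only an analogy written in Python: asyncio stands in for libevent's loop, a thread pool stands in for GCD's queues, and every name in it is an assumption made for illustration. It shows the loop running on its own thread, a call being marshaled into the loop as an event, and the loop's callback handing work back out to a pool thread.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


def start_loop_thread():
    """Run an event loop on its own dedicated thread (the 'libevent thread')."""
    loop = asyncio.new_event_loop()
    thread = threading.Thread(target=loop.run_forever, daemon=True)
    thread.start()
    return loop, thread


def demo():
    loop, thread = start_loop_thread()
    pool = ThreadPoolExecutor(max_workers=2)  # stands in for GCD's worker threads
    done = threading.Event()
    seen = []

    def on_worker(data):
        # Runs on a pool thread, like a block passed to dispatch_async.
        seen.append(("worker", data))
        done.set()

    def on_event(data):
        # Runs on the loop thread, like a libevent callback.
        seen.append(("loop", data))
        pool.submit(on_worker, data.upper())

    # Marshal a call into the loop from another thread as an event.
    loop.call_soon_threadsafe(on_event, "ping")
    done.wait(timeout=5)

    # Stop the loop from outside, again by marshaling the call into it.
    loop.call_soon_threadsafe(loop.stop)
    thread.join(timeout=5)
    pool.shutdown()
    loop.close()
    return seen


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that neither side ever touches the other's internals: the only crossing points are the thread-safe "post to the loop" call and the "submit to the pool" call.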
You might also consider architecting your application such that it didn't matter what concurrency and connection handling library you used. Then you could swap out libevent, GCD, CFStreams, etc. (or mix and match) depending on what worked best for a given situation or deployment. Choosing a concurrency approach is important, but ideally you wouldn't couple yourself to it so tightly that you couldn't switch if circumstances called for it.
When you have such an architecture, I'm generally a fan of the approach of using the highest level abstraction that gets the job done, and only driving down to lower level abstractions when specific circumstances require it. In this case, that would probably mean using CFStreams and RunLoops to start, and switching out to "bare" GCD or libevent later, if you hit a wall and also determined (through empirical measurement) that it was the transport layer and not the application layer that was the limiting factor. Very few non-trivial applications actually get to the C10K problem in the transport layer; things tend to have to scale "out" at the application layer first, at least for apps more complicated than basic message passing.

Scala system process hangs

I have an actor that uses ProcessBuilder to execute an external process:
def act {
  while (true) {
    receive {
      case param: String => {
        val filePaths = Seq("/tmp/file1","/tmp/file2")
        val fileList = new ByteArrayInputStream(filePaths.mkString("\n").getBytes())
        val output = s"myExecutable.sh ${param}" #< fileList !!<
        doSomethingWith(output)
      }
    }
  }
}
I run hundreds of these actors in parallel. Sometimes, for an unknown reason, the execution of the process (!!) never returns. It hangs forever, and that specific actor cannot handle new messages. Is there any way to set up a timeout for this process to return, and to retry if it is exceeded?
What could be the reason for these executions to hang forever? These commands are not supposed to last more than a few milliseconds.
Edit 1:
Two important facts that I observed:
This problem does not occur on Mac OS X, only on Linux
When I don't use ByteArrayInputStream as input for the execution, the program does not hang
I have an actor that uses ProcessBuilder to execute an external process: ... I run hundreds of these actors in parallel ...
That's some very heavy processing happening in parallel just to achieve a few millisecs of work in each case. Concurrent processing mechanisms rank as follows (from worst to best in terms of resource-usage, scalability and performance):
process = heavy-weight
thread = medium-weight (dozens of threads can execute within a single process space)
actor = light-weight (dozens of actors can execute by leveraging a single shared thread or multiple shared threads)
Concurrently spawning many processes takes significant operating system resources - for process creation and termination. In extreme cases, the O/S overhead to start & end processes could consume hundreds or thousands more CPU and memory resources than the actual job execution. That's why the thread-model was created (and the more efficient actor model). Think of your current processing as doing 'CGI-like' non-scalable O/S-stressing-processing from within your extremely-scalable actors - that's an anti-pattern. It doesn't take much to stress some operating systems to the point of breakage: this could be happening.
Also, if the files being read are very large, it would be best for scalability and reliability to limit the number of processes that concurrently read files on the same disk. It might be OK for up to 10 processes to read concurrently; I doubt it would be OK for 100.
How should an Actor invoke an external program?
Of course, if you converted your logic in myExecutable.sh into Scala, you would not need to create processes at all. Achieving scalability, performance and reliability would be more straightforward.
Assuming this is not possible/desirable, you should limit the total number of processes created and you should reuse them across different Actors / requests over time.
First solution option: (1) create a pool of processes that are reused (say size 10) (2) create actors (say 100) that communicate to/from the processes via ProcessIO (3) if all processes are busy with processing, then it is OK/appropriate that Actors block until one becomes available. The issue with this option: complexity; the 100 actors must do work to interact with the process pool and the actors themselves add little value when the processes are the bottle-neck.
Better solution option: (1) create a limited number of actors (say 10) (2) have each actor create 1 private long-running process (i.e. no pool as such) (3) have each actor communicate to/from via ProcessIO, blocking if the process is busy. Issue: still not as simple as possible; actors interact poorly with blocking processes.
Best solution option: (1) no actors; a simple for-loop from your main thread will achieve the same benefits as actors (2) create a limited number of processes (10) (3) via the for-loop, interact sequentially with each process using ProcessIO (if busy, block or skip to the next iteration)
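A rough sketch of that third option, written in Python rather than Scala purely for illustration: a small, fixed set of long-running child processes is started once, and a plain for-loop feeds jobs to them in turn over stdin/stdout. The child here is a stand-in that upper-cases each input line (myExecutable.sh is not available, so its behaviour is an assumption), and the names start_pool and run_jobs are hypothetical.

```python
import subprocess
import sys

# Stand-in for the external program: reads one line, answers with one line.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)


def start_pool(size):
    """Start a fixed number of long-running child processes."""
    return [
        subprocess.Popen(
            [sys.executable, "-u", "-c", CHILD_CODE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes on the parent side
        )
        for _ in range(size)
    ]


def run_jobs(jobs, pool_size=3):
    """Send each single-line job to a child in round-robin order, blocking
    until that child answers before moving on to the next job."""
    procs = start_pool(pool_size)
    results = []
    try:
        for i, job in enumerate(jobs):
            proc = procs[i % pool_size]
            proc.stdin.write(job + "\n")
            proc.stdin.flush()
            results.append(proc.stdout.readline().rstrip("\n"))
    finally:
        for proc in procs:
            proc.stdin.close()  # end of input lets the child's loop finish
            proc.wait()
            proc.stdout.close()
    return results


if __name__ == "__main__":
    print(run_jobs(["/tmp/file1", "/tmp/file2", "/tmp/file3"], pool_size=2))
```

However many jobs arrive, the operating system only ever sees pool_size child processes, which is the property the answer is after.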
Is there any way to setup a timeout for this process to return, and if it exceeds retry?
Indeed there is. One of the most powerful features of actors is the ability for some actors to spawn other actors and to act as supervisor of them (receiving failure or timeout messages, from which they can recover/restart). With 'native scala actors' this is done via rudimentary programming, generating your own checks and timeout messages. But I won't cover that because the Akka approaches are more powerful and simpler. Plus the next major Scala release (2.11) will use Akka as the supported actor model, with 'native scala actors' deprecated.
Here's an example Akka supervising actor with programmatic timeout/restart (not compiled/tested). Of course, this is not useful if you go with the 3rd solution option:
import akka.actor.{Actor, OneForOneStrategy, Props}
import akka.actor.SupervisorStrategy.{Escalate, Restart, Resume, Stop}
import scala.concurrent.duration._
import scala.collection.immutable.Set

// Worker, WorkRequest, WorkResponse and WorkTimeout are assumed to be defined elsewhere
class Supervisor extends Actor {
  import context.dispatcher // execution context needed by the scheduler

  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: ArithmeticException      => Resume   // the failing child keeps its state and carries on
      case _: NullPointerException     => Restart  // the failing child is restarted
      case _: IllegalArgumentException => Stop     // the failing child is stopped
      case _: Exception                => Escalate // the failure is passed up to this actor's own supervisor
    }

  val worker = context.actorOf(Props[Worker]) // creates a supervised child actor
  var pendingRequests = Set.empty[WorkRequest]

  def receive = {
    case req @ WorkRequest(sender, jobReq) =>
      pendingRequests = pendingRequests + req
      worker ! req
      context.system.scheduler.scheduleOnce(10.seconds, self, WorkTimeout(req))
    case resp @ WorkResponse(req @ WorkRequest(sender, jobReq), jobResp) =>
      pendingRequests = pendingRequests - req
      sender ! resp
    case WorkTimeout(req) =>
      if (pendingRequests.contains(req)) {
        // restart the unresponsive worker
        // (pseudocode kept from the original: ActorRef has no restart method)
        worker restart
        // resend all pending requests
        pendingRequests foreach { worker ! _ }
      }
  }
}
A word of caution: this approach to actor supervision will not overcome poor architecture & design. If you start with suitable process/thread/actor design to meet your requirements, then supervision will promote reliability. But if you start with poor design, then there's a risk that using 'brute-force' recovery from O/S-level failures could exacerbate your problems - making process reliability worse or even causing the machine to crash.
I don't have enough info to reproduce the issue, so I can't diagnose it exactly, but here's how I'd go about diagnosing it if I were in your shoes. The basic approach is a differential diagnosis - identify possible causes, and tests that would prove or rule them out.
The first thing I'd do is to validate that the myExecutable.sh process spawned by the application is actually terminating.
If the process isn't terminating, then this is part of the problem, so we need to understand why. One thing we could do is to run something other than myExecutable.sh. You suggested that ByteArrayInputStream may be part of the problem, which suggests that myExecutable.sh is getting bad input on stdin. If that's the case, then you could instead run a script that simply logs its input to a file, which would show this. If the input is invalid, then ByteArrayInputStream is providing bad data for some reason - thread safety and unicode are the obvious culprits, but looking at the actual bad data should give you a clue. If the input is valid, then it's a bug in myExecutable.sh.
If the process is terminating, then the problem is somewhere else. My first guesses would be that it's either related to actor scheduling (actor libraries typically use ForkJoin for execution, which is great, but doesn't deal well with blocking code), or a bug in the scala.sys.process library (wouldn't be unprecedented - I had to drop scala.sys.process from a project I was working on because of a memory leak).
Looking at the stack trace for a hung thread should give you some clues (VisualVM is your friend), as you should be able to see what's waiting. You can then find the relevant code in the OpenJDK or Scala standard library source code. Where you go from there depends on what you find.
Can you not fire off this process and its handling in a future and use a timed wait against it?
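The timed-wait-and-retry idea can be sketched as follows, here in Python's subprocess module rather than Scala, purely as an illustration. The function name run_with_timeout, its default values and the stand-in command are all assumptions; subprocess.run kills the child itself when the timeout expires, so a hung run does not leave a stray process behind before the retry.

```python
import subprocess
import sys


def run_with_timeout(cmd, stdin_text="", timeout_s=5.0, retries=2):
    """Run cmd with stdin_text on its standard input. If it does not finish
    within timeout_s seconds, kill it and try again, up to `retries` extra times."""
    last_error = None
    for _attempt in range(retries + 1):
        try:
            done = subprocess.run(
                cmd,
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,  # the child is killed when this expires
                check=True,         # a non-zero exit status raises CalledProcessError
            )
            return done.stdout
        except subprocess.TimeoutExpired as exc:
            last_error = exc  # hung run: fall through and retry
    raise TimeoutError(f"no result after {retries + 1} attempts") from last_error


if __name__ == "__main__":
    # Stand-in for the external script: echoes its standard input back.
    echo = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
    print(run_with_timeout(echo, "/tmp/file1\n/tmp/file2"))
```

A retry only hides the hang, so it is still worth logging each timeout to find out why the process stalls.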
I don't think we can figure it out without knowing myExecutable.sh or doSomethingWith.
When it hangs, try killing all the myExecutable.sh processes.
If that helps, you should inspect myExecutable.sh.
If it does not help, you should inspect the doSomethingWith function.

Is there a way to copy files in a non-blocking way in Scala?

I have checked java.nio.file.Files.copy but that blocks a thread until the copy is done. Are there any libraries that allow one to copy a file in a non-blocking way? I need to perform many of these operations simultaneously and cannot afford to have so many threads blocked.
While I could write something myself using non-blocking streams, I would rather use something tried and tested that would guarantee a correct copy every time (or detect if something went wrong).
Check this: Iterate over lines in a file in parallel (Scala)?
val chunkSize = 128 * 1024
val iterator = Source.fromFile(path).getLines.grouped(chunkSize)
iterator.foreach { lines =>
  lines.par.foreach { line => process(line) }
}
This reads (copies) the file in chunks and processes each chunk in parallel; in this case "par" is used.
So it is largely non-blocking in terms of processor (core) usage.
You can follow the same chunking idea with, for example, Akka, Futures or Promises to spread the work even more widely.
You can tune the chunk size depending on your performance characteristics, level of system load, etc.
One more link explains a possible way to read / write data from a (property) file in parallel using Akka Actors. It is not quite what you want, but it may give you an idea.
The idea: you can build your own non-blocking way of reading / copying files.
--
And about your statement "While I could write something myself using non-blocking streams":
I would remind you that each OS / file system (FS) may have its own view of what to block and where. For example, Windows locks a file (at least for writing) while one thread writes to it. On Linux this is configurable. So if you want to stick to something stable, I would suggest thinking it through and going with your own wrapper (over the FS) based on events, chunks and states.
I have used the Process class, issuing an operating system command to copy the file. Of course, one has to check under which OS the application is running, and issue the appropriate command, but this allows for fast and asynchronous copies.
As Marius rightly mentions in the comments, Scala Process blocks, so I run it wrapped in a Future.
Java 8's Process introduces a function isAlive(). A non-blocking alternative would be to use Java 8 processes and use the scheduler to poll at regular intervals to see if the process has finished. However, I did not need to go to this extent.
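Both variants described in this answer can be sketched briefly; the version below is in Python for illustration, and the function names are hypothetical. copy_in_pool hands the copy to a small, bounded thread pool and returns a Future immediately (so only the pool's threads ever block), while copy_with_polling starts the copy in a child process and polls it, the analogue of checking isAlive() at intervals.

```python
import shutil
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def copy_in_pool(pool, src, dst):
    """Submit the copy to a bounded thread pool; returns a Future at once."""
    return pool.submit(shutil.copyfile, src, dst)


def copy_with_polling(src, dst, poll_interval=0.05):
    """Start the copy in a child process and poll it instead of blocking in wait()."""
    child_code = "import shutil, sys; shutil.copyfile(sys.argv[1], sys.argv[2])"
    proc = subprocess.Popen([sys.executable, "-c", child_code, src, dst])
    while proc.poll() is None:     # None means the child is still running
        time.sleep(poll_interval)  # a real scheduler would do other work here
    return proc.returncode


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "source.txt")
        with open(src, "w") as handle:
            handle.write("hello")
        with ThreadPoolExecutor(max_workers=4) as pool:
            future = copy_in_pool(pool, src, os.path.join(tmp, "copy1.txt"))
            print(future.result())  # path of the finished copy
        print(copy_with_polling(src, os.path.join(tmp, "copy2.txt")))
```

Neither variant makes the underlying file I/O asynchronous; they only move the blocking call off the threads that must stay responsive, and the Future's result (or the exit status) is where a failed copy shows up.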
Have you checked out the async stuff in scala-io?
http://jesseeichar.github.io/scala-io-doc/0.4.2/index.html#!/core/async%20read%20write

Can the instructions in a thread change during execution? (OS)

I'm currently researching threads in the context of the operating system and I'm unsure if a thread is a set sequence of instructions that can be repeatedly executed or if it is filled and replaced with new instructions by the user or the operating system.
Thanks a bundle!
-Tom
I'm not quite sure what you mean - the compiled instructions for a program are stored in memory and are not changed at runtime (at least for languages which are not JIT-compiled).
A thread is an entirely separate concept from the code itself. A thread gives you the ability to be running at "two places at once" in the code. At a conceptual level, a thread is simply a container for the context that you need at any point in the execution of some code. This means that each thread has a call stack and a set of registers (which are either actually stored in the registers of a processor if the thread is running, or elsewhere if the thread is paused).
Almost all thread libraries work such that a new thread will execute some user-defined function and will then exit. This function can be long-running, just like main() (which is the function executed by the first thread in your process).
If the threads are supported by the OS (i.e. they are not "green threads"/"fibers"), they will exit by calling an OS API which tells the OS it can deallocate any data it has associated with that thread.
Sometimes, abstractions are built on top of this mechanism such that a thread or pool of threads will execute a function which simply loops over a queue of tasks to run, but the fundamental mechanism is the same. However, these abstractions are provided by user libraries built on top of the OS threading mechanisms, not by the OS itself.
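To make this concrete, here is a small sketch in Python (the names worker_loop and run_tasks are made up for the example). Each thread runs one fixed function, worker_loop, whose code never changes; what varies is the data it pulls from the queue. When that function returns, the thread ends.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to finish


def worker_loop(tasks, results):
    """The one function each thread executes: loop over a queue of tasks."""
    while True:
        task = tasks.get()
        if task is _STOP:
            return  # returning from the function ends the thread
        func, arg = task
        results.put(func(arg))


def run_tasks(func, args, num_threads=2):
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker_loop, args=(tasks, results))
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for arg in args:
        tasks.put((func, arg))
    for _ in threads:
        tasks.put(_STOP)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return [results.get() for _ in args]


def square(x):
    return x * x


if __name__ == "__main__":
    print(sorted(run_tasks(square, [1, 2, 3])))
```

The results are sorted before printing because the order in which the threads finish their tasks is not fixed.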

threads in Dancer

I'm using Dancer 1.31, in a standard configuration (plackup/Starman).
In a request I wanted to call a Perl function asynchronously, so that the request returns immediately. Think of the typical "long running operation" scenario, in which one wants to return a "processing" page with a refresh+redirect.
I (naively?) tried with a thread:
sub myfunc {
    sleep 9; # just for testing a slow operation
}
any '/test1' => sub {
    my $thr = threads->create('myfunc');
    $thr->detach();
    return "done";
};
It does not work: the server seems to freeze, and the error log does not show anything. I guess manual creation of threads is forbidden inside Dancer? Is it an issue with PSGI? What is the recommended way?
I would stay away from perl threads especially in a web server environment. It will most likely crash your server when you join or detach them.
I usually create a few threads (thread pool) BEFORE initializing other modules and keep them around for the entire life time of the application. Thread::Queue nicely provides communication between the workers and the main thread.
The best asynchronous solution I find in Perl is POE. In Linux I prefer using POE::Wheel::Run to run executables and subroutines asynchronously. It uses fork and has a beautiful interface allowing communication with the child process. (In Windows it's not usable due to thread dependency)
Setting up Dancer and POE inside the same application/script may cause problems, and POE's event loop may be blocked. A single worker thread dedicated to POE may come in handy, or I would write another server based on POE and just communicate with the Dancer application via sockets.
Threads are definitely iffy with Perl. It might be possible to write some threaded Dancer code, but to be honest I don't think we ever tried it. And considering that Dancer 1's core uses singleton classes, it might also be very tricky.
As Ogla says, there are other ways to implement asynchronous behavior in Dancer. You say that you are using Starman, which is a forking engine. But there is also Twiggy, which is AnyEvent-based. To see how to leverage it to write asynchronous code, have a gander at Dancer::Plugin::Async.