Silence or condense IPython parallel exceptions

Is it possible to silence the details of a composite exception containing the errors from IPython parallel workers? I have a large cluster (500+ workers), and if my (bad) code throws an exception on all workers, it takes forever for the exception to be parsed and rendered in the IPython Notebook. I'd like to silence the details of the worker errors and get back one simple, tiny exception with the details from a single worker, since the rest tend to be the same in my usage.
I know I can switch my DirectView to point to a single worker to test my code, but it'd be handy not to manipulate the dview and instead just set a global flag to avoid giant stack traces.

Step 1: ask this question
Step 2: check out this Pull Request
If you just want to see the first exception, you can register a custom exception handler that does exactly that:
from IPython.parallel import error

def only_the_first(self, etype, value, tb, tb_offset=None):
    value.print_traceback(0)

ip = get_ipython()
ip.set_custom_exc((error.CompositeError,), only_the_first)


running Karma in a loop and programmatic access

It's a question more about the architecture of a program that runs karma in a CI pipeline.
I have a set of web components. They use Karma to run their tests (following open-wc.org recommendations). Then I have my custom CI pipeline that allows scheduling a test run for a selected group of components.
When a test run is scheduled, it executes the tests for each component one by one. However, in my logs I am getting messages like
MaxListenersExceededWarning: Possible EventEmitter memory leak
detected. 12 exit listeners added to [process]. Use
emitter.setMaxListeners() to increase limit
or sometimes
listen EADDRINUSE: address already in use 127.0.0.1:9877
which breaks the test (exits the process).
I can't really pinpoint the problem so I am guessing that I am not running the test in a correct way.
On the server I am using the Server class to initialize the server, and then I am calling start on it. When the callback function passed to the Server constructor is called, I assume the server has stopped and I can start over with another component. But clearly that is not the case, judging by the errors I am getting.
So the question is: what would be the right way of running Karma tests in a loop, one by one, using the Node API instead of the CLI?
Update
To be specific about how I am running the tests, I am:
1. Creating the configuration by calling config.parseConfig, where the argument is the component's karma config file
2. Calling new Server(opts, (code) => {}), where opts is the configuration generated in step 1
3. Adding listeners for browser_complete and browser_error to generate a report and store it in the data store
4. Cleaning up (removing the reference to the server) when the constructor callback is called
5. Getting the next component from the queue and going back to step 1
To answer my own question:
I have moved the whole logic of executing a single test into a child process, and after a test finishes, but before the next test is run, I make sure the child process is killed. No more error messages are showing up.

Linux, warning: __get_request: dev 8:0: request aux data allocation failed, iosched may be disturbed

I am playing with test code that submits a BIO from my own kernel module:
If I use submit_bio(&bio), everything works fine.
If I use bdev->bd_queue->make_request_fn(bdev->bd_queue, &bio), then I get the following in dmesg:
__get_request: dev 8:0: request aux data allocation failed, iosched may be disturbed
My primary goal is submitting BIOs to a stackable device driver without calling the submit_bio() routine. Any ideas or pointers?
Our hero Tom Caputi of ZFS encryption fame figured this out.
Basically the scheduler expects an io context in the task struct for the thread that's running your request.
You'll see here that the io context is created in generic_make_request_checks():
https://elixir.bootlin.com/linux/latest/source/block/blk-core.c#L2323
If it is never created for the task struct that's running your request, you'll see that message "io sched may be disturbed." A lousy message if ever there was one. "Scheduler context was not allocated for this task" would have made the problem a bit more obvious.
I'm not the kernel guy that Tom is, but basically, by doing this:
bdev->bd_queue->make_request_fn
your request is being handled by another thread, one that doesn't have that context allocated.
Now create_io_context is not exported so you can't call it directly.
But if you call this:
https://elixir.bootlin.com/linux/latest/source/block/blk-ioc.c#L319
which is exported, then the io context will be allocated and the warning message goes away.
And I imagine there will be some I/O improvement, because the scheduler has a context to work with.

Scala system process hangs

I have an actor that uses ProcessBuilder to execute an external process:
def act {
  while (true) {
    receive {
      case param: String => {
        val filePaths = Seq("/tmp/file1", "/tmp/file2")
        val fileList = new ByteArrayInputStream(filePaths.mkString("\n").getBytes())
        val output = s"myExecutable.sh ${param}" #< fileList !!<
        doSomethingWith(output)
      }
    }
  }
}
I run hundreds of these actors in parallel. Sometimes, for an unknown reason, the execution of the process (!!) never returns. It hangs forever. That specific actor then cannot handle new messages. Is there any way to set up a timeout for this process to return, and to retry if the timeout is exceeded?
What could be the reason for these executions to hang forever? These commands are not supposed to last more than a few milliseconds.
Edit 1:
Two important facts that I observed:
This problem does not occur on Mac OS X, only on Linux
When I don't use ByteArrayInputStream as input for the execution, the program does not hang
I have an actor that uses ProcessBuilder to execute an external process: ... I run hundreds of these actors in parallel ...
That's some very heavy processing happening in parallel just to achieve a few millisecs of work in each case. Concurrent processing mechanisms rank as follows (from worst to best in terms of resource-usage, scalability and performance):
process = heavy-weight
thread = medium-weight (dozens of threads can execute within a single process space)
actor = light-weight (dozens of actors can execute by leveraging a single shared thread or multiple shared threads)
Concurrently spawning many processes takes significant operating system resources - for process creation and termination. In extreme cases, the O/S overhead to start and end processes could consume hundreds or thousands of times more CPU and memory resources than the actual job execution. That's why the thread model was created (and the more efficient actor model). Think of your current processing as doing 'CGI-like', non-scalable, O/S-stressing processing from within your extremely scalable actors - that's an anti-pattern. It doesn't take much to stress some operating systems to the point of breakage: this could be happening.
Also, if the files being read are very large, it would be best for scalability and reliability to limit the number of processes that concurrently read files on the same disk. It might be OK for up to 10 processes to read concurrently, but I doubt it would be OK for 100.
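One simple way to enforce such a cap, sketched here as my own assumption rather than anything from the question, is a shared java.util.concurrent.Semaphore with 10 permits that every actor acquires around its external call:

import java.util.concurrent.Semaphore
import scala.sys.process._

object DiskReadLimiter {
  private val permits = new Semaphore(10) // at most 10 concurrent external readers

  def limited[A](work: => A): A = {
    permits.acquire()   // blocks while 10 calls are already in flight
    try work
    finally permits.release()
  }
}

// hypothetical usage inside an actor, with param and fileList as in the question:
// val output = DiskReadLimiter.limited { (s"myExecutable.sh $param" #< fileList).!! }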
How should an Actor invoke an external program?
Of course, if you converted your logic in myExecutable.sh into Scala, you would not need to create processes at all. Achieving scalability, performance and reliability would be more straightforward.
Assuming this is not possible/desirable, you should limit the total number of processes created and you should reuse them across different Actors / requests over time.
First solution option: (1) create a pool of processes that are reused (say size 10) (2) create actors (say 100) that communicate to/from the processes via ProcessIO (3) if all processes are busy with processing, then it is OK/appropriate that actors block until one becomes available. The issue with this option: complexity; the 100 actors must do work to interact with the process pool, and the actors themselves add little value when the processes are the bottleneck.
Better solution option: (1) create a limited number of actors (say 10) (2) have each actor create one private long-running process (i.e. no pool as such) (3) have each actor communicate to/from its process via ProcessIO, blocking if the process is busy. Issue: still not as simple as possible; actors interact poorly with blocking processes.
Best solution option: (1) no actors; a simple for-loop from your main thread will achieve the same benefits as actors (2) create a limited number of processes (10) (3) via the for-loop, sequentially interact with each process using ProcessIO (if busy, block or skip to the next iteration).
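A minimal sketch of the simplest variant of that last option, reusing the myExecutable.sh invocation and file paths from the question (the params list is a hypothetical stand-in for the incoming work items); it runs one short-lived process at a time from a plain loop instead of hundreds in parallel, whereas wiring up reusable long-running processes via ProcessIO would need more plumbing:

import java.io.ByteArrayInputStream
import scala.sys.process._

object SequentialRunner {
  def main(args: Array[String]): Unit = {
    val params = Seq("a", "b", "c")                 // hypothetical work items
    val filePaths = Seq("/tmp/file1", "/tmp/file2")
    for (param <- params) {                         // plain loop: at most one external process alive at a time
      val fileList = new ByteArrayInputStream(filePaths.mkString("\n").getBytes)
      val output = (s"myExecutable.sh $param" #< fileList).!!
      println(output)                               // stand-in for doSomethingWith(output)
    }
  }
}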
Is there any way to set up a timeout for this process to return, and to retry if the timeout is exceeded?
Indeed there is. One of the most powerful features of actors is the ability for some actors to spawn other actors and to act as supervisor of them (receiving failure or timeout messages, from which they can recover/restart). With 'native scala actors' this is done via rudimentary programming, generating your own checks and timeout messages. But I won't cover that because the Akka approaches are more powerful and simpler. Plus the next major Scala release (2.11) will use Akka as the supported actor model, with 'native scala actors' deprecated.
Here's an example Akka supervising actor with programmatic timeout/restart (not compiled/tested). Of course, this is not useful if you go with the third solution option:
import scala.concurrent.duration._
import scala.collection.immutable.Set
import akka.actor._
import akka.actor.SupervisorStrategy._

case class WorkRequest(requester: ActorRef, jobReq: String)
case class WorkResponse(req: WorkRequest, jobResp: String)
case class WorkTimeout(req: WorkRequest)

class Supervisor extends Actor {
  import context.dispatcher // execution context for the scheduler

  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: ArithmeticException      => Resume   // resumes (reuses) the failing child actor
      case _: NullPointerException     => Restart  // restarts the failing child actor
      case _: IllegalArgumentException => Stop     // terminates the failing child actor
      case _: Exception                => Escalate // passes the exception up to this actor's supervisor
    }

  var worker = context.actorOf(Props[Worker]) // creates a supervised child actor (Worker defined elsewhere)
  var pendingRequests = Set.empty[WorkRequest]

  def receive = {
    case req: WorkRequest =>
      pendingRequests = pendingRequests + req
      worker ! req
      context.system.scheduler.scheduleOnce(10.seconds, self, WorkTimeout(req))
    case resp @ WorkResponse(req, _) =>
      pendingRequests = pendingRequests - req
      req.requester ! resp
    case WorkTimeout(req) =>
      if (pendingRequests contains req) {
        // replace the unresponsive worker and resend all pending requests
        context.stop(worker)
        worker = context.actorOf(Props[Worker])
        pendingRequests foreach { worker ! _ }
      }
  }
}
A word of caution: this approach to actor supervision will not overcome poor architecture & design. If you start with suitable process/thread/actor design to meet your requirements, then supervision will promote reliability. But if you start with poor design, then there's a risk that using 'brute-force' recovery from O/S-level failures could exacerbate your problems - making process reliability worse or even causing the machine to crash.
I don't have enough info to reproduce the issue, so I can't diagnose it exactly, but here's how I'd go about diagnosing it if I were in your shoes. The basic approach is a differential diagnosis - identify possible causes, and tests that would prove or rule them out.
The first thing I'd do is to validate that the myExecutable.sh process spawned by the application is actually terminating.
If the process isn't terminating, then this is part of the problem, so we need to understand why. One thing we could do is to run something other than myExecutable.sh. You suggested that ByteArrayInputStream may be part of the problem, which suggests that myExecutable.sh is getting bad input on stdin. If that's the case, then you could instead run a script that simply logs its input to a file, which would show this. If the input is invalid, then ByteArrayInputStream is providing bad data for some reason - thread safety and unicode are the obvious culprits, but looking at the actual bad data should give you a clue. If the input is valid, then it's a bug in myExecutable.sh.
If the process is terminating, then the problem is somewhere else. My first guesses would be that it's either related to actor scheduling (actor libraries typically use ForkJoin for execution, which is great, but doesn't deal well with blocking code), or a bug in the scala.sys.process library (wouldn't be unprecedented - I had to drop scala.sys.process from a project I was working on because of a memory leak).
Looking at the stack trace for a hung thread should give you some clues (VisualVM is your friend), as you should be able to see what's waiting. You can then find the relevant code in the OpenJDK or Scala standard library source code. Where you go from there depends on what you find.
Can you not fire off this process and its handling in a future and use a timed wait against it?
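A minimal sketch of that idea, again assuming the myExecutable.sh call from the question; the hypothetical runWithTimeout helper runs the blocking !! call inside a scala.concurrent.Future and waits at most five seconds for it. Note that on timeout the spawned process itself keeps running and would still need to be killed explicitly:

import java.io.ByteArrayInputStream
import scala.concurrent.{Await, Future, TimeoutException}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.sys.process._

object ProcessTimeout {
  // Hypothetical helper: run the external command off the calling thread and wait at most 5 seconds.
  def runWithTimeout(param: String, filePaths: Seq[String]): Option[String] = {
    val work = Future {
      val fileList = new ByteArrayInputStream(filePaths.mkString("\n").getBytes)
      (s"myExecutable.sh $param" #< fileList).!!
    }
    try Some(Await.result(work, 5.seconds))
    catch {
      case _: TimeoutException => None // caller can retry; the hung external process must still be killed separately
    }
  }
}

A call like ProcessTimeout.runWithTimeout(param, Seq("/tmp/file1", "/tmp/file2")) then returns None on timeout, and the caller can decide whether to retry.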
I don't think we can figure it out without knowing myExecutable.sh or doSomethingWith.
When it hangs, try killing all the myExecutable.sh processes.
If it helps, you should inspect the myExecutable.sh.
If it does not help, you should inspect the doSomethingWith function.

In Scala, does Futures.awaitAll terminate the thread on timeout?

So I'm writing a mini timeout library in Scala; it looks very similar to the code here: How do I get hold of exceptions thrown in a Scala Future?
The function I execute is either going to complete successfully, or block forever, so I need to make sure that on a timeout the executing thread is cancelled.
Thus my question is: On a timeout, does awaitAll terminate the underlying actor, or just let it keep running forever?
One alternative that I'm considering is to use the java Future library to do this as there is an explicit cancel() method one can call.
[Disclaimer - I'm new to Scala actors myself]
As I read it, scala.actors.Futures.awaitAll waits until all the futures in the list are resolved OR until the timeout. It will not Future.cancel, Thread.interrupt, or otherwise attempt to terminate a Future; you get to come back later and wait some more.
The Future.cancel may be suitable, however be aware that your code may need to participate in effecting the cancel operation - it doesn't necessarily come for free. Future.cancel cancels a task that is scheduled, but not yet started. It interrupts a running thread [setting a flag that can be checked]... which may or may not acknowledge the interrupt. Review Thread.interrupt and Thread.isInterrupted(). Your long-running task would normally check to see if it's being interrupted (your code), and self-terminate. Various methods (i.e. Thread.sleep, Object.wait and others) respond to the interrupt by throwing InterruptedException. You need to review & understand that mechanism to ensure your code will meet your needs within those constraints. See this.
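To make the cooperation requirement concrete, here is a minimal sketch of my own (not from the question) using the plain java.util.concurrent API from Scala: the task polls the interrupt flag between chunks of work, so that cancel(true) issued after a timed get can actually stop it.

import java.util.concurrent.{Callable, Executors, TimeUnit, TimeoutException}

object CancellableTask {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newSingleThreadExecutor()
    val task = pool.submit(new Callable[String] {
      def call(): String = {
        // Long-running work has to cooperate: check the interrupt flag between chunks of work.
        while (!Thread.currentThread().isInterrupted) {
          // ... do one chunk of work here ...
        }
        "stopped after interrupt"
      }
    })
    try {
      println(task.get(2, TimeUnit.SECONDS)) // timed wait for the result
    } catch {
      case _: TimeoutException =>
        task.cancel(true) // sets the interrupt flag on the worker thread; the loop above notices it and exits
    }
    pool.shutdown()
  }
}

If the body of the loop instead blocked in Thread.sleep or Object.wait, the same cancel(true) would surface as an InterruptedException, which is the other mechanism mentioned above.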

How to halt the invocation of the mapper or reducer

I am trying to run my Hadoop map/reduce job inside Eclipse (not on a node or cluster) to debug my map/reduce logic. I want to be able to put breakpoints on the mapper and the reducer and have Eclipse stop on these breakpoints, but this is not happening and the mapper just gets stuck. I noticed that if I hit suspend and run a couple of times, it will eventually break in the mapper and reducer. I am very new to Eclipse. What am I doing wrong?
I am literally running the word count code at http://wiki.apache.org/hadoop/WordCount and have breakpoints on lines 22 and 35.
Maybe you have disabled breakpoints? The breakpoints will be displayed with a strike-through icon if that is the case.
When not running locally, it is possible that your breakpoints will not be hit, because the tasks run in new, isolated JVMs. However, that does not seem to be the case here, because suspend would not work either in that situation.