Executing a Node application from Scala is 10 times slower

I have a simple Node application that reads and writes a JSON file.
The following is how the Node application is executed from Scala.
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.sys.process.{Process, ProcessLogger}

case class ExecResult(exitValue: Int, stdout: String, stderr: String)

def execAsync(cmd: String)(implicit ec: ExecutionContext): Future[ExecResult] = {
  val promise = Promise[ExecResult]()
  // output buffers standing in for the elided ProcessLogger callbacks
  val stdout = new StringBuilder
  val stderr = new StringBuilder
  val proc = Process(cmd).run(ProcessLogger(line => stdout.append(line).append('\n'),
                                            line => stderr.append(line).append('\n')))
  promise.tryCompleteWith(Future(proc.exitValue()).map(c => ExecResult(c, stdout.toString, stderr.toString)))
  promise.future
}
Executing the Node application this way takes almost 10 times longer than running it directly.
What could be the cause of this slowness?

There's a nontrivial amount of overhead in calling out to run an external process (possibly up to and including spawning a shell (e.g. bash) to run that process).
Additionally, depending on how you're measuring this, you may also be capturing the JVM's startup and warmup phase (assuming we're talking about Scala on the JVM).
Especially if the JSON file being read is small, this overhead may swamp the actual time the Node.js application is running.
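One quick check (a rough sketch, not from the original question: the command string and the blocking Await are illustrative) is to time only the external call itself from a JVM that is already running, so JVM startup and warmup are excluded from the measurement:
import scala.concurrent.{Await, ExecutionContext}
import scala.concurrent.duration.Duration

implicit val ec: ExecutionContext = ExecutionContext.global

val t0 = System.nanoTime()
val res = Await.result(execAsync("node app.js"), Duration.Inf) // execAsync as defined in the question
println(f"external call took ${(System.nanoTime() - t0) / 1e6}%.1f ms")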
I'm not going to ask why you want to run a Node.js application from within Scala, but if you want to do something like that, I'd suggest looking at GraalVM, which lets you run most/any Node.js application natively in the JVM, including calling into it from e.g. Scala without the overhead of spawning an external process. Depending on the use case, GraalVM may actually be faster than the standard V8-based Node implementation.
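For illustration, here is a minimal sketch of that embedded route. It assumes a GraalVM JDK (or the org.graalvm.polyglot dependency on the classpath), and the JavaScript snippet is just a stand-in for the real application:
import org.graalvm.polyglot.Context

val ctx = Context.create("js")  // embedded JavaScript engine, no external process spawned
val json = ctx.eval("js", """JSON.stringify({ok: true})""").asString()
println(json)                   // {"ok":true}
ctx.close()
Because nothing is forked, repeated invocations avoid both process-spawn overhead and Node's own startup cost.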

Related

Use persistent external program for occasional input / output translation in Scala

I'm writing some Scala code that needs to make use of an external command line program for string translation. The external program takes many minutes to start up, then listens for data on stdin (terminated by newline), converts the data, and prints the converted data to stdout (again terminated by newline). It will remain alive forever until it receives a SIGINT.
For simplicity, let's assume the external command runs like this:
$ convert
input1
output1
input2
output2
$
convert, input1, and input2 were all typed by me; output1 and output2 were written by the program to stdout. I typed Control-C at the end to return to the shell.
In my Scala code, I'd like to start up this external program, and keep it running in the background (because it is costly to startup, but cheap to keep running once it's initialized), while providing three methods to the rest of my program with an API like:
def initTranslation(): Unit
def translate(input: String): String
def stopTranslation(): Unit
initTranslation should start up the external program and keep it running in the background.
translate should put the input argument on the stdin of the external program (followed by newline), wait for output (followed by newline), and then return the output.
stopTranslation should send SIGINT to the external program.
I've worked with Java and Scala external process management before, but I don't have much experience with Java pipes and am not 100% sure how to hook this all up. In particular, I've read that there are subtle gotchas with regards to deadlocks when I/O pipes get hooked up in situations similar to this. I'm sure I'll need some Thread to start up and watch over the background process in initTranslation, some piping to send a String to stdin followed by blocking to wait for receiving data and a newline on stdout in translate, then some sort of termination of the external program in stopTranslation.
I'd like to achieve this with as much pure Scala as possible, though I realize that this may require some bits of the Java I/O library. I also do not want to use any third party Scala or Java libraries (anything outside java.*, javax.* or scala.*)
What would these three methods look like?
It turns out that this is quite a bit easier than I first expected. I had been misled by various posts and recommendations (off SO) which had suggested that this would be more complex.
Caveats to this solution:
All Java. Yes, I know I mentioned that I'd rather use the Scala standard library, but this is sufficiently succinct that I think it warrants an answer.
Limited error handling - among other things, if the external program explodes and reports errors to stderr, I'm not handling that. Certainly, that could be added on later.
Usage of var for storage of local variables. Clearly, var is frowned upon for best-practice Scala use, but this example illustrates the object state needed, and you can structure your variables in your own programs as you like.
No thread-safety. If you need thread-safety, because multiple threads might call any of the following methods, use some synchronization constructs (like the synchronized keyword in the translate method) to protect yourself.
Solution:
import java.io.BufferedReader
import java.io.InputStreamReader
import java.lang.Process
import java.lang.ProcessBuilder

var process: Process = _
var outputReader: BufferedReader = _

def initTranslation(): Unit = {
  process = new ProcessBuilder("convert").start()
  outputReader = new BufferedReader(new InputStreamReader(process.getInputStream()))
}

def translate(input: String): String = {
  // write the input (followed by a newline) to the external program's stdin
  process.getOutputStream.write(input.getBytes)
  process.getOutputStream.write(System.lineSeparator.getBytes)
  process.getOutputStream.flush()
  // block until the program writes one line of output, and return it
  outputReader.readLine()
}

def stopTranslation(): Unit = {
  process.destroy()
}

Scala hit GC overhead when running large external process

I have an external program which generates some data I need. Usually, I redirect its output to a file, then read it from my Scala application, e.g.
app.exe > output.data
Now, I want to integrate the process, so I did
val stream = "app.exe" lineStream
stream foreach { line => doWork(line) }
Unfortunately, I got a GC overhead limit exceeded error after a while. This app.exe may generate very large output files, e.g. over 100MB. So I think that during the streaming, Scala has been creating/destroying the line string instance thousands of times, causing the overhead.
I know I can tune the JVM options to raise the GC overhead limit. But I am looking for a way that doesn't need to create a lot of small line instances.
The problem is probably due to memoization, which is a side effect of foreach-ing over a stream this way. Effectively, you are rooting the whole file in memory.
See lots and lots of info on how to avoid this here: http://blog.dmitryleskov.com/programming/scala/stream-hygiene-i-avoiding-memory-leaks/
Specifically, you are violating rule #1. Try defining your stream as a def, not a val.
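For illustration, a minimal sketch of that fix (doWork stands in for your own processing, as in the question):
import scala.sys.process._

def stream = Process("app.exe").lineStream // def, not val: the head of the stream is not rooted
stream foreach { line => doWork(line) }
Because the stream is produced by a def, nothing retains its head across the traversal, so lines that have already been processed can be garbage-collected instead of staying memoized for the lifetime of a val.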

Scala system process hangs

I have an actor that uses ProcessBuilder to execute an external process:
def act {
  while (true) {
    receive {
      case param: String => {
        val filePaths = Seq("/tmp/file1", "/tmp/file2")
        val fileList = new ByteArrayInputStream(filePaths.mkString("\n").getBytes())
        val output = s"myExecutable.sh ${param}" #< fileList !!<
        doSomethingWith(output)
      }
    }
  }
}
I run hundreds of these actors in parallel. Sometimes, for an unknown reason, the execution of the process (!!) never returns. It hangs forever. This specific actor then cannot handle new messages. Is there any way to set up a timeout for this process to return, and to retry if it is exceeded?
What could be the reason for these executions to hold forever? Because these commands are not supposed to last more than a few milliseconds.
Edit 1:
Two important facts that I observed:
This problem does not occur on Mac OS X, only on Linux
When I don't use ByteArrayInputStream as input for the execution, the program does not hang
I have an actor that uses ProcessBuilder to execute an external process: ... I run hundreds of these actors in parallel ...
That's some very heavy processing happening in parallel just to achieve a few millisecs of work in each case. Concurrent processing mechanisms rank as follows (from worst to best in terms of resource-usage, scalability and performance):
process = heavy-weight
thread = medium-weight (dozens of threads can execute within a single process space)
actor = light-weight (dozens of actors can execute by leveraging a single shared thread or multiple shared threads)
Concurrently spawning many processes takes significant operating system resources - for process creation and termination. In extreme cases, the O/S overhead to start & end processes could consume hundreds or thousands of times more CPU and memory resources than the actual job execution. That's why the thread-model was created (and the more efficient actor model). Think of your current processing as doing 'CGI-like' non-scalable O/S-stressing-processing from within your extremely-scalable actors - that's an anti-pattern. It doesn't take much to stress some operating systems to the point of breakage: this could be happening.
Also, if the files being read are very large in size, it would be best for scalability and reliability to limit the number of processes that concurrently read files on the same disk. It might be OK for up to 10 processes to read concurrently, I doubt it would be OK for 100.
How should an Actor invoke an external program?
Of course, if you converted your logic in myExecutable.sh into Scala, you would not need to create processes at all. Achieving scalability, performance and reliability would be more straightforward.
Assuming this is not possible/desirable, you should limit the total number of processes created and you should reuse them across different Actors / requests over time.
First solution option: (1) create a pool of processes that are reused (say size 10) (2) create actors (say 100) that communicate to/from the processes via ProcessIO (3) if all processes are busy with processing, then it is OK/appropriate that Actors block until one becomes available. The issue with this option: complexity; the 100 actors must do work to interact with the process pool and the actors themselves add little value when the processes are the bottle-neck.
Better solution option: (1) create a limited number of actors (say 10) (2) have each actor create 1 private long-running process (i.e. no pool as such) (3) have each actor communicate to/from via ProcessIO, blocking if the process is busy. Issue: still not as simple as possible; actors interact poorly with blocking processes.
Best solution option: (1) no actors, a simple for-loop from your main thread will achieve the same benefits as actors (2) create a limited number of processes (10) (3) via for-loop, sequentially interact each process using ProcessIO (if busy - block or skip to next iteration)
Is there any way to set up a timeout for this process to return, and to retry if it is exceeded?
Indeed there is. One of the most powerful features of actors is the ability for some actors to spawn other actors and to act as supervisor of them (receiving failure or timeout messages, from which they can recover/restart). With 'native scala actors' this is done via rudimentary programming, generating your own checks and timeout messages. But I won't cover that because the Akka approaches are more powerful and simpler. Plus the next major Scala release (2.11) will use Akka as the supported actor model, with 'native scala actors' deprecated.
Here's an example Akka supervising actor with programmatic timeout/restart (not compiled/tested). Of course, this is not useful if you go with the 3rd solution option:
import scala.concurrent.duration._
import scala.collection.immutable.Set
import akka.actor._
import akka.actor.SupervisorStrategy._

class Supervisor extends Actor {
  import context.dispatcher // ExecutionContext needed by scheduleOnce

  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 10, withinTimeRange = 1.minute) {
      case _: ArithmeticException      => Resume   // resumes (reuses) the failing child actor
      case _: NullPointerException     => Restart  // restarts the failing child actor
      case _: IllegalArgumentException => Stop     // terminates the failing child actor
      case _: ActorKilledException     => Restart  // so the Kill sent below results in a restart
      case _: Exception                => Escalate // passes anything else up to this actor's supervisor
    }

  val worker = context.actorOf(Props[Worker]) // creates a supervised child actor
  var pendingRequests = Set.empty[WorkRequest]

  def receive = {
    case req: WorkRequest =>
      pendingRequests = pendingRequests + req
      worker ! req
      context.system.scheduler.scheduleOnce(10.seconds, self, WorkTimeout(req))
    case resp @ WorkResponse(req @ WorkRequest(requester, _), _) =>
      pendingRequests = pendingRequests - req
      requester ! resp
    case WorkTimeout(req) =>
      if (pendingRequests contains req) {
        // the worker is unresponsive: kill it so the supervisor strategy restarts it,
        // then resend all pending requests
        worker ! Kill
        pendingRequests foreach { worker ! _ }
      }
  }
}
A word of caution: this approach to actor supervision will not overcome poor architecture & design. If you start with suitable process/thread/actor design to meet your requirements, then supervision will promote reliability. But if you start with poor design, then there's a risk that using 'brute-force' recovery from O/S-level failures could exacerbate your problems - making process reliability worse or even causing the machine to crash.
I don't have enough info to reproduce the issue, so I can't diagnose it exactly, but here's how I'd go about diagnosing it if I were in your shoes. The basic approach is a differential diagnosis - identify possible causes, and tests that would prove or rule them out.
The first thing I'd do is to validate that the myExecutable.sh process spawned by the application is actually terminating.
If the process isn't terminating, then this is part of the problem, so we need to understand why. One thing we could do is to run something other than myExecutable.sh. You suggested that ByteArrayInputStream may be part of the problem, which suggests that myExecutable.sh is getting bad input on stdin. If that's the case, then you could instead run a script that simply logs its input to a file, which would show this. If the input is invalid, then ByteArrayInputStream is providing bad data for some reason - thread safety and unicode are the obvious culprits, but looking at the actual bad data should give you a clue. If the input is valid, then it's a bug in myExecutable.sh.
If the process is terminating, then the problem is somewhere else. My first guesses would be that it's either related to actor scheduling (actor libraries typically use ForkJoin for execution, which is great, but doesn't deal well with blocking code), or a bug in the scala.sys.process library (wouldn't be unprecedented - I had to drop scala.sys.process from a project I was working on because of a memory leak).
Looking at the stack trace for a hung thread should give you some clues (VisualVM is your friend), as you should be able to see what's waiting. You can then find the relevant code in the OpenJDK or Scala standard library source code. Where you go from there depends on what you find.
Can you not fire off this process and its handling in a future and use a timed wait against it?
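A minimal sketch of that idea (the helper name is made up, and note that on timeout the external process itself keeps running and would still need to be killed separately):
import scala.concurrent.{Await, ExecutionContext, Future, TimeoutException}
import scala.concurrent.duration._
import scala.sys.process._

def runWithTimeout(cmd: String, timeout: FiniteDuration)(implicit ec: ExecutionContext): Option[String] =
  try Some(Await.result(Future(cmd.!!), timeout)) // !! blocks until the process exits and returns its stdout
  catch { case _: TimeoutException => None }      // caller can retry or give up
Something like runWithTimeout(s"myExecutable.sh $param", 5.seconds) could then be retried whenever it returns None.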
I don't think we can figure it out without knowing myExecutable.sh or doSomethingWith.
When it hangs, try killing all the myExecutable.sh processes.
If it helps, you should inspect the myExecutable.sh.
If it does not help, you should inspect the doSomethingWith function.

How to run Akka Future using given SecurityManager?

For an open-source multiplayer programming game written in Scala that loads players' bot code via a plug-in system from .jar files, I'd like to prevent the code of the bots from doing harm on the server system by running them under a restrictive SecurityManager implementation.
The current implementation uses a URLClassLoader to extract a control function factory for each bot from its related plug-in .jar file. The factory is then used to instantiate a bot control function instance for each new round of the game. Then, once per simulation cycle, all bot control functions are invoked concurrently to get the bot's responses to their environment. The concurrent invocation is done using Akka's Future.traverse() with an implicitly provided ActorSystem shared by other concurrently operating components (compile service, web server, rendering):
val future = Future.traverse(bots)(bot => Future { bot.respondTo(state) })
val result = Await.result(future, Duration.Inf)
To restrict potentially malicious code contained in the bot plug-ins from running, it appears that following the paths taken in this StackOverflow question and this one I need to have the bot control functions execute in threads running under an appropriately restrictive SecurityManager implementation.
Now the question: how can I get Akka to process the work currently done in Future.traverse() with actors running in threads that have the desired SecurityManager, while the other Actors in the system, e.g. those running the background compile service, continue to run as they do now, i.e. unrestricted?
You can construct an instance of ExecutionContext (e.g. via ExecutionContext.fromExecutorService) that runs all work under the restrictive security manager, and bring it into the implicit scope for Future.apply and Future.traverse.
If the invoked functions do not need to interact with the environment, I don't think you need a separate ActorSystem.
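A rough sketch of that wiring, under the assumption that a single global SecurityManager restricts only threads belonging to a dedicated thread group (the group, the SandboxSecurityManager class, and the pool size are illustrative; a real manager would whitelist whatever permissions the bots legitimately need):
import java.security.Permission
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// all restricted work runs on threads in this group
val sandboxGroup = new ThreadGroup("bot-sandbox")

// global manager that only restricts threads belonging to the sandbox group
class SandboxSecurityManager extends SecurityManager {
  override def checkPermission(perm: Permission): Unit = {
    val g = Thread.currentThread.getThreadGroup
    if (g != null && sandboxGroup.parentOf(g))
      throw new SecurityException(s"denied to bot code: $perm")
  }
}
System.setSecurityManager(new SandboxSecurityManager)

// an ExecutionContext whose threads live in the sandbox group;
// bring it into implicit scope so Future.apply and Future.traverse use it
implicit val sandboxedEc: ExecutionContext = ExecutionContext.fromExecutorService(
  Executors.newFixedThreadPool(4, (r: Runnable) => new Thread(sandboxGroup, r))
)
With sandboxedEc in implicit scope at the Future.traverse call site, the bot control functions run on the restricted threads, while the rest of the ActorSystem keeps using its default, unrestricted dispatcher.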

hosting simple python scripts in a container to handle concurrency, configuration, caching, etc

My first real-world Python project is to write a simple framework (or re-use/adapt an existing one) which can wrap small python scripts (which are used to gather custom data for a monitoring tool) with a "container" to handle boilerplate tasks like:
fetching a script's configuration from a file (and keeping that info up to date if the file changes and handle decryption of sensitive config data)
running multiple instances of the same script in different threads instead of spinning up a new process for each one
expose an API for caching expensive data and storing persistent state from one script invocation to the next
Today, script authors must handle the issues above, which usually means that most script authors don't handle them correctly, causing bugs and performance problems. In addition to avoiding bugs, we want a solution which lowers the bar to create and maintain scripts, especially given that many script authors may not be trained programmers.
Below are examples of the API I've been thinking of, and which I'm looking to get your feedback about.
A scripter would need to build a single method which takes (as input) the configuration that the script needs to do its job, and either returns a python object or calls a method to stream back data in chunks. Optionally, a scripter could supply methods to handle startup and/or shutdown tasks.
HTTP-fetching script example (in pseudocode, omitting the actual data-fetching details to focus on the container's API):
def run(config, context, cache):
    results = http_library_call(config.url, config.http_method, config.username, config.password, ...)
    return {"html": results.html, "status_code": results.status, "headers": results.response_headers}

def init(config, context, cache):
    config.max_threads = 20                  # up to 20 URLs at one time (per process)
    config.max_processes = 3                 # launch up to 3 concurrent processes
    config.keepalive = 1200                  # keep process alive for 20 mins without another call
    config.process_recycle.requests = 1000   # restart the process every 1000 requests (to avoid leaks)
    config.kill_timeout = 600                # kill the process if any call lasts longer than 10 minutes
Database-data fetching script example might look like this (in pseudocode):
def run(config, context, cache):
    expensive = cache["something_expensive"]
    for record in db_library_call(expensive, context.checkpoint, config.connection_string):
        context.log(record, "logDate")       # log all properties, optionally specify name of timestamp property
        last_date = record["logDate"]
        context.checkpoint = last_date       # persistent checkpoint, used next time through

def init(config, context, cache):
    cache["something_expensive"] = get_expensive_thing()

def shutdown(config, context, cache):
    expensive = cache["something_expensive"]
    expensive.release_me()
Is this API appropriately "pythonic", or are there things I should do to make this more natural to the Python scripter? (I'm more familiar with building C++/C#/Java APIs so I suspect I'm missing useful Python idioms.)
Specific questions:
is it natural to pass a "config" object into a method and ask the callee to set various configuration options? Or is there another preferred way to do this?
when a callee needs to stream data back to its caller, is a method like context.log() (see above) appropriate, or should I be using yield instead? (yield seems natural, but I worry it'd be over the head of most scripters)
My approach requires scripts to define functions with predefined names (e.g. "run", "init", "shutdown"). Is this a good way to do it? If not, what other mechanism would be more natural?
I'm passing the same config, context, cache parameters into every method. Would it be better to use a single "context" parameter instead? Would it be better to use global variables instead?
Finally, are there existing libraries you'd recommend to make this kind of simple "script-running container" easier to write?
Have a look at SQLAlchemy for dealing with database stuff in Python. Also, to make script writing easier when dealing with concurrency, look into Stackless Python.