I'm writing some Scala code that needs to make use of an external command line program for string translation. The external program takes many minutes to start up, then listens for data on stdin (terminated by newline), converts the data, and prints the converted data to stdout (again terminated by newline). It will remain alive forever until it receives a SIGINT.
For simplicity, let's assume the external command runs like this:
$ convert
input1
output1
input2
output2
$
convert, input1, and input2 were all typed by me; output1 and output2 were written by the program to stdout. I typed Control-C at the end to return to the shell.
In my Scala code, I'd like to start up this external program, and keep it running in the background (because it is costly to startup, but cheap to keep running once it's initialized), while providing three methods to the rest of my program with an API like:
def initTranslation(): Unit
def translate(input: String): String
def stopTranslation(): Unit
initTranslation should start up the external program and keep it running in the background.
translate should put the input argument on the stdin of the external program (followed by newline), wait for output (followed by newline), and then return the output.
stopTranslation should send SIGINT to the external program.
I've worked with Java and Scala external process management before, but I don't have much experience with Java pipes and am not 100% sure how to hook this all up. In particular, I've read that there are subtle gotchas with regard to deadlocks when I/O pipes get hooked up in situations like this. I'm sure I'll need some Thread to start up and watch over the background process in initTranslation, some piping to send a String to stdin followed by blocking to wait for data and a newline on stdout in translate, then some sort of termination of the external program in stopTranslation.
I'd like to achieve this with as much pure Scala as possible, though I realize that this may require some bits of the Java I/O library. I also do not want to use any third party Scala or Java libraries (anything outside java.*, javax.* or scala.*)
What would these three methods look like?
It turns out that this is quite a bit easier than I first expected. I had been misled by various posts and recommendations (off SO) which had suggested that this would be more complex.
Caveats to this solution:
All Java APIs. Yes, I know I mentioned that I'd rather use the Scala standard library, but this is sufficiently succinct that I think it warrants an answer.
Limited error handling - among other things, if the external program explodes and reports errors to stderr, I'm not handling that. Certainly, that could be added on later.
Usage of var for storage of state. Clearly, var is frowned upon in best-practice Scala, but this example illustrates the object state needed, and you can structure the variables in your own programs as you like.
No thread-safety. If you need thread-safety, because multiple threads might call any of the following methods, use some synchronization constructs (like the synchronized keyword in the translate method) to protect yourself.
Solution:
import java.io.BufferedReader
import java.io.InputStreamReader
var process: Process = _
var outputReader: BufferedReader = _
def initTranslation(): Unit = {
  process = new ProcessBuilder("convert").start()
  outputReader = new BufferedReader(new InputStreamReader(process.getInputStream()))
}
def translate(input: String): String = {
  // write the input to the external program's stdin, followed by a newline
  process.getOutputStream.write(input.getBytes)
  process.getOutputStream.write(System.lineSeparator.getBytes)
  process.getOutputStream.flush()
  // block until the program writes a full line to stdout, then return it
  outputReader.readLine()
}
def stopTranslation(): Unit = {
  // destroy() requests termination of the process
  // (on Unix this sends SIGTERM rather than SIGINT)
  process.destroy()
}
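For completeness, a minimal usage sketch (assuming convert is on your PATH):

initTranslation()              // slow: starts convert once in the background
val out1 = translate("input1") // blocks until convert prints "output1"
val out2 = translate("input2")
stopTranslation()              // terminates the background process

If multiple threads may call translate, wrapping its body in this.synchronized { ... } (per the caveats above) is the simplest way to serialize access to the single stdin/stdout pair.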
Related
I have a simple Node application that reads and writes a JSON file.
The following is how the Node application is executed from Scala.
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.sys.process.{Process, ProcessLogger}

case class ExecResult(exitValue: Int, stdout: String, stderr: String)

def execAsync(cmd: String)(implicit ec: ExecutionContext): Future[ExecResult] = {
  val promise = Promise[ExecResult]()
  val stdout = new StringBuilder
  val stderr = new StringBuilder
  // collect each line of output as the process produces it
  val proc = Process(cmd).run(ProcessLogger(line => stdout.append(line).append('\n'), line => stderr.append(line).append('\n')))
  promise.tryCompleteWith(Future(proc.exitValue()).map(c => ExecResult(c, stdout.toString, stderr.toString)))
  promise.future
}
This takes almost 10 times longer than executing the command directly.
What could be the cause of this slowness?
There's a nontrivial amount of overhead in calling out to run an external process (possibly up to and including spawning a shell (e.g. bash) to run that process).
Additionally, depending on how you're measuring this, you may also be capturing the JVM's startup and warmup phase (assuming we're talking about Scala on the JVM).
Especially if the JSON file being read is small, this overhead may swamp the actual time the Node.js application is running.
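To separate the per-process overhead from JVM startup, it helps to time just the call from inside an already-warm JVM. A minimal sketch, reusing execAsync from the question ("node app.js" is a placeholder command):

import scala.concurrent.{Await, ExecutionContext}
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

// measure only the external call, not JVM startup/warmup
val t0 = System.nanoTime()
val result = Await.result(execAsync("node app.js"), 1.minute)
println(s"spawn + run took ${(System.nanoTime() - t0) / 1e6} ms (exit ${result.exitValue})")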
I'm not going to ask why you want to run a Node.js application from within Scala, but if you want to do something like that, I'd suggest looking at GraalVM, which lets you run most Node.js applications natively in the JVM, including calling into them from e.g. Scala without the overhead of spawning an external process. Depending on the use case, GraalVM may actually be faster than the standard V8-based Node implementation.
I have an external program which generates some data I need. Usually, I redirect its output to a file, then read it from my Scala application, e.g.
app.exe > output.data
Now, I want to integrate the process, so I did
val stream = "app.exe".lineStream
stream.foreach { line => doWork(line) }
Unfortunately, I got a "GC overhead limit exceeded" error after a while. This app.exe may generate very large output files, e.g. over 100MB, so I think that during the streaming, Scala has been creating and destroying the line string instances thousands of times, causing the overhead.
I know I can tune the JVM flags to raise the GC overhead threshold, but I am looking for a way that doesn't create so many small line instances.
The problem is probably due to memoization, which is a side effect of foreach-ing over a stream this way. Effectively, you are rooting the whole file in memory.
See lots and lots of info on how to avoid this here: http://blog.dmitryleskov.com/programming/scala/stream-hygiene-i-avoiding-memory-leaks/
Specifically, you are violating rule #1. Try defining your stream as a def, not a val.
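A minimal sketch of that fix, with everything else from the question unchanged (doWork is the function from the question):

import scala.sys.process._

// a def re-creates the Stream on each use, so its head is never held in a
// field and consumed cells can be garbage collected as the iteration advances
def stream = "app.exe".lineStream
stream.foreach(line => doWork(line))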
I have checked java.nio.file.Files.copy but that blocks a thread until the copy is done. Are there any libraries that allow one to copy a file in a non-blocking way? I need to perform many of these operations simultaneously and cannot afford to have so many threads blocked.
While I could write something myself using non-blocking streams, I would rather use something tried and tested that would guarantee a correct copy every time (or detect if something went wrong).
Check this: Iterate over lines in a file in parallel (Scala)?
import scala.io.Source

val chunkSize = 128 * 1024
val iterator = Source.fromFile(path).getLines.grouped(chunkSize)
iterator.foreach { lines =>
  lines.par.foreach { line => process(line) }
}
This reads the file in chunks and processes each chunk's lines in parallel; par is what spreads the work across processors (cores), so it is non-blocking in that scope.
You may follow the same chunking idea with, for example, Akka/Futures/Promises to be non-blocking in an even wider scope.
You may tune the chunk size depending on your performance characteristics, system load, etc.
One more link that explains a possible way to read/write data from a (property) file in parallel using Akka Actors. It is not quite what you want, but it may give you an idea.
The idea: you can build your own non-blocking way of reading/copying files.
--
And about your statement "While I could write something myself using non-blocking streams":
I would remind you that each OS / file system (FS) may have its own view of what to block and when. For example, Windows locks a file (write-locks it, at least) while one thread writes to it; on Linux this is configurable. So if you want something stable, I would suggest thinking it through and going with your own wrapper (over the FS) based on events, chunks, and states.
I have used the Process class, issuing an operating-system command to copy the file. Of course, one has to check which OS the application is running under and issue the appropriate command, but this allows for fast and asynchronous copies.
As Marius rightly mentions in the comments, Scala Process blocks, so I run it wrapped in a Future.
Java 8's Process introduces an isAlive() method. A non-blocking alternative would be to use Java 8 processes and a scheduler to poll at regular intervals to see if the process has finished. However, I did not need to go to this extent.
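As an illustration, a hedged sketch of the Future-wrapped copy (POSIX cp assumed; on Windows you would substitute something like cmd /c copy):

import scala.concurrent.{ExecutionContext, Future}
import scala.sys.process._

// runs the OS copy command on a worker thread; the Future completes
// with the command's exit code once the copy has finished
def copyAsync(src: String, dst: String)(implicit ec: ExecutionContext): Future[Int] =
  Future(Seq("cp", src, dst).!)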
Have you checked out the async stuff in scala-io?
http://jesseeichar.github.io/scala-io-doc/0.4.2/index.html#!/core/async%20read%20write
Is there any way to pause/resume the work of an embedded Python interpreter at a point where I need it to? For example:
C++ pseudo-code part:
main()
{
    script = "python_script.py";
    ...
    RunScript(script);     // -- python script runs till the command 'stop'
    while (true)
    {
        // ... read values from some variables in the python script
        // ... do some work ...
        // ... write new values to some other variables in the python script
        ResumeScript(script);  // -- python script resumes its work where it
                               //    was stopped, not from the beginning!
    }
    ...
}
Python script pseudo-code part:
# ... do some init-work
while true:
    # ... do some work
    stop  # - here the script stops and the C++ function RunScript()
          #   returns control to the C++ part
    # ... after the C++ function ResumeScript is called,
    #     the work continues from this line
Is this possible to do with the Python/C API?
Thanks
I too have recently been searching for a way to manually "drive" an embedded language and I came across this question and figured I'd share a potential workaround.
I would implement the "blocking" behavior through either a socket or some kind of messaging system. Instead of actually stopping the whole Python interpreter, just have it block while it is waiting for C++ to do its evaluations.
C++ will start the embedded runtime, then enter a loop of some sort that waits for Python to signal that it's ready. For instance: C++ listens on port 5000 and starts Python; Python does its work, then connects to port 5000 on localhost; C++ sees the connection, grabs the data from Python, performs work on it, and shuffles the result back over the socket, where Python receives it and leaves its blocking loop.
I still need a way to fully pause the virtual runtime, but in your case you could achieve the same thing with a socket and some blocking behavior that uses the socket to coordinate the two pieces of code.
Good luck :)
EDIT: You may be able to hook the "injection" functionality used in this linked answer to completely stop Python; just modify it to inject a wait-loop, perhaps:
Stopping embedded Python
A reactive task is sometimes seen in the IOI programming competition. Unlike batch tasks, reactive solutions take input from another program as well as sending output to it. The program typically 'queries' the judge program a certain number of times, then outputs a final answer.
An example
The client program accepts lines one by one and simply echoes them back. When it encounters a line with "done", it exits immediately.
The client program in Java looks like this:
import java.util.*;

class Main {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        String s;
        while (!(s = in.nextLine()).equals("done"))
            System.out.println(s);
    }
}
The judge program gives the input and processes output from the client program. In this example, it feeds it a predefined input and checks if the client program has echoed it back correctly.
A session might go like this:
Judge      Client
------------------
Hello
           Hello
World
           World
done
I'm having trouble writing the judge program and having it judge the client program. I'd appreciate it if someone could write a judge program for my example.
You get programs to talk to each other via the command prompt.
On Windows, you'd write:
java judge | java client
This pipes the output of judge to the input of client.
That is to say, as long as judge is writing to the standard output stream (which it will) and client is reading from the standard input stream (which yours is), it will work.
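Note that a one-way pipe only lets the judge write to the client; it cannot read the echoes back to check them. One alternative is to have the judge spawn the client itself and hold both ends of its stdin/stdout. A minimal sketch in Scala, matching the rest of this page (it assumes the client's Main class is compiled and on the classpath):

import java.io.{BufferedReader, InputStreamReader, PrintWriter}

object Judge {
  def main(args: Array[String]): Unit = {
    val proc = new ProcessBuilder("java", "Main").start()
    val toClient = new PrintWriter(proc.getOutputStream, true) // auto-flush on println
    val fromClient = new BufferedReader(new InputStreamReader(proc.getInputStream))

    val tests = Seq("Hello", "World") // the predefined input from the example
    val ok = tests.forall { line =>
      toClient.println(line)          // feed a line to the client's stdin
      fromClient.readLine() == line   // expect it echoed back verbatim
    }
    toClient.println("done")          // tell the client to exit
    proc.waitFor()
    println(if (ok) "ACCEPTED" else "WRONG ANSWER")
  }
}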