Scala hit GC overhead when running large external process - scala

I have an external program which generates some data I need. Usually, I redirect its output to a file, then read it from my Scala application, e.g.
app.exe > output.data
Now, I want to integrate the process, so I did
val stream = "app.exe" lineStream
stream foreach { line => doWork(line) }
Unfortunately, I got a GC overhead exception after a while. This app.exe may generate very large output, e.g. over 100MB. So I think that during the streaming, Scala has been creating and destroying line string instances thousands of times, causing the overhead.
I know I can tune the JVM variables to increase the GC overhead throttling. But I am looking for a way that doesn't need to create a lot of small line instances.

The problem is probably due to memoization, which is a side effect of foreach-ing over a stream this way. Effectively, you are rooting the whole file in memory.
See lots and lots of info on how to avoid this here: http://blog.dmitryleskov.com/programming/scala/stream-hygiene-i-avoiding-memory-leaks/
Specifically, you are violating rule #1. Try defining your stream as a def, not a val.
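Another way to avoid holding on to the stream is to not build one at all and have each line handed to a callback as it arrives. The following is a minimal sketch of that alternative, not the fix described above: processLines is a hypothetical helper name, while Process, ProcessLogger and ! come from scala.sys.process.

```scala
import scala.sys.process._

// Hypothetical helper: runs `cmd` and passes each stdout line to `handle`
// as it is produced, so no Stream of all lines is ever built or memoized.
def processLines(cmd: Seq[String])(handle: String => Unit): Int = {
  // First function receives stdout lines, second receives stderr lines.
  val logger = ProcessLogger(handle, err => Console.err.println(err))
  // `!` blocks until the process exits and returns its exit code.
  Process(cmd) ! logger
}
```

With the names from the question, usage would look like processLines(Seq("app.exe")) { line => doWork(line) }. Each line string still gets allocated, but it becomes garbage as soon as the callback returns.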

Related

How long does a Scala Spark job take to process a million lines in a file?

I have a file called file1 in HDFS that contains paths of several files:
this/is/path1
this/is/path2
this/is/path3
.
.
.
this/is/path1000000
If I get all the lines from this file as a list by executing the following line in Scala,
val lines=Source.fromFile("/my/path/file1.txt").getLines.toList
and if I use a 'for' loop as follows, to process each line of file1 in a separate function that involves some mapping functionality for each line,
for(i<-lines){
val firstLines=sc.hadoopFile(i,classOf[TextInputFormat],classOf[LongWritable],classOf[Text]).flatMap {
case (k, v) => if (k.get == 0) Seq(v.toString) else Seq.empty[String]
}
}
how long will this take to run, given that file1 contains roughly more than a million lines? This scala job has been running on my machine for more than an hour and I would like to know if it has gotten stuck anywhere or is going through an infinite loop, or something like that.
That is a bit of a loaded question. But it shouldn't take long in general. My guess is something has gone wrong. From personal experience, I would guess you don't have enough executors available.
Memory gets a lot of focus with spark, but the number of available executors has given me more fits than memory issues. Especially because you will see behavior like this where it won't error out. It will just stall indefinitely.
That said, that is just a guess with very little knowledge about the job and env. Time to debug on your part and see if you can't find the issue or come back with a more specific problem/question.

Use persistent external program for occasional input / output translation in Scala

I'm writing some Scala code that needs to make use of an external command-line program for string translation. The external program takes many minutes to start up, then listens for data on stdin (terminated by newline), converts the data, and prints the converted data to stdout (again terminated by newline). It will remain alive forever until it receives a SIGINT.
For simplicity, let's assume the external command runs like this:
$ convert
input1
output1
input2
output2
$
convert, input1, and input2 were all typed by me; output1 and output2 were written by the program to stdout. I typed Control-C at the end to return to the shell.
In my Scala code, I'd like to start up this external program, and keep it running in the background (because it is costly to startup, but cheap to keep running once it's initialized), while providing three methods to the rest of my program with an API like:
def initTranslation(): Unit
def translate(input: String): String
def stopTranslation(): Unit
initTranslation should start up the external program and keep it running in the background.
translate should put the input argument on the stdin of the external program (followed by newline), wait for output (followed by newline), and then return the output.
stopTranslation should send SIGINT to the external program.
I've worked with Java and Scala external process management before, but I don't have much experience with Java pipes and am not 100% sure how to hook this all up. In particular, I've read that there are subtle gotchas with regard to deadlocks when I/O pipes get hooked up in situations similar to this. I'm sure I'll need some Thread to start up and watch over the background process in initTranslation, some piping to send a String to stdin followed by blocking to wait for data and a newline on stdout in translate, then some sort of termination of the external program in stopTranslation.
I'd like to achieve this with as much pure Scala as possible, though I realize that this may require some bits of the Java I/O library. I also do not want to use any third party Scala or Java libraries (anything outside java.*, javax.* or scala.*)
What would these three methods look like?
It turns out that this is quite a bit easier than I first expected. I had been misled by various posts and recommendations (off SO) which had suggested that this would be more complex.
Caveats to this solution:
All Java. Yes, I know I mentioned that I'd rather use the Scala standard library, but this is sufficiently succinct that I think it warrants an answer.
Limited error handling - among other things, if the external program explodes and reports errors to stderr, I'm not handling that. Certainly, that could be added on later.
Usage of var for storage of local variables. Clearly, var is frowned upon for best-practice Scala use, but this example illustrates the object state needed, and you can structure your variables in your own programs as you like.
No thread-safety. If you need thread-safety, because multiple threads might call any of the following methods, use some synchronization constructs (like the synchronized keyword in the translate method) to protect yourself.
Solution:
import java.io.BufferedReader
import java.io.InputStreamReader
import java.lang.Process
import java.lang.ProcessBuilder
// handle to the running external program and a line reader on its stdout
var process: Process = _
var outputReader: BufferedReader = _
def initTranslation(): Unit = {
  process = new ProcessBuilder("convert").start()
  outputReader = new BufferedReader(new InputStreamReader(process.getInputStream()))
}
def translate(input: String): String = {
  // write the input line to the external program's stdin
  process.getOutputStream.write(input.getBytes)
  process.getOutputStream.write(System.lineSeparator.getBytes)
  process.getOutputStream.flush()
  // block until the program answers with one line on stdout
  outputReader.readLine()
}
def stopTranslation(): Unit = {
  process.destroy()
}
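For the thread-safety and stderr caveats listed above, a variation might look like the sketch below. It is only an illustration under assumptions: the Translator class and its command parameter are hypothetical names, stderr is simply inherited so an unread error pipe cannot fill up, and translate is serialized with synchronized so that each caller reads the reply to its own line.

```scala
import java.io.{BufferedReader, BufferedWriter, InputStreamReader, OutputStreamWriter}

// Hypothetical wrapper around one long-running external process.
final class Translator(command: String*) {
  private val process = new java.lang.ProcessBuilder(command: _*)
    // Send the child's stderr to our own stderr instead of an unread pipe.
    .redirectError(java.lang.ProcessBuilder.Redirect.INHERIT)
    .start()
  private val in = new BufferedReader(new InputStreamReader(process.getInputStream))
  private val out = new BufferedWriter(new OutputStreamWriter(process.getOutputStream))

  // Only one request/response pair may be in flight at a time.
  def translate(input: String): String = synchronized {
    out.write(input)
    out.newLine()
    out.flush()
    in.readLine() // blocks until the program prints one full line
  }

  // Asks the process to terminate; which signal is used is platform-dependent.
  def stop(): Unit = process.destroy()
}
```

Something like new Translator("convert") would stand in for initTranslation. Note that this only works if the external program flushes its stdout after every line; otherwise readLine() can block indefinitely.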

How to have multiple instances of MATLAB save the same file simultaneously

I am currently writing code to run a series of time-consuming experiments using nodes on a Unix cluster. Each of these experiments takes over 3 days to run on a 12-core machine. When each experiment is done, I am hoping to have it save some data to a common file.
I have a slight issue in that I submit all of my experiments to the cluster at the same time and so they are likely to be saving to the same file at the same time as well.
I am wondering what will happen when multiple instances of MATLAB try to save the same file at the same time (error/crash/nothing). Whatever the outcome, could I work around it using a try/catch loop as follows:
n_tries = 0;
while n_tries < 10
    try
        save('common_file', 'data')
        n_tries = 10;
    catch
        wait_time = 60 * rand;
        pause(wait_time);
        n_tries = n_tries + 1;
    end
end
Don't.
All Matlab functions are explicitly not safe to use in a multi-threading/processing environment.
If you write to one mat-file simultaneously from multiple MATLAB sessions, chances are good that either several variables go missing (because, e.g., two sessions append to the same state of the file) or the whole file gets corrupted.
Save individual files and merge them in a post-processing step.
For such long simulation runs, don't aggregate your data automatically unless you have a reliable framework. There are several reasons:
Out of Memory exceptions or similar while writing can destroy all previous results, this is likely to happen while writing large amounts of data.
Coding errors can destroy previous results. Your code will overwrite at least the most recently added data in case of a collision.
Undetected errors in mex functions, which randomly hit the MATLAB address space instead of causing a segmentation fault, can cause MATLAB to write garbage to your mat-file and destroy previous results.
Use some unique pattern, e.g. pc-name + current date/time
You would be best served by having a single recorder task that does the file output and queue the save information to that task.
Don't forget that the output "file" that you supply to MATLAB only has to be file-like, i.e. support the necessary methods.

Is there a way to copy files in a non-blocking way in Scala?

I have checked java.nio.file.Files.copy but that blocks a thread until the copy is done. Are there any libraries that allow one to copy a file in a non-blocking way? I need to perform many of these operations simultaneously and cannot afford to have so many threads blocked.
While I could write something myself using non-blocking streams, I would rather use something tried and tested that would guarantee a correct copy every time (or detect if something went wrong).
Check this: Iterate over lines in a file in parallel (Scala)?
import scala.io.Source
// number of lines per chunk
val chunkSize = 128 * 1024
val iterator = Source.fromFile(path).getLines().grouped(chunkSize)
iterator.foreach { lines =>
  lines.par.foreach { line => process(line) }
}
This reads (copies) the file in chunks and handles each chunk in parallel. In this case "par" is used.
So it is non-blocking only in the sense that the work is spread across processors (cores).
But you may follow the same chunking idea, for example using Akka/Futures/Promises, to spread the work even more widely.
You may tune your chunk size depending on your performance characteristics, level of system load, etc.
One more link explains a possible way to read / write data from a (property) file in parallel using Akka Actors. This is not quite what you want, but it may give you an idea.
Idea: you may build your own non-blocking way of reading / copying files.
--
And about your statement "While I could write something myself using non-blocking streams":
I would remind you that each OS / file system (FS) may have its own idea about what to block and where. For example, Windows locks a file (a write lock at least) if one thread writes to it. On Linux it is configurable. So if you want to stick to something stable, I would suggest thinking it through and going with your own wrapper (over the FS) based on events, chunks, and states.
I have used the Process class, issuing an operating system command to copy the file. Of course, one has to check under which OS the application is running, and issue the appropriate command, but this allows for fast and asynchronous copies.
As Marius rightly mentions in the comments, Scala Process blocks, so I run it wrapped in a Future.
Java 8 Process introduces a function isAlive(). A non-blocking alternative would be to use Java 8 processes and use the scheduler to poll at regular intervals to see if the process has finished. However, I did not need to go to this extent.
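The "wrap the blocking call in a Future" idea can be sketched without an OS command as well. The version below is an assumption-laden illustration, not the code used above: it wraps java.nio.file.Files.copy instead of a Process, copyAsync and copyPool are hypothetical names, and the pool size of 4 is arbitrary. The copy still blocks a thread, but only one from a small dedicated pool, so the number of blocked threads stays bounded however many copies are queued.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Dedicated pool for blocking copies; at most 4 run at once, the rest queue up.
val copyPool = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

// Returns immediately; the Future completes with the target path,
// or fails with the exception thrown by Files.copy.
def copyAsync(src: Path, dst: Path): Future[Path] =
  Future {
    Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING)
  }(copyPool)
```

A failed copy surfaces as a failed Future, which covers the "detect if something went wrong" requirement. Call copyPool.shutdown() when the application is done so its threads do not keep the JVM alive.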
Have you checked out the async stuff in scala-io?
http://jesseeichar.github.io/scala-io-doc/0.4.2/index.html#!/core/async%20read%20write

hosting simple python scripts in a container to handle concurrency, configuration, caching, etc

My first real-world Python project is to write a simple framework (or re-use/adapt an existing one) which can wrap small python scripts (which are used to gather custom data for a monitoring tool) with a "container" to handle boilerplate tasks like:
fetching a script's configuration from a file (and keeping that info up to date if the file changes and handle decryption of sensitive config data)
running multiple instances of the same script in different threads instead of spinning up a new process for each one
expose an API for caching expensive data and storing persistent state from one script invocation to the next
Today, script authors must handle the issues above, which usually means that most script authors don't handle them correctly, causing bugs and performance problems. In addition to avoiding bugs, we want a solution which lowers the bar to create and maintain scripts, especially given that many script authors may not be trained programmers.
Below are examples of the API I've been thinking of, and which I'm looking to get your feedback about.
A scripter would need to build a single method which takes (as input) the configuration that the script needs to do its job, and either returns a python object or calls a method to stream back data in chunks. Optionally, a scripter could supply methods to handle startup and/or shutdown tasks.
HTTP-fetching script example (in pseudocode, omitting the actual data-fetching details to focus on the container's API):
def run(config, context, cache):
    results = http_library_call(config.url, config.http_method, config.username, config.password, ...)
    return { html : results.html, status_code : results.status, headers : results.response_headers }
def init(config, context, cache):
    config.max_threads = 20 # up to 20 URLs at one time (per process)
    config.max_processes = 3 # launch up to 3 concurrent processes
    config.keepalive = 1200 # keep process alive for 10 mins without another call
    config.process_recycle.requests = 1000 # restart the process every 1000 requests (to avoid leaks)
    config.kill_timeout = 600 # kill the process if any call lasts longer than 10 minutes
Database-data fetching script example might look like this (in pseudocode):
def run(config, context, cache):
    expensive = context.cache["something_expensive"]
    for record in db_library_call(expensive, context.checkpoint, config.connection_string):
        context.log(record, "logDate") # log all properties, optionally specify name of timestamp property
        last_date = record["logDate"]
    context.checkpoint = last_date # persistent checkpoint, used next time through
def init(config, context, cache):
    cache["something_expensive"] = get_expensive_thing()
def shutdown(config, context, cache):
    expensive = cache["something_expensive"]
    expensive.release_me()
Is this API appropriately "pythonic", or are there things I should do to make this more natural to the Python scripter? (I'm more familiar with building C++/C#/Java APIs so I suspect I'm missing useful Python idioms.)
Specific questions:
is it natural to pass a "config" object into a method and ask the callee to set various configuration options? Or is there another preferred way to do this?
when a callee needs to stream data back to its caller, is a method like context.log() (see above) appropriate, or should I be using yield instead? (yield seems natural, but I worry it'd be over the head of most scripters)
My approach requires scripts to define functions with predefined names (e.g. "run", "init", "shutdown"). Is this a good way to do it? If not, what other mechanism would be more natural?
I'm passing the same config, context, cache parameters into every method. Would it be better to use a single "context" parameter instead? Would it be better to use global variables instead?
Finally, are there existing libraries you'd recommend to make this kind of simple "script-running container" easier to write?
Have a look at SQLAlchemy for dealing with database stuff in Python. Also, to make script writing easier when dealing with concurrency, look into Stackless Python.