Perl "Out of memory!" when processing a large batch job - perl

A few others and I are now the happy maintainers of a few legacy batch jobs written in Perl. About 30k lines of code, split across maybe 10-15 Perl files.
We have a lot of long-term fixes for improving how the batch process works, but in the short term, we have to keep the lights on for the various other projects that depend on the output of these batch jobs.
At the core of the main part of these batch jobs is a hash that is loaded up with a bunch of data collected from various data files in a bunch of directories. When these were first written, everything fit nicely into memory - no more than 100MB or so. Things of course grew over the years, and the hash now grows up to what the box can handle (8GB), leaving us with a nice message from Perl:
Out of memory!
This is, of course, a poor design for a batch job, and we have a clear (long-term) roadmap to improve the process.
I have two questions however:
What kind of short-term options can we look at, short of throwing more memory at the machine? Any OS settings that can be tweaked? Perl runtime/compile flags that can be set?
I'd also like to understand WHY Perl crashes with the "Out of memory!" error, as opposed to using the swap space that is available on the machine.
For reference, this is running on a Sun SPARC M3000 running Solaris 10 with 8 cores, 8 GB RAM, 10 GB swap space.
The reason throwing more memory at the machine is not really an ideal solution is mostly because of the hardware it's running on. Buying more memory for these Sun boxes is crazy expensive compared to the x86 world, and we probably won't be keeping these around much longer than another year.
The long-term solution is of course refactoring a lot of the codebase, and moving to Linux on x86.

There aren't really any generally-applicable methods of reducing a program's memory footprint; it takes someone familiar with Perl to scan the code and find something relevant to your specific situation.
You may find that storing your hash as a disk-based database helps. The most general way is to use Tie::Hash::DBD, which lets you use any database that DBI supports, but it won't help with hashes whose values are references, such as nested hashes. (As ThisSuitIsBlackNot has commented, DBM::Deep overcomes even this obstacle.)
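For example, here is a rough sketch of what tying the hash to disk might look like (untested; the database file names are just placeholders, and both modules come from CPAN):

    use strict;
    use warnings;
    use Tie::Hash::DBD;   # flat key/value pairs, backed by any DBI database
    use DBM::Deep;        # handles nested hashes/arrays too

    # Flat hash stored in SQLite instead of RAM
    tie my %flat, 'Tie::Hash::DBD', 'dbi:SQLite:dbname=batch.db';
    $flat{some_key} = 'some value';               # written to disk, not kept in memory

    # Nested data via DBM::Deep
    my $db = DBM::Deep->new('batch.dbm');
    $db->{records}{some_key} = { qty => 42, items => [ 1, 2, 3 ] };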
I presume your Perl code is crashing at startup? If you have a memory leak then it should be simpler to find the cause. Alternatively, it may be obvious to you that the initial population of the hash is wasteful, in that it is storing data that will never be used. If you show that part of your code then I am sure someone will be able to assist.

Try using the 64-bit version of the interpreter. I had the same issue with the "Out of memory" message. In my case, 32-bit Strawberry Perl ate 2 GB of RAM before terminating. The 64-bit interpreter can use a much larger amount: it ate the rest of my 16 GB and then started to swap like hell, but I did get a result.

Related

Scala: performance boost on incremental garbage collection

I have written an application in Scala. Basically, the first step is to create an array of objects and then to initialise these objects from a CSV file. When running the application on the JVM it is really slow, and after some experimenting I found out that using the -J-Xincgc flag, which enables incremental garbage collection, speeds up the application by a factor of 4 (it's 4 times faster with the switch!). I wonder:
Why?
Did I use some inefficient coding, and if so, where should I start to find out what's going on?
Thanks!
I'll assume you're running this on hotspot.
The hotspot JVM has a whole zoo of garbage collectors, most of which also may have some sort of sub-modes or various command-line switches that significantly alter their behavior.
Which GC is used by default varies based on JVM version, operating system and 32/64bit VM.
So you basically changed whatever the default was to a specific algorithm that happened to perform "faster" for your workload.
But "faster" is a fuzzy measure. Wall time is not the same as CPU cycles spent if you consider multi-threading. And some collectors may simply choose to grow the heap more aggressively, thus deferring the cost of collection to a later point in time, which you might not have measured if your program didn't run long enough.
To make an accurate assessment, much more information would be needed:
what GC was used by default
your VM version
how many cores your CPU has
what kind of workload do you have (multi/single-thread, long/short-running, expected memory footprint, object allocation rate)
Oracle's GC tuning guide may prove useful for you.
In your case, -Xincgc translates to CMS in incremental mode, which is intended for single-core environments and has been deprecated as of Java 8. It probably just happened to be better than the default, but it's not necessarily an optimal choice.
If you get into a situation where you are running close to your heap-size limit, you can waste a lot of GC time, which can lead to a lot of false findings about performance. If that's your situation, first increase your heap-size limit before doing anything else. Consider using jvisualvm to eyeball the situation - it's trivially easy to get started with.

Threading vs Forking (with explanation of what I want to do)

So, I've reviewed a ton of articles and forums before posting this, but I keep reading conflicting answers. Firstly, OS is not an issue; I can use either Windows or Unix, whichever would be best for my problem. I have a ton of data that I need to use for read-only purposes (not sure why this would matter, but, in case it does, the data structure that I'm going to have to go through is an array of arrays of arrays of hashes whose values are also arrays). I'm essentially comparing a "query" to a ton of different "sentences" and computing their relative similarities. From these quantities (several million), I want to take the top x% and do something with them. I need to parallelize this process. There's just no good way for me to decrease the space -- I need to compare over everything to get good results, and it will just take too long without some sort of threading/forking.
Any help would be appreciated. Thanks in advance.
EDIT: I don't think the amount of memory usage will be an issue, but I don't know (8 GB RAM)
Without more detail on your problem, there's not much help that can be given. You want to parallelize a process. Threads and forks in Perl have advantages and disadvantages.
One of the key things that makes Perl threads different from other threads is that data is not shared by default. This makes threads much easier and safer to work with: you don't have to worry about the thread safety of libraries or most of your code, just the threaded bit. However, it can be a performance drag and memory hungry, as Perl must put a copy of the interpreter and all loaded modules into each thread.
When it comes to forking I will only be talking about Unix. Perl emulates fork on Windows using threads; it works, but it can be slow and buggy.
Forking Advantages
Very fast to create a fork
Very robust
Forking Disadvantages
Communicating between the processes can be slow and awkward
Thread Advantages
Thread coordination and data interchange is fairly easy
Threads are fairly easy to use
Thread Disadvantages
Each thread takes a lot of memory
Threads can be slow to start
Threads can be buggy (better the more recent your perl)
Database connections are not shared across threads
That last one is a bit of a doozy if the documentation is up to date. If you're going to be doing a lot of SQL, don't use threads.
In general, to get good performance out of Perl threads it's best to start a pool of threads and reuse them. Forks can more easily be created, used and discarded.
Really what it comes down to is what fits your way of thinking and your particular problem.
For either case, you're likely going to want something to manage your pool of workers. For forking you're going to want to use Parallel::ForkManager or Child. Child is particularly nice as it has built-in inter-process communication.
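For example, a minimal Parallel::ForkManager sketch (the work items and the per-item computation are just stand-ins for your real comparison code):

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @work_items = (1 .. 100);              # stand-in for your real work units
    my $pm = Parallel::ForkManager->new(8);   # at most 8 child processes at once

    my %results;
    $pm->run_on_finish(sub {
        my ($pid, $exit_code, $ident, $signal, $core, $data_ref) = @_;
        $results{$ident} = $$data_ref if defined $data_ref;
    });

    for my $i (0 .. $#work_items) {
        $pm->start($i) and next;              # parent keeps looping; child continues below
        my $score = $work_items[$i] ** 2;     # stand-in for the real similarity computation
        $pm->finish(0, \$score);              # child exits, handing its result back
    }
    $pm->wait_all_children;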
For threads you're going to want to use threads::shared, Thread::Queue and read perlthrtut.
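And a corresponding worker-pool sketch using threads and Thread::Queue (the queue items and the per-item work are placeholders; Thread::Queue's end() needs a reasonably recent Perl):

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $queue = Thread::Queue->new;

    # Start a small pool of workers up front and reuse them
    my @workers = map {
        threads->create(sub {
            while (defined(my $item = $queue->dequeue)) {
                my $len = length $item;       # stand-in for the real work
            }
        });
    } 1 .. 4;

    $queue->enqueue($_) for 'sentence one', 'sentence two', 'sentence three';
    $queue->end;                              # lets dequeue return undef so workers exit
    $_->join for @workers;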
When reading articles about Perl threads, keep in mind they were a bit crap when they were introduced in 5.8.0 in 2002, and only serviceable by 5.10.1. After that they've firmed up considerably. Information and opinions about their efficiency and robustness tends to fall rapidly out of date.
Threading can be more difficult to get correct, but won't utilize as much memory.
Forking can be simpler to implement but can use a significant amount of memory.
If you don't have experience with either, I would start by implementing a forking version and go from there.

Looking for a Perl module to store a Hash structure in shared RAM

I'd like to store a data structure persistently in RAM and have it accessible from pre-forked web server processes in Perl.
Ideally I would like it to behave like memcached but without the need for a separate daemon. Any ideas?
Use Cache::FastMmap and all you need is a file. It uses mmap to provide a shared in-memory cache for IPC, which means it is quite fast. See the documentation for possible issues and caveats.
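Roughly like this (untested sketch; the share file path and cache size are just examples):

    use strict;
    use warnings;
    use Cache::FastMmap;

    # Every process that points at the same share_file sees the same cache
    my $cache = Cache::FastMmap->new(
        share_file => '/tmp/myapp.cache',     # example path
        cache_size => '64m',
    );

    $cache->set('config', { refreshed => time, hosts => [ 'a', 'b' ] });
    my $config = $cache->get('config');       # readable from any pre-forked child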
IPC::SharedMem might fit the bill.
Mod_perl shares RAM on systems with properly implemented copy-on-write forking. Load your Perl hash in a BEGIN block of your mod_perl program, and all forked instances of the mod_perl program will share the memory, as long as there are no writes to the pages storing your hash. This doesn't work perfectly (some pages will get written to) but on my servers and data it decreases memory usage by 70-80%.
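A bare-bones sketch of that pattern (the package name and data are made up; your real loading code goes in the BEGIN block):

    # Pulled in once by the parent (e.g. via PerlRequire in httpd.conf) so the
    # populated hash lives in pages the forked children share copy-on-write.
    package MyApp::Data;
    use strict;
    use warnings;

    our %BIG_HASH;

    BEGIN {
        # Stand-in for reading your real data files; keep access read-only
        # afterwards so the pages stay shared.
        %BIG_HASH = ( example_key => 'example value' );
    }

    1;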
Mod_perl also speeds up your server by eliminating the Perl compile time on subsequent web requests. The downside of mod_perl is that you have to program carefully and avoid modifying global variables, since these variables, like your hash, are shared by all the mod_perl instances. It is worthwhile to learn enough Perl so that you don't need to change globals anyway!
The performance gains from mod_perl are fantastic, but mod_perl is not available in many shared hosts. It is easy to screw up, and hard to debug while you are learning it. I only use it when the performance improvements are appreciated enough by my customers to justify my development pain.

Why can't DBMSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the one quoted.
My understanding is that when a DBMS wants to read a block x it issues a common read instruction. There should be no difference from that of any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page from disk it will involve at least one system call. At this point most DBMSs place the page into their own buffer. (They also end up in the OS's buffer, but that's unimportant.)
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
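To make that concrete, here is a toy sketch (not from any real DBMS) of the check-the-cache-first logic:

    use strict;
    use warnings;
    use Fcntl qw(SEEK_SET);

    my $PAGE_SIZE = 8192;
    my %page_cache;                            # page number => page contents, in user space

    sub get_page {
        my ($fh, $page_no) = @_;

        # Cache hit: the page comes straight from our own memory, no system call
        return $page_cache{$page_no} if exists $page_cache{$page_no};

        # Cache miss: one seek and one read system call, then remember the page
        sysseek($fh, $page_no * $PAGE_SIZE, SEEK_SET) or die "seek failed: $!";
        sysread($fh, my $buf, $PAGE_SIZE);
        return $page_cache{$page_no} = $buf;
    }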
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Understand that disk i/o is a layered process. It was in 1981 and is even more so now. At the lowest point, a device driver will issue physical read/write instructions to the hardware. Above that may be the o/s kernel code, then the o/s user-space code, then the application. Between a C program's fread() and the disk heads moving there are at least three or four levels, and there might be considerably more. A DBMS seeking to improve performance might bypass some of those layers and talk directly to the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and what gets cached (a dbms may be able to cache better for its own needs than the more generic device buffer caching can).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and such like) than what a dbms ideally would like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
Often, a DBMS has a really good idea about its upcoming access patterns, and it can't communicate these patterns to the kernel. This can lead to lower performance.
The buffer cache is traditionally stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by relying on the buffer cache a DBMS would be unable to take full advantage of the system's resources.
I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address space for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
However, many DBMSs including INGRES [20] and System R [4] choose to put a DBMS managed buffer pool in user space to reduce overhead. Hence, each of these systems has gone to the trouble of constructing its own buffer pool manager to enhance performance.
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."

How to efficiently process 300+ Files concurrently in scala

I'm going to work on comparing around 300 binary files using Scala, byte by byte, 4MB each. However, judging from what I've already done, processing 15 files at the same time using java.BufferedInputStream took me around 90 sec on my machine, so I don't think my solution would scale well to a large number of files.
Ideas and suggestions are highly appreciated.
EDIT: The actual task is not just to compare the differences but to process those files in the same sequence. Let's say I have to look at the ith byte in every file at the same time, then move on to the (i+1)th.
Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full-speed.
If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
I can't comment on solid-state drives, as I have no first-hand experience with their performance.
You are quite screwed, indeed.
Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:
Read 512 KB of every file, sequentially. You might try reading from 2 to 8 at the same time -- perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!
Process those 512 KB using Futures.
Go back to step 1, unless you are finished with the files.
Get the result back from the processing Futures.
On step number 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as much as you can, maybe a bit less than that, but definitely not more than that.
By not reading all files on step number 1, you use some of the time spent reading these files doing useful CPU work. You may experiment with lowering the bytes read on step 1 as well.
Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.
Of course you may be wanting to do a much deeper comparison than just "are these files the same?"
If you are just looking to see if they are the same I would suggest using a hashing algorithm like SHA1 to see if they match.
Here is some Java source to make that happen.
Many large systems that handle data use SHA1, including the NSA and Git.
It's simply more efficient to use a hash instead of a byte-by-byte compare. The hashes can also be stored for later to see if the data has been altered.
Here is a talk by Linus Torvalds, specifically about Git, where he also mentions why he uses SHA1.
I would suggest using NIO if possible. Introduction To Java NIO and NIO2 seems like a decent guide to using NIO if you are not familiar with it. I would not suggest reading a file and doing a comparison byte by byte, if that is what you are currently doing. You can create a ByteBuffer to read in chunks of data from a file and then do comparisons from that.