"Out of memory" error for standalone matlab applications - memory fragmentation - matlab

I have to deliver an application as a standalone Matlab executable to a client. The code include a series of calls to a function that internally creates several cell arrays.
My problem is that an out-of-memory error happens when the number of calls to this function increases in response to the increase in the user load. I guess this is low-level memory fragmentation as the workspace variables are independent from the number of loops.
As mentioned here, quitting and restarting Matlab is the only solution for this type of out-of-memory errors at the moment.
My question is that how I can implement such a mechanism in a standalone application to save data, quit and restart itself in the case of out-of-memory error (or when high likelihood of such an error is predicted somehow).
Is there any best practice available?
Thanks.

This is a bit of a tough one. Instead of looking to restart to clear things out, could you change the code to break the work in to chunks to make it more efficient? Fragmentation is mostly proportional to the peak cell-related memory usage and how much the size of data items varies, and less to the total usage over time. If you can break a large piece of work in to smaller pieces done in sequence, this can lower the "high water mark" of your fragmented memory usage. You can also save on memory usage by using "flyweight" data structures that share their backing data values, or sometimes converting to cell-based structures to reference objects or numeric codes. Can you share an example of your code and data structure with us?
In theory, you could get a clean slate by saving your workspace and relevant state out to a mat file and having the executable launch another instance of itself with an option to reload that state and proceed, and then having the original executable exit. But that's going to be pretty ugly in terms of user experience and your ability to debug it.
Another option would be to offload the high-fragmentation code in to another worker process which could be killed and restarted, while the main executable process survives. If you have the Parallel Computation Toolbox, which can now be compiled in to standalone Matlab executables, this would be pretty straightforward: open a worker pool of one or two workers, and run the fraggy code inside them using synchronous calls, periodically killing the workers and bringing up new ones. The workers are independent processes which start out with non-fragmented memory spaces. If you don't have PCT, you could roll your own by compiling your application as two separate apps - the driver app and worker app - and have the main app spin up a worker and control it via IPC, passing your data back and forth as MAT files or bytestreams. That's not going to be a lot of fun to code, though.
Perhaps you could also push some of the fraggy code down in to the Java layer, which handles cell-like data structures more gracefully.
Changing the code to be less fraggy in the first place is probably the simpler and easier approach, and results in a less complicated application design. In my experience it's often possible. If you share some code and data structure details, maybe we can help.

Another option is to periodically check for memory fragmentation with a function like chkmem.
You could integrate this function to be called silently from you code each couple of iterations, or use a timer object to have it called every X minutes...
The idea is to use thse undocumented functions feature memstats and feature dumpmem to get the largest free memory blocks available in addition to the largest variables currently allocated. Using that you could make a guess if there is a sign of memory fragmentation.
When detected, you would warn the user and instruct them you how to save their current session (export to MAT-file), restart the app, and restore the session upon restart.

Related

Writing to hard disk from contiguous physical memory

I have an ARM based device, running linux, which is connected to a camera, and I'm trying to store captured frames to HD efficiently.
I'm developing in user space, but can modify drivers at will
I'm coding in C
Frames which are written into memory using DMA, and I have their physical memory pointer.
I am able to control all the frame capturing flow, and I can tell when the frame buffers are stable (dqueued from the video4linux driver)
Linux version is 3.0.35
I'm familiar with kernel source code, not an expert, but I'm able to find my way in it and figure out things, as long as I get some hints...
I believe I have 2 alternatives:
Find the optimal configuration for my filesystem, for opening the file and writing into it. I'm now using ext4, and standard fopen() fwrite() functions. I understand I can also use mmap, or add O_DIRECT flag when calling open(), but didn't try it yet.
Find a way to pass the physical address of the buffer (I can get it
from my Video4Linux driver) directly to the filesystem/hard drive driver,
so the data will be transfered directly from there.
I found method 1 to be slow, having memory transactions as my bottleneck, since fwrite involves copying data from userspace to kernel space, and then again into some sort of cache, and then on to DMA. Too many memory transactions for a simple store...
Regarding method 2 - I don't know if that's possible, but if I was the one designing this system from scratch, this is what I would do.
Any thoughts?
Regarding method 1 (using open() and write(), mmap() and/or O_DIRECT)
can you recommend an optimal settings for my purpose?
Is method 2 (storing to HD directly from an existing DMA buffer) possible? If so - can you point me to an example?
the only problem with writing into a file via mmap on UNIXs, is that you either have to deal with signals in case of out-of-disk-space
or you have make certain that the file is not sparse
and thus all needed disk space is already allocated.
I think an uptodate G++ provides a method of converting signals into C++ exception handling,
but I'm not certain how supported this is on other systems than mac-os.

Does prefetching a write ever affect single core performance?

Some architectures have a "prefetch write" instruction to indicate to the CPU that you're going to be writing to a memory location before you actually do it. I understand that on a multicore machine this can be used by the core as a hint that it should try to get ownership of the given cache line now so that it can write to the location more quickly later. However, AFAICT that should only matter in situations when there are two cores potentially contending for the cache line. For a cache line that's only being read and written by a single core, does a prefetch write ever have any use?
All else being equal, Prefetch-Write has no benefit over Prefetch-Read for lines accessed only by a single core. After any kind of prefetch, the core will own the line in the Exclusive state. On a subsequent write, the line changes to the Modified state. An Exclusive-to-Modified transition is free since by definition no other core has the line. The E->M state change completes locally without snooping.
Beware that cores have their own hardware prefetch logic. An access to a line may cause the core to grab adjacent line(s) automatically. If global variables or other data reside nearby, an SMP system may experience a lot of unexpected cross-snooping.
I'd think that it could help if the cache line isn't in memory and the write prefetch flags that it will be needed some cycles from now. Housekeeping chores such as freeing up a line for the write could be more out of the way. Surely this should allow the CPU to complete the write faster than if it simply unloaded a write out of the blue into the lap of the cache?
Or have I missed something fundamental?

Why can't DMBSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that, "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the quoted.
My understanding is that when a DBMS wants to read a block x it issues a common read instruction. There should be no difference from that of any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page from disk it will involve at least one system call. At his point most DBMSs place the page into their own buffer. (They also end up in the OS' buffer, but that's unimportant).
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Firstly, understand that disk i/o is a layered process. It was in 1981 and is even more so now. At the lowest point, a device driver will issue physical read/write instructions to the hardware. Above that may be the o/s kernel code then the o/s user space code then the application. Between a C program's fread() and the disk heads moving, there are at least three or four levels and might be considerably more. The DBMS may seek to improve performance might seek to bypass some layers and talk directly with the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and what gets cached (a dbms may be able to peform better cache for its needs than the more generic device buffer caching).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and such like) than what a dbms ideally would like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
Often, DBMS's have a really good idea about its upcoming access patterns, and it can't communicate these patterns to the kernel. This can lead to lower performance.
The buffer cache is traditional stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by using the buffer cache a DBMS would be unable to take advantage of system resources.
I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address spaces for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
However, many DBMSs including INGRES
[20] and System R [4] choose to put a
DBMS managed buffer pool in user space
to reduce overhead. Hence, each of
these systems has gone to the
trouble of constructing its own
buffer pool manager to enhance
performance.
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."

Reasons for & against a Database

i had a discussion with a coworker about the architecture of a program i'm writing and i'd like some more opinions.
The Situation:
The Program should update at near-realtime (+/- 1 Minute).
It involves the movement of objects on a coordinate system.
There are some events that occur at regular intervals (i.e. creation of the objects).
Movements can change at any time through user input.
My solution was:
Build a server that runs continously and stores the data internally.
The server dumps a state-of-the-program at regular intervals to protect against powerfailures and/or crashes.
He argued that the program requires a Database and i should use cronjobs to update the data. I can store movement information by storing startpoint, endpoint and speed and update the position in the cronjob (and calculate collisions with other objects there) by calculating direction and speed.
His reasons:
Requires more CPU & Memory because it runs constantly.
Powerfailures/Crashes might destroy data.
Databases are faster.
My reasons against this are mostly:
Not very precise as events can only occur at full minutes (wouldn't be that bad though).
Requires (possibly costly) transformation of data on every run from relational data to objects.
RDBMS are a general solution for a specialized problem so a specialized solution should be more efficient.
Powerfailures (or other crashes) can leave the Data in an undefined state with only partially updated data unless (possibly costly) precautions (like transactions) are taken.
What are your opinions about that?
Which arguments can you add for any side?
Databases are not faster. How silly... How can a database be faster than writing a custom data structure and storing it in memory ?? Databases are Generalized tools to persist data to disk for you so you don't have to write all the code to do that yourself. Because they have to address the needs of numerous disparate (and sometimes inconsistent) business functions (Persistency (Durability), Transactional integrity, caching, relational integrity, atomicity, etc. etc. ) and do it in a way that protects the application developer from having to worry about it so much, by definition it is going to be slower. That doesn't necessarilly mean his conclusion is wrong however.
Each of his other objections can be addressed by writing the code to address that issue yourself... But you see where that is going... At some point, the development efforts of writing the custom code to address the issues that are important for your application outweigh the performance hit of just using a database - which already does all that stuff out of the box... How many of these issues are important ? and do you know how to write the code necessary to address them ?
From what you've described here, I'd say your solution does seem to be the better option. You say it runs once a minute, but how long does it take to run? If only a few seconds, then the transformation to relational data would likely be inconsequential, as would any other overhead. most of this would take likely 30 seconds. This is assuming, again, that the program is quite small.
However, if it is larger, and assuming that it will get larger, doing a straight dump is a better method. You might not want to do a full dump every run, but that's up to you, just remember that it could wind up taking a lot of space (same goes if you're using a database).
If you're going to dump the state, you would need to have some sort of a redundancy system in place, along with quasi-transactions. You would want to store several copies, in case something happens to the newest version. Say, the power goes out while you're storing, and you have no backups beyond this half-written one. Transactions, you would need something to tell that the file has been fully written, so if something does go wrong, you can always tell what the most recent successful save was.
Oh, and for his argument of it running constantly: if you have it set to a cronjob, or even a self-enclosed sleep statement or similar, it doesn't use any CPU time when it's not running, the same amount that it would if you're using an RDBMS.
If you're writing straight to disk, then this will be the faster method over a database, and faster retrieval, since, as you pointed out, there is no overhead.
Summary: A database is a good idea if you have a lot of idle processor time or historical records, but if resources are a legitimate concern, then it can become too much overhead and a dump with precautions taken is better.
mySQL can now model spatial data.
http://dev.mysql.com/doc/refman/4.1/en/gis-introduction.html
http://dev.mysql.com/doc/refman/5.1/en/spatial-extensions.html
You could use the database to keep track of world locations, user locations, items locations ect.

Is it a good idea to warm up a cache in the BEGIN block in Perl?

Is it a good idea to warm up cache in the BEGIN block, when it gets used?
You didn't really provide any information on what kind of environment you're talking about, which I think is important. In most cases the answer is probably "no", but I can think of one case where it's a definite yes, which is preforking servers -- web applications and the like. In that case, any work that you can do "before the fork" not only saves the cost of having the children recompute the same values individually, it alo saves memory, since the pages containing the results can be shared across all of the child processes by the OS's COW mechanism.
If you're talking about a module you're writing and not an application, then I'd say no, don't lift things to compilation time without the user's permission unless they're things that have to be done for the module to work. Instead, provide a preheat_cache class method, and if your caller has a reason to need a hot cache at compile time they can put the call into a BEGIN block themselves. You could also use a :preheat_cache import tag but that's unnecessarily fancy in my book.
If it's a choice between preloading your cache at compile time, or preloading your cache as the first thing you do at run time, there's virtually no difference.
If your cache is large enough that loading it will trigger a lot of page swaps, that's an argument for waiting until run time. That way, all your module loading and other compile time code can be done while your system is under a lighter load.
I'm going to go with "no", even though I could be wrong. Reasoning goes like this: keep the code, and data it uses, small, so that it takes up less space in any caches (I am presuming you mean CPU cache, not programmatic hashes with common query results or some such thing).
Unless you see some sort of bad access pattern, trying to second guess what needs to be prefetched is probably useless at best. In fact such code or initialization data is likely to displace something you (or another process on the system) were actually using. Think about what you can do in the actual work part of the code to maximize locality of reference, to try to stay within smaller memory regions at any one time.
I used to use "top" to detect when processes were swapping between memory and disk. I don't know of any good tools yet to tell how often a process is getting cache misses and going to plain old slow mo'board memory. There must be such tools, I just don't know what they are yet (software tools, rather than some custom In Circuit Emulator type hardware). Perhaps some thought on this earlier in the day...
by warm up I assume you mean use BEGIN() to guarantee the cache is preloaded before anything else in your script executes?
If you need the cache for your program to run properly, then yes, I think it would be a good idea.