Fastest way to reset or free copy-on-write mapped area - mmap

I have a CoW region of memory that I need to reset to the original state.
Sadly, MADV_DONTNEED behaves exactly the same as munmap, and is seemingly freeing all pages. munmap is extremely expensive and the performance is horrendous to say the least, and its way cheaper to create a new $thing from scratch using MAP_ANONYMOUS, initialize it manually, then munmap that. That makes zero sense to me, and just shows something is really broken with mmap and CoW mappings. Unfortunately for me I really need CoW. It's that or memcpy from one range to another, and since we are in 2021 I expect that Linux will be able to do copy-on-write.
See: https://kostja.github.io/2012/04/04/1111.html
I would like to only discard the dirty pages of my memory range.
munmap anon: 435110ns (435 micros)
munmap memfd: 21958015ns (21958 micros)
This is the average time when freeing 400x 128MB ranges. If I only free 1 range then I get sane numbers, so there's some kind of bad scaling going on in the kernel that I don't understand. The tmpfs-backed area is untouched after allocating it with MAP_NORESERVE. This is completely insane. Are memfd (tmpfs-backed) files just that slow?

The fastest way ended up being to use the hardware virtualization itself to implement copy-on-write mechanisms. It ended up being extremely complex, fraught with footguns, but most importantly very fast. It is possible to use just a few pages of working memory to call into a copy-on-write VM. Most of the pages are duplicated page table entries.
Additionally, this opens up the possibility for copies of copies, as well as flattening a copy-on-write VM so that it can be used as master.
Linux has no support for this whatsoever.

Related

What happens when an AppendStructuredBuffer overflows in a compute shader?

I have a Unity project in which I'm writing to an AppendStructuredBuffer<Triangle> via Append(triangle) in a compute shader.
In this instance, I know the theoretical limit to the number of triangles that could exist, so the obvious correct approach is to size the buffer accordingly. As a hack, though, I'm experimenting with allocating drastically smaller buffers so that they can be more efficiently processed by other parts of the system (in particular, reading back to CPU). One could imagine other situations in which a specific limit may not be known, or could be wrongly assumed.
Clearly, this is potentially hazardous. I'm sure there are more robust approaches that could be used for my current system (or more generally) without sacrificing performance, but I'm not (particularly) asking for advice on that.
What I want to know is what the expected behaviour is when a program calls Append() beyond the capacity of such a buffer. I imagine that it is undefined, and potentially liable to corrupt other areas of VRAM, to an extent dependent on GPU drivers / DirectX version etc. It may be that it is more formally specified, but I haven't been able to find that out.
Of course, even if the behaviour is specified, it seems somewhat reckless to deliberately risk. Still, I'd like to know:
Whether it is possible to detect that such a buffer is full in the context of a kernel function (given the highly threaded nature this is likely impractical).
What the performance implications of that are if it is possible.
What the consequences of overflowing are (in this instance I'm specifically anticipating it, but bugs happen).
How all of the above might be expected to differ for different hardware vendors, APIs, etc.
Perhaps it is 'safe' to the extent that excess data will simply be lost to the void without cost. In any case the system can - for example - periodically check fullness of buffers and do any extra housework that may be necessary... leaving the question of how severe any mistakes in the tuning of such a system might be.
Under many circumstances, at least in DirectX, out of bounds access is defined as returning 0. I'm still not totally sure about writes, but think there is reason to believe they should be generally safe in current implementations.
I would still be very wary of relying on this, especially when using other APIs.
According to this specification,
5.3.10.2 Using Unordered Count and Append Buffers
...
The counter behind imm_atomic_alloc and imm_atomic_consume has no
overflow or underflow clamping, and there is no feedback given to the
shader as to whether overflow/underflow happened (wrapping of the
counter). The only thing the counter really accomplishes is a way of
generating unique addresses that is conveniently bundled with the UAV.
Further, https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm#inst_IMM_ATOMIC_ALLOC
There is no clamping of the count, so it wraps on overflow.
I don't think I'm wrong in interpretting 'wrapping' as being to the length of the buffer in these instances.
So, the answer as I understand it is that on Append() the internal counter will wrap, and subsequent invocations will end up overwriting earlier data. As it happens, I am currently rendering my buffer without reference to such a counter (because I do another pass on the 'triangles' to turn them into vertices for rendering, which I currently do on a non-AppendBuffer). I should experiment with passing a buffer with a count to that draw call, which should allow me to verify whether most of my model suddenly disappears when I overflow.
In any case, it seems that the operation should be safe in terms of not corrupting other parts of the system, but that referring to the counter may be the wrong way to detect problems.

Scala: performance boost on incremental garbage collection

I have written an application in Scala. Basically, the first step is to create a array of objects an then to initialise these objects from a csv file. When running the application on the jvm it is really slow, and after some experimenting I found out that using the -J-Xincgc flag which enables incremental garbage collection speeds up the application by a factor of 4 (it's 4 times faster with the switch!). I wonder:
Why?
Did I use some inefficient coding, and if so, where should I start to find out whats going on?
Thanks!
I'll assume you're running this on hotspot.
The hotspot JVM has a whole zoo of garbage collectors, most of which also may have some sort of sub-modes or various command-line switches that significantly alter their behavior.
Which GC is used by default varies based on JVM version, operating system and 32/64bit VM.
So you basically changed whatever the default was to a specific algorithm that happened to perform "faster" for your workload.
But "faster" is a fuzzy measure. Wall time is not the same as CPU cycles spent if you consider multi-threading. And some collectors may simply choose to grow the heap more aggressively, thus deferring the cost of collection to a later point in time, which you might not have measured if your program didn't run long enough.
To make an accurate assessment much more information would be needed
what GC was used by default
your VM version
how many cores your CPU has
what kind of workload do you have (multi/single-thread, long/short-running, expected memory footprint, object allocation rate)
Oracle's GC tuning guide may prove useful for you
In your case, -Xincgc translates to CMS in incremental mode, which is intended for single-core environments and has been deprecated as of java8. It probably just happened to be better than the default, but it's not necessarily an optimal choice.
If you get into a situation where you are running close to your heap-size limit, you can waste a lot of GC time, which can lead to a lot of false findings about performance. If that's your situation, first increase your heap-size limit before doing anything else. Consider use of jvisualvm to eyeball the situation - it's trivially easy to get started with.

Key Value storage without a file system?

I am working on an application, where we are writing lots and lots of key value pairs. On production the database size will run into hundreds of Terabytes, even multiple Petabytes. The keys are 20 bytes and the value is maximum 128 KB, and very rarely smaller than 4 KB. Right now we are using MongoDB. The performance is not very good, because obviously there is a lot of overhead going on here. MongoDB writes to the file system, which writes to the LVM, which further writes to a RAID 6 array.
Since our requirement is very basic, I think using a general purpose database system is hitting the performance. I was thinking of implementing a simple database system, where we could put the documents (or 'values') directly to the raw drive (actually the RAID array), and store the keys (and a pointer to where the value lives on the raw drive) in a fast in-memory database backed by an SSD. This will also speed-up the reads, as all there would not be no fragmentation (as opposed to using a filesystem.)
Although a document is rarely deleted, we would still have to maintain a pool of free space available on the device (something that the filesystem would have provided).
My question is, will this really provide any significant improvements? Also, are there any document storage systems that do something like this? Or anything similar, that we can use as a starting poing?
Apache Cassandra jumps to mind. It's the current elect NoSQL solution where massive scaling is concerned. It sees production usage at several large companies with massive scaling requirements. Having worked a little with it, I can say that it requires a little bit of time to rethink your data model to fit how it arranges its storage engine. The famously citied article "WTF is a supercolumn" gives a sound introduction to this. Caveat: Cassandra really only makes sense when you plan on storing huge datasets and distribution with no single point of failure is a mission critical requirement. With the way you've explained your data, it sounds like a fit.
Also, have you looked into redis at all, at least for saving key references? Your memory requirements far outstrip what a single instance would be able to handle but Redis can also be configured to shard. It isn't its primary use case but it sees production use at both Craigslist and Groupon
Also, have you done everything possible to optimize mongo, especially investigating how you could improve indexing? Mongo does save out to disk, but should be relatively performant when optimized to keep the hottest portion of the set in memory if able.
Is it possible to cache this data if its not too transient?
I would totally caution you against rolling your own with this. Just a fair warning. That's not a knock at you or anyone else, its just that I've personally had to maintain custom "data indexes" written by in house developers who got in way over their heads before. At my job we have a massive on disk key-value store that is a major performance bottleneck in our system that was written by a developer who has since separated from the company. It's frustrating to be stuck such a solution among the exciting NoSQL opportunities of today. Projects like the ones I cited above take advantage of the whole strength of the open source community to proof and optimize their use. That isn't something you will be able to attain working on your own solution unless you make a massive investment of time, effort and promotion. At the very least I'd encourage you to look at all your nosql options and maybe find a project you can contribute to rather than rolling your own. Writing a database server itself is definitely a nontrivial task that needs a huge team, especially with the requirements you've given (but should you end up doing so, I wish you luck! =) )
Late answer, but for future reference I think Spider does this

Why can't DMBSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that, "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the quoted.
My understanding is that when a DBMS wants to read a block x it issues a common read instruction. There should be no difference from that of any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page from disk it will involve at least one system call. At his point most DBMSs place the page into their own buffer. (They also end up in the OS' buffer, but that's unimportant).
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Firstly, understand that disk i/o is a layered process. It was in 1981 and is even more so now. At the lowest point, a device driver will issue physical read/write instructions to the hardware. Above that may be the o/s kernel code then the o/s user space code then the application. Between a C program's fread() and the disk heads moving, there are at least three or four levels and might be considerably more. The DBMS may seek to improve performance might seek to bypass some layers and talk directly with the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and what gets cached (a dbms may be able to peform better cache for its needs than the more generic device buffer caching).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and such like) than what a dbms ideally would like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
Often, DBMS's have a really good idea about its upcoming access patterns, and it can't communicate these patterns to the kernel. This can lead to lower performance.
The buffer cache is traditional stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by using the buffer cache a DBMS would be unable to take advantage of system resources.
I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address spaces for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
However, many DBMSs including INGRES
[20] and System R [4] choose to put a
DBMS managed buffer pool in user space
to reduce overhead. Hence, each of
these systems has gone to the
trouble of constructing its own
buffer pool manager to enhance
performance.
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."

Is it a good idea to warm up a cache in the BEGIN block in Perl?

Is it a good idea to warm up cache in the BEGIN block, when it gets used?
You didn't really provide any information on what kind of environment you're talking about, which I think is important. In most cases the answer is probably "no", but I can think of one case where it's a definite yes, which is preforking servers -- web applications and the like. In that case, any work that you can do "before the fork" not only saves the cost of having the children recompute the same values individually, it alo saves memory, since the pages containing the results can be shared across all of the child processes by the OS's COW mechanism.
If you're talking about a module you're writing and not an application, then I'd say no, don't lift things to compilation time without the user's permission unless they're things that have to be done for the module to work. Instead, provide a preheat_cache class method, and if your caller has a reason to need a hot cache at compile time they can put the call into a BEGIN block themselves. You could also use a :preheat_cache import tag but that's unnecessarily fancy in my book.
If it's a choice between preloading your cache at compile time, or preloading your cache as the first thing you do at run time, there's virtually no difference.
If your cache is large enough that loading it will trigger a lot of page swaps, that's an argument for waiting until run time. That way, all your module loading and other compile time code can be done while your system is under a lighter load.
I'm going to go with "no", even though I could be wrong. Reasoning goes like this: keep the code, and data it uses, small, so that it takes up less space in any caches (I am presuming you mean CPU cache, not programmatic hashes with common query results or some such thing).
Unless you see some sort of bad access pattern, trying to second guess what needs to be prefetched is probably useless at best. In fact such code or initialization data is likely to displace something you (or another process on the system) were actually using. Think about what you can do in the actual work part of the code to maximize locality of reference, to try to stay within smaller memory regions at any one time.
I used to use "top" to detect when processes were swapping between memory and disk. I don't know of any good tools yet to tell how often a process is getting cache misses and going to plain old slow mo'board memory. There must be such tools, I just don't know what they are yet (software tools, rather than some custom In Circuit Emulator type hardware). Perhaps some thought on this earlier in the day...
by warm up I assume you mean use BEGIN() to guarantee the cache is preloaded before anything else in your script executes?
If you need the cache for your program to run properly, then yes, I think it would be a good idea.