Memory mapped files and atomic writes of single blocks

If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
How do I achieve the same effect on a memory mapped file?
Memory mapped files are simply byte arrays, so if I modify the byte array, the operating system has no way of knowing when I consider a write "done", so it might (even if that is unlikely) swap out the memory right in the middle of my block-writing operation, and in effect I write half a block.
I'd need some sort of "enter/leave critical section" mechanism, or some method of "pinning" the page of a file into memory while I'm writing to it. Does something like that exist? If so, is it portable across common POSIX systems and Windows?
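For what it's worth, page-pinning APIs do exist (POSIX mlock(2), Windows VirtualLock), but as far as I know they only guarantee residency, not writeback ordering: for a MAP_SHARED file mapping the kernel may still write a dirty page back in the middle of an update, so they don't buy you atomicity. A minimal POSIX sketch (file name illustrative):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR);
    if (fd < 0) return 1;

    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) return 1;

    /* Pin the page: it will stay resident in RAM while locked... */
    mlock(map, 4096);

    /* ...but for a MAP_SHARED file mapping the kernel may still write
       the dirty page back to disk at any point during this update, so
       this does NOT make the update atomic. */
    memcpy(map, "half-written?", 13);

    munlock(map, 4096);
    munmap(map, 4096);
    close(fd);
    return 0;
}
```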

The technique of keeping a journal seems to be the only way. I don't know how this works with multiple apps writing to the same file. The Cassandra project has a good article on how to get performance with a journal. The key thing to make sure of is that the journal only records positive actions (my first approach was to write the pre-image of each write to the journal, allowing you to roll back, but it got overly complicated).
So basically your memory-mapped file has a transactionId in the header; if your header fits into one block, you know it won't get corrupted, though many people seem to write it twice with a checksum: [header[cksum]] [header[cksum]]. If the first checksum fails, use the second.
The journal looks something like this:
[beginTxn[txnid]] [offset, length, data...] [commitTxn[txnid]]
You just keep appending journal records until the journal gets too big, then roll it over at some point. When you start up your program, you check whether the transaction id for the file matches the last transaction id of the journal -- if not, you play back all the transactions in the journal to sync up.
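A minimal sketch of the layout described above, with illustrative names and field sizes (no particular project's actual format):

```c
#include <stdint.h>

/* File header, written twice with a checksum so that a torn write of
   one copy can be detected and the other copy used instead. */
struct file_header {
    uint64_t txn_id;   /* last transaction id applied to the data file */
    uint32_t cksum;    /* checksum over txn_id */
};

/* Journal layout: [beginTxn[txnid]] [offset, length, data...] [commitTxn[txnid]] */
struct begin_txn  { uint64_t txn_id; };
struct write_rec  { uint64_t offset; uint32_t length; /* data bytes follow */ };
struct commit_txn { uint64_t txn_id; };
```

On startup you compare file_header.txn_id against the id of the last commitTxn record; if they differ, you replay each committed transaction's [offset, length, data] records in order and then update the header.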

If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
In the general case, the OS does not guarantee "writes of a block" done with "normal IO APIs" are atomic:
Blocks are more of a filesystem concept: a filesystem's block size may actually map to multiple disk sectors.
Assuming you meant sector, how do you know your write only mapped to one sector? There's nothing saying the I/O was aligned to a sector boundary once it has gone through the indirection of a filesystem.
There's nothing saying your disk HAS to implement sector atomicity. A "real disk" usually does, but it's not mandatory or a guaranteed property. Sadly your program can't "check" for this property unless it's an NVMe disk and you have access to the raw device, or you're sending raw commands that have atomicity guarantees to a raw device.
Further, you're usually concerned with durability over multiple sectors (e.g. if power loss happens, was the data I sent before this sector definitely on stable storage?). If there's any buffering going on, your write may still only be in RAM/disk cache unless you used another command to flush it first, or opened the file/device with flags requesting cache bypass and those flags were actually honoured.
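To illustrate that last point, on POSIX the usual ways to ask for durability are fsync(2) after the write, or opening with O_SYNC; a sketch, with error handling mostly elided (and whether the device honours the flush still depends on the layers below):

```c
#include <fcntl.h>
#include <unistd.h>

int write_durably(const char *path, const void *buf, size_t len) {
    /* O_SYNC requests that each write reach stable storage before
       returning; an explicit fsync() also flushes file metadata. */
    int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t n = write(fd, buf, len);
    int ok = (n == (ssize_t)len) && (fsync(fd) == 0);
    close(fd);
    return ok ? 0 : -1;
}
```

Even this only requests durability; as noted above, the sectors are really on stable media only if every layer honours the flush.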

Related

Does prefetching a write ever affect single core performance?

Some architectures have a "prefetch write" instruction to indicate to the CPU that you're going to be writing to a memory location before you actually do it. I understand that on a multicore machine this can be used by the core as a hint that it should try to get ownership of the given cache line now so that it can write to the location more quickly later. However, AFAICT that should only matter in situations when there are two cores potentially contending for the cache line. For a cache line that's only being read and written by a single core, does a prefetch write ever have any use?
All else being equal, Prefetch-Write has no benefit over Prefetch-Read for lines accessed only by a single core. After any kind of prefetch, the core will own the line in the Exclusive state. On a subsequent write, the line changes to the Modified state. An Exclusive-to-Modified transition is free since by definition no other core has the line. The E->M state change completes locally without snooping.
Beware that cores have their own hardware prefetch logic. An access to a line may cause the core to grab adjacent line(s) automatically. If global variables or other data reside nearby, an SMP system may experience a lot of unexpected cross-snooping.
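For reference, with GCC/Clang the hint is __builtin_prefetch; its second argument is 0 for read or 1 for write, and the third is a temporal-locality hint (0..3). A toy example:

```c
/* Prefetch the destination with a write hint and the source with a
   read hint a few iterations ahead; the distance (16) is arbitrary. */
void scale(double *dst, const double *src, long n) {
    for (long i = 0; i < n; i++) {
        if (i + 16 < n) {
            __builtin_prefetch(&dst[i + 16], 1, 3); /* will be written */
            __builtin_prefetch(&src[i + 16], 0, 3); /* read only */
        }
        dst[i] = 2.0 * src[i];
    }
}
```

As the answer above says, on a line touched by only one core the write hint should buy nothing over the read hint; it matters when another core may hold the line.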
I'd think that it could help if the cache line isn't in memory and the write prefetch flags that it will be needed some cycles from now. Housekeeping chores such as freeing up a line for the write could be done ahead of time. Surely this should allow the CPU to complete the write faster than if it simply unloaded a write out of the blue into the lap of the cache?
Or have I missed something fundamental?

How does MongoDB journaling work?

Here is my view, and I am not sure if it is right or wrong:
The journaling log is the "redo" log. It records the modification of the data files.
For example, if I want to change the field value of one record from 'a' to 'b', MongoDB will work out how to modify the dbfile (including all the namespaces, data, indexes and so on), then write those modifications to the journal.
After that, MongoDB makes the real modifications to the dbfile. If something goes wrong here, when MongoDB restarts it will read the journal (if it exists) and alter the dbfile to make the data set consistent.
So the journal does not record the data being changed, but rather how to change the dbfile.
Am I right? Where can I get more information about the journal's format?
EDIT: my original link to a 2011 presentation at MongoSF by Dwight is now dead, but there is a 2012 presentation by Ben Becker with similar content.
Just in case that stops working at some point too, I will give a quick summary of how the journal in the original MMAP storage engine worked, but it should be noted that with the advent of the pluggable storage engine model (MongoDB 3.0 and later), this now completely depends on the storage engine (and potentially the options) you are using - so please check.
Back to the original (MMAP) storage engine journal. At a very rudimentary level, the journal contains a series of queued operations and all operations are written into it as they happen - basically an append only sequential write to disk.
Once these operations have been applied and flushed to disk, they are no longer needed in the journal and can be aged out. In this sense the journal basically acts like a circular buffer for write operations.
Internally, the operations in the journal are stored in "commit groups" - a logical group of write operations. Once an operation is in a complete commit group it can be considered to be synced to disk as part of the journal (and will satisfy the j:true write concern for example). After an unclean shutdown, mongod will attempt to apply all complete commit groups that have not previously been flushed to disk, incomplete commit groups will be discarded.
The operations in the journal are not what you will see in the oplog; rather, they are a simpler set of files, offsets (disk locations, essentially), and data to be written at those locations. This allows for efficient replay of the data and a compact format for the journal, but makes the contents look like gibberish to most (as opposed to the aforementioned oplog, which is basically readable as JSON documents). This basically answers one of the questions posed: the journal does not have any awareness of the database file's contents and the changes to be made to it. It is even simpler than that: it basically only knows to go to disk location X and write data Y, that's it.
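A hedged sketch of what "go to disk location X and write data Y" amounts to, using a hypothetical record layout (this is not MongoDB's actual on-disk format):

```c
#include <stdint.h>
#include <unistd.h>

/* Hypothetical journal record: which data file, where, and what bytes. */
struct journal_rec {
    int            fd;      /* data file the write belongs to */
    uint64_t       offset;  /* byte offset within that file */
    uint32_t       length;  /* number of data bytes */
    const uint8_t *data;
};

/* Replay one record: the journal knows nothing about documents or
   indexes; it just re-issues the raw write at the recorded location. */
int replay_rec(const struct journal_rec *r) {
    return pwrite(r->fd, r->data, r->length, (off_t)r->offset)
               == (ssize_t)r->length ? 0 : -1;
}
```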
The write-ahead, sequential nature of the journal means that it fits nicely on a spinning disk, but the sequential access pattern will usually be at odds with the MMAP data access patterns (though not necessarily the access patterns of other engines). Hence it is sometimes a good idea to put the journal on its own disk or partition to reduce IO contention.

Why can't DBMSes rely on the OS buffer pool?

Stonebraker's paper (Operating System Support for Database Management) explains that "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the quoted one.
My understanding is that when a DBMS wants to read a block x it issues an ordinary read call. There should be no difference from any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page in from disk, it will involve at least one system call. At this point most DBMSs place the page into their own buffer. (They also end up in the OS's buffer, but that's unimportant.)
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
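A toy sketch of that check-the-cache-first logic, with hypothetical names (a real buffer pool adds an eviction policy, pinning, dirty-page tracking, and locking):

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE  8192
#define POOL_PAGES 128                 /* deliberately tiny */

struct frame {
    int64_t page_no;                   /* -1 while the slot is empty */
    char    data[PAGE_SIZE];
};

static struct frame pool[POOL_PAGES];

void pool_init(void) {
    for (int i = 0; i < POOL_PAGES; i++)
        pool[i].page_no = -1;
}

/* Return a pointer to the page (page_no assumed >= 0), touching the
   disk only on a miss. */
char *get_page(FILE *f, int64_t page_no) {
    struct frame *fr = &pool[page_no % POOL_PAGES];
    if (fr->page_no != page_no) {      /* miss: one read via the OS */
        fseeko(f, (off_t)page_no * PAGE_SIZE, SEEK_SET);
        fread(fr->data, 1, PAGE_SIZE, f);
        fr->page_no = page_no;
    }
    return fr->data;                   /* hit: no system call at all */
}
```

Because get_page resolves hits entirely in user space, the system call and copy that Stonebraker prices are paid only on misses.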
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Next, understand that disk i/o is a layered process. It was in 1981 and is even more so now. At the lowest level, a device driver issues physical read/write instructions to the hardware. Above that may be the o/s kernel code, then the o/s user space code, then the application. Between a C program's fread() and the disk heads moving, there are at least three or four levels, and there might be considerably more. A DBMS seeking to improve performance might bypass some of these layers and talk directly to the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and what gets cached (a dbms may be able to perform better caching for its own needs than the more generic device buffer caching).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and suchlike) than what a dbms would ideally like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby be able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
Often, a DBMS has a really good idea about its upcoming access patterns, and it can't communicate these patterns to the kernel. This can lead to lower performance.
The buffer cache is traditionally stored in a fixed-size kernel memory range, so it cannot grow or shrink. That also means the cache is much smaller than main memory, so by relying on the buffer cache a DBMS would be unable to take full advantage of the system's resources.
I know this is old, but it came up as unanswered.
Essentially:
The OS uses a separate address space for every process.
Retrieving information from any other address space requires a system call or page fault. **(see below)
The DBMS is a process with its own address space.
The OS buffer pool Stonebraker describes is in the kernel address space.
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
Here's the exact paragraph where he discusses using a local cache in the same process:
However, many DBMSs including INGRES [20] and System R [4] choose to put a DBMS managed buffer pool in user space to reduce overhead. Hence, each of these systems has gone to the trouble of constructing its own buffer pool manager to enhance performance.
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
** BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work. "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."

mmap writes to file on disk (synchronous/asynchronous)

I have a question regarding mmap functionality. When mmap is used in asynchronous mode, where the kernel takes care of persisting the data to the mapped file on disk, is it possible for earlier updates to overwrite later updates?
Let's say at time T we modify a location in memory that is memory mapped to a file on disk, and again at time T+1 we modify the same location. As the writes to the file are not synchronous, is it possible that the kernel first picks up the modification from time T+1 and then the one from time T, resulting in an inconsistent memory mapped file?
That's not really possible. The file is allowed to be inconsistent until msync(2) or munmap(2); when either happens, dirty (modified) pages are written to disk page by page (sometimes in larger batches, depending on the filesystem and kernel version). msync() allows you to specify a synchronous operation and invalidation of caches after the write has finished, which lets you ensure that the data in the cache is the same as the data in the file. Without that, it's possible that your program will see newer data while the file contains older data; the exact specifics of this rather hairy situation depend on the CPU architecture and the specific OS implementation of these routines.
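A minimal sketch of the msync(2) usage mentioned above (assumes "mapped.dat" exists and is at least one page long):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("mapped.dat", O_RDWR);
    if (fd < 0) return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    p[100] = 'a';                 /* time T   */
    p[100] = 'b';                 /* time T+1: same dirty page */

    /* The kernel writes back whole dirty pages, so the file may lag
       behind memory but cannot see T after T+1 for the same byte.
       MS_SYNC blocks until the page is on disk; MS_INVALIDATE asks
       that other cached mappings of the region be invalidated. */
    msync(p, 4096, MS_SYNC | MS_INVALIDATE);

    munmap(p, 4096);
    close(fd);
    return 0;
}
```

Note that msync() does not order individual stores within a page; it only controls when whole dirty pages reach the file.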

Do any common OS file systems use hashes to avoid storing the same content data more than once?

Many file storage systems use hashes to avoid duplication of the same file content data (among other reasons); e.g., Git uses SHA-1 and Dropbox uses SHA-256. The file names and dates can be different, but as long as the content hashes to the same value, it never gets stored more than once.
It seems this would be a sensible thing to do in an OS file system in order to save space. Are there any file systems for Windows or *nix that do this, or is there a good reason why none of them do?
This would, for the most part, eliminate the need for duplicate file finder utilities, because at that point the only space you would be saving would be for the file entry in the file system, which for most users is not enough to matter.
Edit: Arguably this could go on serverfault, but I feel developers are more likely to understand the issues and trade-offs involved.
ZFS has supported deduplication since last month: http://blogs.oracle.com/bonwick/en_US/entry/zfs_dedup
Though I wouldn't call it a "common" filesystem (afaik, it is currently only available on Solaris and *BSD), it is definitely one worth looking at.
It would save space, but the time cost is prohibitive. The products you mention are already I/O bound, so the computational cost of hashing is not a bottleneck. If you hashed at the filesystem level, all I/O operations, which are already slow, would get worse.
NTFS has single instance storage.
NetApp has supported deduplication (that's what it's called in the storage industry) in the WAFL filesystem (yeah, not your common filesystem) for a few years now. This is one of the most important features found in enterprise filesystems today (and NetApp stands out because it supports this on primary storage as well, as compared to other similar products which support it only on their backup or secondary storage, being too slow for primary storage).
The amount of data which is duplicate in a large enterprise with thousands of users is staggering. A lot of those users store the same documents, source code, etc. across their home directories. Reports of 50-70% data deduplicated have been seen often, saving lots of space and tons of money for large enterprises.
All of this means that if you create any common filesystem on a LUN exported by a NetApp filer, then you get deduplication for free, no matter what filesystem is created in that LUN. Cheers.
btrfs supports de-duplication of data at the block level, but it is done out-of-band: an external tool is needed, and I'd recommend duperemove.
It would require a fair amount of work to make this work in a file system. First of all, a user might be creating a copy of a file, planning to edit one copy while the other remains intact -- so when you eliminate the duplication, the hard link you created that way would have to provide COW semantics.
Second, the permissions on a file are often based on the directory into which that file's name is placed. You'd have to ensure that when you create your hidden hard link, that the permissions were correctly applied based on the link, not just the location of the actual content.
Third, users are likely to be upset if they make (say) three copies of a file on physically separate media to ensure against data loss from hardware failure, then find out that there was really only one copy of the file, so when that hardware failed, all three copies disappeared.
This strikes me as a bit like a second-system effect -- a solution to a problem long after the problem ceased to exist (or at least matter). With hard drives currently running less than US$100/terabyte, I find it hard to believe that this would save most people a whole dollar's worth of hard drive space. At that point, it's hard to imagine most people caring much.
There are file systems that do deduplication, which is sort of like this, but still noticeably different. In particular, deduplication is typically done on the basis of relatively small blocks of a file, not on complete files. Under such a system, a "file" basically just becomes a collection of pointers to de-duplicated blocks. Along with the data, each block will typically have some metadata for the block itself, separate from the metadata for the file(s) that refer to that block (e.g., it'll typically include at least a reference count).
Any block that has a reference count greater than 1 will be treated as copy on write. That is, any attempt at writing to that block will typically create a copy, write to the copy, then store the copy back to the pool (so if the result comes out the same as some other block, deduplication will coalesce it with the existing block with the same content).
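A sketch of the bookkeeping just described, with illustrative names; a real system would key the pool on a content hash such as SHA-256, where this sketch just does a memcmp scan to stay self-contained:

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define POOL_MAX   1024

struct block {
    uint32_t refcount;            /* >1 means shared between files */
    uint8_t  data[BLOCK_SIZE];
};

static struct block pool[POOL_MAX];
static size_t       pool_used;

/* Find-or-insert: return the existing block with the same contents,
   or store the data as a new block. */
static struct block *intern(const uint8_t data[BLOCK_SIZE]) {
    for (size_t i = 0; i < pool_used; i++)
        if (pool[i].refcount > 0 &&
            memcmp(pool[i].data, data, BLOCK_SIZE) == 0)
            return &pool[i];
    struct block *nb = &pool[pool_used++];   /* no overflow check: sketch */
    memcpy(nb->data, data, BLOCK_SIZE);
    return nb;
}

/* A "file" is just a sequence of pointers to deduplicated blocks. */
struct file {
    struct block **blocks;
    size_t         nblocks;
};

/* Copy-on-write update of block i: never write a shared block in place. */
void write_block(struct file *f, size_t i, const uint8_t data[BLOCK_SIZE]) {
    struct block *nb = intern(data);   /* may coalesce with an existing block */
    nb->refcount++;
    if (f->blocks[i])
        f->blocks[i]->refcount--;      /* old block reclaimable at zero */
    f->blocks[i] = nb;
}
```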
Many of the same considerations still apply though--most people don't have enough duplication to start with for deduplication to help a lot.
At the same time, especially on servers, deduplication at a block level can serve a real purpose. One really common case is dealing with multiple VM images, each running one of only a few choices of operating systems. If we look at the VM image as a whole, each is usually unique, so file-level deduplication would do no good. But they still frequently have a large chunk of data devoted to storing the operating system for that VM, and it's pretty common to have many VMs running only a few operating systems. With block-level deduplication, we can eliminate most of that redundancy. For a cloud server system like AWS or Azure, this can produce really serious savings.