mmap writes to file on disk (synchronous/asynchronous)

I have a question regarding mmap functionality. When mmap is used in asynchronous mode, where the kernel takes care of persisting the data to the mapped file on disk, is it possible for earlier updates to overwrite later updates?
Let's say at time T we modify a location in memory that is memory-mapped to a file on disk, and at time T+1 we modify the same location again. Since the writes to the file are not synchronous, is it possible that the kernel first picks up the modification from time T+1 and then the modification from time T, resulting in an inconsistent memory-mapped file?

It's not exactly possible. The kernel does not queue a separate write for each store you make; when a dirty page is flushed, whatever the page contains at that moment (the T+1 version in your example) is what reaches the disk. The file is allowed to be inconsistent until msync(2) or munmap(2); when that happens, dirty (modified) pages are written to disk page by page (sometimes in larger units, depending on the filesystem, in newer kernels). msync() lets you request a synchronous write and invalidation of cached copies once the write has finished, which ensures that the data in the cache is the same as the data in the file. Without that, it's possible that your program sees newer data while the file still contains older data; the exact specifics of this rather hairy situation depend on the CPU architecture and the specific OS implementation of these routines.
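A minimal sketch of that synchronous path on a POSIX system (the file name is hypothetical, and the file is assumed to already exist and be non-empty):

    /* Flush a modified shared mapping to disk synchronously and drop stale cached copies. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR);          /* hypothetical existing file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 'A';   /* modification at time T */
        p[0] = 'B';   /* modification at time T+1 replaces it in memory */

        /* Only the page's current contents (the 'B' version) are written out.
         * MS_SYNC blocks until the write completes; MS_INVALIDATE asks the
         * kernel to invalidate other cached copies of the mapped range. */
        if (msync(p, st.st_size, MS_SYNC | MS_INVALIDATE) < 0)
            perror("msync");

        munmap(p, st.st_size);
        close(fd);
        return 0;
    }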

Related

Append-only file write with fsync, on emmc/SSD/sdcard, ext4 or f2fs?

I am building two operating systems for IoT: Libertas OS and Hornet OS.
The data APIs are designed around append-only time series. fsync() is required after each appended block of bytes to ensure data safety (a minimal sketch of this pattern follows below).
The storage could be eMMC, SSD, or an SD card. The question is: which filesystem is a better fit for the different storage types?
I understand f2fs is designed to be append-only (log-structured). But what about ext4? I couldn't easily find information about it.
Theoretically, at least for the file content, an append should keep writing into the current underlying block to minimize wear. Since the file size changes after an append, the file metadata also has to be updated, ideally through an append-only log.
I also don't know the details of the internal controllers of SD cards and eMMC; will the controller honor such block-level appends?
Any insight will be greatly appreciated!
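For reference, a minimal sketch of the append-then-fsync pattern described in the question, assuming a plain POSIX file API (the path, block size, and helper name are made up for illustration):

    /* Append-only writer that fsync()s after every block. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int append_block(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);     /* O_APPEND: always written at EOF */
        if (n < 0 || (size_t)n != len)
            return -1;
        /* Force the data and the updated file size (metadata) to stable storage. */
        return fsync(fd);
    }

    int main(void)
    {
        int fd = open("series.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (fd < 0) { perror("open"); return 1; }

        char block[512] = {0};               /* one time-series record */
        if (append_block(fd, block, sizeof block) < 0)
            perror("append_block");

        close(fd);
        return 0;
    }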

Why data is fetched from main memory in Write Allocate cache policy

With a write-allocate cache policy, when a write miss occurs, the data is fetched from main memory and then updated as a write hit.
My question is: assuming a write-back policy on write hits, why is the data read from main memory if it is immediately being updated by the CPU? Can't we just write to the cache without fetching the data from main memory?
On a store that hits in L1d cache with the line in Exclusive or Modified state, you don't need to fetch or RFO anything, because the line is already exclusively owned.
Normally you're only storing to part of the line, so you need a copy of the whole line in order to have it in Modified state. You need to do a Read For Ownership (RFO) if you don't already have a valid Shared copy of the line (a Shared copy could be promoted to Exclusive and then Modified just by invalidating the other copies; see MESI).
A full-line store (like an x86 AVX-512 vmovdqa [rdi], zmm0 64-byte store) can just invalidate instead of doing a Read For Ownership, and simply wait for an acknowledgement that no other cores have a valid copy of the line. I don't know whether that actually happens for AVX-512 stores specifically in current x86 microarchitectures (a sketch of such a store follows below).
Skipping the read (and just invalidating any other copies) definitely does happen in practice in some CPUs in some cases, e.g. in the store protocol used by microcode to implement x86 rep stos and rep movs, which are basically memset / memcpy. For large counts they definitely store full lines, and it's worth avoiding the memory traffic of reading first. See Andy Glew's comments, which I quoted in What setup does REP do? - he designed P6's (Pentium Pro) fast-strings microcode when he was working at Intel, and says it included a no-RFO store protocol.
See also Enhanced REP MOVSB for memcpy
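To make the full-line store mentioned above concrete, here is a sketch using AVX-512 intrinsics in C; whether the hardware actually elides the RFO for it is microarchitecture-dependent, as noted, and the function name is made up:

    /* A 64-byte store that covers a whole cache line on typical x86 CPUs.
     * Compile with -mavx512f; requires an AVX-512 capable CPU to run. */
    #include <immintrin.h>
    #include <stdalign.h>

    void fill_line(void *dst)                 /* dst must be 64-byte aligned */
    {
        __m512i zero = _mm512_setzero_si512();
        _mm512_store_si512(dst, zero);        /* a full-line vmovdqa-style zmm store */
    }

    int main(void)
    {
        alignas(64) char line[64];
        fill_line(line);
        return 0;
    }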

Who places the data into the cache?

We have locality of reference, on the basis of which data is copied into the cache, but who is responsible for this?
Is there hardware, or is there software, that performs this action?
The CPU reads/writes data into the cache when an instruction that accesses memory is executed.
So it's an on-demand service: data is moved upon a request.
It then tries to keep the data in the cache as long as possible, until there is no more space and a replacement policy is used to evict a line in favor of new data.
The minimal unit of data transferred is called a line, and it is usually bigger than the register size (to improve locality).
Some CPUs have a prefetcher that, upon recognizing specific memory access patterns, tries to automatically move data into the cache before it is actually requested by the program.
Some architectures have instructions that act as hints for the CPU to prefetch data from a specific address.
This lets the software have a minimal amount of control over the prefetching circuitry; however, if the software just wants to move data into the cache, it only has to read the data (the CPU will cache it, if caching is enabled for that region).
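As an illustration of such a hint, here is a sketch using the GCC/Clang __builtin_prefetch builtin (the prefetch distance of 16 elements is an arbitrary choice, and the CPU is free to ignore the hint):

    #include <stddef.h>

    long sum(const long *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0 /* read */, 3 /* keep in cache */);
            s += a[i];   /* the demand load is what actually pulls data into the cache */
        }
        return s;
    }

    int main(void)
    {
        long data[64] = {0};
        return (int)sum(data, 64);
    }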

Memory usage of zfs for mapped files

I read the following on https://blogs.oracle.com/roch/entry/does_zfs_really_use_more
There is one peculiar workload that does lead ZFS to consume more memory: writing (using syscalls) to pages that are also mmaped. ZFS does not use the regular paging system to manage data that passes through read and write syscalls. However, mmaped I/O, which is closely tied to the Virtual Memory subsystem, still goes through the regular paging code. So syscall writing to mmaped pages means we will keep 2 copies of the associated data, at least until we manage to get the data to disk. We don't expect that type of load to commonly use a large amount of RAM.
What does this mean exactly? Does this mean that ZFS will "uselessly" double-cache any memory region that is backed by a memory-mapped file? Or does "using syscalls" mean writing using some other method of writing that I am not familiar with?
If so, am I better off keeping the working directories of files written this way on a UFS partition?
Does this mean that zfs will "uselessly" double cache any memory region that is backed by a memory mapped file?
Hopefully, no.
or does "using syscalls" mean writing using some other method of writing that I am not familiar with.
That method is just regular low-level write(fd, buf, nbytes) system calls and the like, not what memory-mapped files are designed to support: accessing file content simply by reading and writing memory through pointers, treating the file data as a byte array (the sketch after this answer contrasts the two).
If so, am I better off keeping the working directories of files written this way on a ufs partition?
No, unless memory mapped files that are also written to using system calls sum to a significant part of your RAM workload, which is quite unlikely to happen.
PS: Note that this blog is almost ten years old. There might have been changes in the implementation since that time.
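For concreteness, the workload the quoted blog post is talking about looks roughly like this (the file name is hypothetical, and the file is assumed to be at least one page long):

    /* The same file region is written both through a write() system call and
     * through an mmap()ed pointer. On ZFS the syscall path is cached by ZFS
     * itself (the ARC), while the mapped path goes through the regular page
     * cache, so the data can temporarily exist twice in RAM. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDWR);   /* hypothetical file, >= 4 KiB long */
        if (fd < 0) { perror("open"); return 1; }

        char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* Path 1: a plain syscall write to the file. */
        if (pwrite(fd, "via write()", 11, 0) < 0)
            perror("pwrite");

        /* Path 2: a write through the memory mapping. */
        memcpy(map + 100, "via mmap", 8);

        munmap(map, 4096);
        close(fd);
        return 0;
    }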

Memory Mapped files and atomic writes of single blocks

If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
How do I achieve the same effect on a memory mapped file?
Memory-mapped files are simply byte arrays, so if I modify the byte array, the operating system has no way of knowing when I consider a write "done", so it might (even if that is unlikely) write the memory out to the file right in the middle of my block-writing operation, and in effect I'd write half a block.
I'd need some sort of "enter/leave critical section" mechanism, or some method of "pinning" the page of a file into memory while I'm writing to it. Does something like that exist? If so, is it portable across common POSIX systems and Windows?
Keeping a journal seems to be the only way. I don't know how this works with multiple apps writing to the same file. The Cassandra project has a good article on how to get performance with a journal. The key thing to make sure of is that the journal only records positive actions (my first approach was to write the pre-image of each write to the journal, allowing rollback, but it got overly complicated).
So basically your memory-mapped file has a transactionId in the header; if your header fits into one block, you know it won't get corrupted, though many people seem to write it twice with a checksum: [header[cksum]] [header[cksum]]. If the first checksum fails, use the second.
The journal looks something like this:
[beginTxn[txnid]] [offset, length, data...] [commitTxn[txnid]]
You just keep appending journal records until the journal gets too big, then roll it over at some point. When your program starts up, you check whether the transaction id of the file matches the last transaction id in the journal; if not, you replay all the transactions in the journal to sync up.
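A rough sketch of that record layout in C; the field names, sizes, and the startup check shown here are assumptions for illustration, not a fixed format:

    #include <stdint.h>
    #include <stdio.h>

    struct journal_begin  { uint64_t txn_id; };   /* [beginTxn[txnid]]         */
    struct journal_write  {                       /* [offset, length, data...] */
        uint64_t offset;
        uint32_t length;                          /* 'length' bytes of data follow */
    };
    struct journal_commit { uint64_t txn_id; };   /* [commitTxn[txnid]]        */

    /* Header of the memory-mapped file, written twice, each copy with its own
     * checksum; if the first copy fails verification, fall back to the second. */
    struct file_header {
        uint64_t last_txn_id;
        uint32_t checksum;
    };

    int main(void)
    {
        /* Startup check from the answer: if the file's transaction id is behind
         * the journal's last committed transaction, replay the journal. */
        struct file_header hdr = { .last_txn_id = 41, .checksum = 0 }; /* made-up values */
        uint64_t last_committed_in_journal = 42;

        if (hdr.last_txn_id != last_committed_in_journal)
            printf("replay journal transactions %llu..%llu\n",
                   (unsigned long long)(hdr.last_txn_id + 1),
                   (unsigned long long)last_committed_in_journal);
        return 0;
    }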
If I read and write a single file using normal IO APIs, writes are guaranteed to be atomic on a per-block basis. That is, if my write only modifies a single block, the operating system guarantees that either the whole block is written, or nothing at all.
In the general case, the OS does not guarantee "writes of a block" done with "normal IO APIs" are atomic:
Blocks are more of a filesystem concept - a filesystem's block size may actually map to multiple disk sectors...
Assuming you meant sector: how do you know your write only mapped to a single sector? There's nothing saying the I/O was well aligned to a sector boundary once it has gone through the indirection of a filesystem.
There's nothing saying your disk HAS to implement sector atomicity. A "real disk" usually does, but it's not a mandatory or guaranteed property. Sadly, your program can't "check" for this property unless it's an NVMe disk and you have access to the raw device, or you're sending raw commands that have atomicity guarantees to a raw device.
Further, you're usually concerned with durability over multiple sectors (e.g. if power loss happens, was the data I sent before this sector definitely on stable storage?). If there's any buffering going on, your write may still have been only in RAM or the disk cache, unless you used another command to check first, or opened the file/device with flags requesting cache bypass and those flags were actually honoured.
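As an illustration of that last point, these are two common ways to ask for data to go past the volatile caches on a POSIX system (the file name is made up, and what O_SYNC or fdatasync() actually guarantees still depends on the filesystem and the device):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Option 1: open with O_SYNC so each write() returns only after the data
         * (and the metadata needed to retrieve it) reaches stable storage. */
        int fd_sync = open("important.dat", O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd_sync < 0) { perror("open"); return 1; }
        if (write(fd_sync, "x", 1) < 0)
            perror("write");
        close(fd_sync);

        /* Option 2: buffered writes followed by an explicit fdatasync()/fsync()
         * barrier before relying on the data being durable. */
        int fd = open("important.dat", O_WRONLY | O_APPEND);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, "y", 1) < 0)
            perror("write");
        if (fdatasync(fd) < 0)
            perror("fdatasync");
        close(fd);
        return 0;
    }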