If I am writing 100 bytes at a time, up to 1 KB, in a loop through an mmap'd region, the data can get flushed to disk midway or at the end, since mmap gives no strong guarantee about when the contents reach the disk. The only guarantee is that, provided the system doesn't reboot, they will be flushed eventually. My question is about ordering: if a flush does happen midway, will the sequence of data written through mmap be preserved up to the point of the flush?
More specifically -
Suppose I write to page1 and then to page2 of the mapping. Even though the kernel knows page2 was updated after page1, may it still flush page2 to disk before page1? Is this possible?
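For concreteness, here is a rough Python sketch of what I am doing, using the mmap module on Linux (the file name and sizes are made up); the explicit flush() call, which wraps msync(), marks where I could force the ordering myself if it is not guaranteed:

```python
import mmap
import os

PAGE = mmap.PAGESIZE

# Sketch of the write pattern: a file-backed mapping spanning two pages,
# written 100 bytes at a time (the file name and sizes are made up).
fd = os.open("data.bin", os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 2 * PAGE)
mm = mmap.mmap(fd, 2 * PAGE, mmap.MAP_SHARED, mmap.PROT_READ | mmap.PROT_WRITE)

record = b"x" * 100
for i in range(10):                       # fill the first 1 KB of page 1
    mm[i * 100:(i + 1) * 100] = record

# The kernel tracks which pages are dirty, not the order in which they were
# dirtied, so it may write page 2 back before page 1. To force ordering,
# flush (msync) the first page before dirtying the second.
mm.flush(0, PAGE)

mm[PAGE:PAGE + 100] = record              # now write into page 2
mm.flush()                                # final sync of the whole mapping

mm.close()
os.close(fd)
```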
I understand the process of swapping, but have a question about the swap space.
As far as I know, when I execute a program, main memory fetches the data from disk because the data is not yet in the cache or in memory. So what about swap space? Is swap space used only as backing storage when pages need to be swapped out, or does the program place all of its data into swap space when it starts, so that on a page fault it can swap the page back in?
Swapping has largely disappeared, although Microsoft has recently reintroduced it to Windows for certain processes.
With swapping, the entire process is moved out of memory and stored on disk; the operating system swaps to make room in memory for other processes. In the days of 64K address spaces, transferring a whole process between memory and disk was not that time consuming.
Swapping has largely been replaced by paging, whereby individual pages of memory are moved to secondary storage rather than the entire process.
This may depend on the OS, but in general, as I understand it, when a page fault occurs (the desired page is not in main memory), the OS instructs the CPU to read the page from disk. I am wondering: does the OS dispatch another process while the disk I/O is in progress? And if it does, will there be a complete flush of the TLB on the context switch?
More or less, but a page fault doesn't always mean the page is on disk (the page could also not exist at all, be a lazily-allocated page, be a copy-on-write page that was written to, exist but be marked unreadable/unwritable, etc.). But if the page really does have to come from disk, the OS is probably going to schedule another thread at least, because disk I/O takes approximately forever.
The amount of switching necessary depends on what it switches to, switching between threads from the same context doesn't imply a TLB flush. If a TLB flush is necessary, it's probably not a complete flush, because of global pages (so typically, you're not flushing out TLB entries for kernel pages). There is also PCID to avoid complete flushes (flushing can be limited to specified process context IDs), but that's quite recent, and tricky to use since there are only 4096 different IDs.
Process-specific pages are marked as non-global entries via the nG (non-global) bit in the TLB entry, and the entry also stores the task's ASID (Address Space ID, in ARM's terminology).
The ARM documentation linked below lays out this concept clearly:
"For non-global entries, when the TLB is updated and the entry is marked as non-global, a value is stored in the TLB entry in addition to the normal translation information. This value is called the Address Space ID (ASID), which is a number assigned by the OS to each individual task. Subsequent TLB look-ups only match on that entry if the current ASID matches with the ASID that is stored in the entry. This permits multiple valid TLB entries to be present for a particular page marked as non-global, but with different ASID values. In other words, we do not necessarily need to flush the TLBs when we context switch."
Source: https://developer.arm.com/documentation/den0024/a/The-Memory-Management-Unit/Context-switching
Using MongoDB (via PyMongo) in the default "acknowledged" write concern mode, is it the case that if I have a line that writes to the DB (e.g. a mapReduce that outputs a new collection) followed by a line that reads from the DB, the read will always see the changes from the write?
Further, is the above true for all stricter write concerns than "acknowledged," i.e. "journaled" and "replica acknowledged," but not true in the case of "unacknowledged"?
If the write has been acknowledged, it should have been written to memory, thus any subsequent query should get the current data. This won't work if you have a replica set and allow reads from secondaries.
Journaled writes are written to the journal file on disk, which protects your data in case of power / hardware failures, etc. This shouldn't have an impact on consistency, which is covered as soon as the data is in memory.
Any replica configuration in the write concern will ensure that writes need to be acknowledged by the majority / all nodes in the replica set. This will only make a difference if you read from replicas or to protect your data against unreachable / dead servers.
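For instance, a minimal PyMongo sketch of that caveat, assuming a replica set (the connection string and database/collection names are invented for illustration):

```python
from pymongo import MongoClient, ReadPreference

# Hypothetical replica set; the connection string and db/collection names
# are made up for illustration.
client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
coll = client.mydb.results

# Default: reads go to the primary, so an acknowledged write is visible
# to the very next query on the same connection.
coll.insert_one({"_id": 1, "state": "done"})
doc = coll.find_one({"_id": 1})                    # sees the write

# Allowing secondary reads trades that guarantee for lower primary load:
# the write may not have replicated yet when this query runs.
relaxed = coll.with_options(read_preference=ReadPreference.SECONDARY_PREFERRED)
maybe_stale = relaxed.find_one({"_id": 1})         # may or may not see it
```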
For example, in the case of the WiredTiger storage engine, there is a cache of pages in memory that are periodically written to and read from disk, depending on memory pressure. In the case of the MMAPv1 storage engine, there is a memory-mapped address space that corresponds to pages on the disk. There is also a secondary structure called the journal: a log of every single write the database processes. Notice that the journal, too, starts out in memory.
When does the journal get written to disk?
The app makes a request to the MongoDB server over a TCP connection, and the server processes it: it writes the change into the in-memory pages, but those pages may not be written to disk for quite a while, depending on memory pressure. The server also records the operation in the journal. By default, when we make a database request through the MongoDB driver, we wait for the response - say, an acknowledged insert or update - but we do not wait for the journal to be written to disk. The value that controls whether we wait for the write to be acknowledged by the server is called w.
w = 1
j = false
By default w is set to 1, which means: wait for this server to acknowledge the write. By default j is false; j, which stands for journal, controls whether we wait for the journal to be written to disk before we continue.
So what are the implications of these defaults? When we do an update or insert, we are really doing the operation in memory, not necessarily on disk, which of course makes it very fast. Periodically (every few seconds) the journal gets written to disk. That interval is short, but during this window of vulnerability - when the data has been written into the server's in-memory pages but the journal has not yet been persisted to disk - a crash could lose the data. We also have to realize, as programmers, that just because the write came back as successful and was written to memory, it may never be persisted to disk if the server subsequently crashes.
Whether or not that is a problem depends on the application. For some applications with lots of writes logging small amounts of data, we might find it hard to even keep up with the data stream if we wait for the journal to be written to disk, because the disk is 100 to 1,000 times slower than memory for every single write. For other applications it may be completely necessary to wait for the write to be journaled, and to know it has been persisted to disk, before continuing. So it is really up to us.
The w and j values together are called the write concern. They can be set in the driver at the client, database, or collection level.
w = 1 : wait for the write to be acknowledged
w = 0 : do not wait for the write to be acknowledged
j = true : sync to the journal on disk before acknowledging
j = false : do not wait for the journal to be synced
There are other values for w as well, each with its own significance. With w=1 and j=true we can be sure those writes have been persisted to disk: if the writes have reached the journal and the server crashes, then even though the pages may not have been written back to disk yet, on recovery the mongod process can read the journal on disk and recreate all the writes that were not yet persisted to the pages. That is why this gives us a greater level of safety.
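As a minimal PyMongo sketch of these settings (the database and collection names are made up), the behaviour described above corresponds to the default write concern, and a stricter one can be attached per collection with with_options:

```python
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient()                    # hypothetical local server
orders = client.shop.orders               # made-up db/collection names

# w=1, j=false (the defaults described above): wait for the primary to
# acknowledge the write, but not for the journal to reach disk.
orders.insert_one({"sku": "abc", "qty": 2})

# Ask the server to wait until the operation is in the on-disk journal
# before acknowledging: slower per write, but survives a crash.
durable = orders.with_options(write_concern=WriteConcern(w=1, j=True))
durable.insert_one({"sku": "xyz", "qty": 1})

# Unacknowledged (w=0): fire-and-forget, fastest and least safe.
unsafe = orders.with_options(write_concern=WriteConcern(w=0))
unsafe.insert_one({"sku": "tmp", "qty": 9})
```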
I am using the flush_all command to delete all the key/value pairs on my Memcache server.
While the values do get deleted, there are two things I can't understand when looking at the Memcache server through phpMemcachedAdmin:
The memory usage is not reset to 0 after flushing everything. I still see 77% used and 22% wasted (just an example, but you get the idea). How is that possible?
All the previous slabs with their previous items are still there. For example, looking at a specific slab still shows all the previous key/value pairs, despite the flush_all command. How is that possible?
Thanks
This happens because memcached expires items lazily, on read rather than on write. flush_all is designed for performance: it simply records the flush time, so anything read after that point is treated as instantly expired, even though it is still physically in the cache. The server just updates a single number and then checks it on each fetch.
This optimization significantly improves memcached performance. Imagine several other clients searching or inserting at the same time as a flush.
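To see why the memory usage and slab contents look unchanged right after a flush, here is a toy Python model of that lazy design; it is illustrative only, not memcached's actual implementation:

```python
import time

# Toy model of lazy, flush-on-read expiry (illustrative only, not
# memcached's real code): flush_all just records a timestamp, and get()
# compares each item's store time against it.
class ToyCache:
    def __init__(self):
        self._items = {}          # key -> (value, stored_at)
        self._flushed_at = 0.0    # the single number flush_all updates

    def set(self, key, value):
        self._items[key] = (value, time.time())

    def flush_all(self):
        # O(1): nothing is deleted, so memory usage and slab contents
        # look exactly as they did before the flush.
        self._flushed_at = time.time()

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        value, stored_at = item
        if stored_at <= self._flushed_at:
            del self._items[key]  # reclaimed lazily, only when touched
            return None
        return value

cache = ToyCache()
cache.set("greeting", "hello")
cache.flush_all()
print(cache.get("greeting"))      # None, yet the memory still looked
                                  # "used" until this very read
```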
I have a program that routinely uses massive arrays, where the memory is allocated using mmap.
Does anyone know the typical overhead of allocating address space in large amounts before the memory is committed, either when allocating with MAP_NORESERVE or when backing the space with a sparse file? It strikes me that mmap can't be free, since it must make page table entries for the allocated space. I want to have some idea of this overhead before implementing an algorithm I'm considering.
Obviously the answer is going to be platform dependent; I'm most interested in x64 Linux, SPARC Solaris and SPARC Linux. I'm thinking that the availability of 1 MB pages makes the overhead rather lower on SPARC than on x64.
The overhead of mmap depends on how you use it, and it is typically negligible when used appropriately.
In the Linux kernel, an mmap operation can be divided into two parts:
look for a free address range that can hold the mapping
create or enlarge a vma struct in the address space (mm_struct)
So allocating a large amount of memory with mmap does not introduce more overhead than allocating a small amount.
You should therefore allocate as much memory as possible in each call (avoid multiple small mmaps).
You may also provide the start address explicitly (if possible); this can save the kernel some time looking for a large enough free range.
If your application is multi-threaded, you should avoid concurrent calls to mmap, because the address space is protected by a reader-writer lock and mmap always takes the writer lock. Under contention, mmap latency can be orders of magnitude greater.
Moreover, mmap only creates the mapping, not the page table entries. Pages are allocated in the page fault handler when they are first touched. The page fault handler takes the reader lock that protects the address space, so faulting can also affect mmap performance.
For this reason, you should try to reuse your large array instead of munmapping it and mmapping it again (this avoids repeated page faults).
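As a rough illustration of that last point, here is a Python sketch with made-up sizes (standing in for whatever language the program actually uses): map one large region once and reuse it across passes instead of mapping and unmapping each time:

```python
import mmap

# Sketch of that advice: reserve one large anonymous mapping once and
# reuse it, rather than calling mmap/munmap on every pass.
BUF_SIZE = 256 * 1024 * 1024              # 256 MiB of address space

buf = mmap.mmap(-1, BUF_SIZE)             # one mmap call; no pages are
                                          # allocated until first touched

def process(pass_data: bytes) -> int:
    # Writing into the buffer is what actually faults pages in; because
    # the mapping is reused, those faults happen only on the first pass.
    buf[:len(pass_data)] = pass_data
    return sum(buf[:len(pass_data)])

chunk = bytes(1024 * 1024)                # 1 MiB made-up workload
for _ in range(100):
    process(chunk)                        # reuse the same mapping each time

buf.close()
```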