Does it make any sense to flush the CPU cache manually if it is implemented as a write-through cache?
When a word is written at a write-through (WT) cache by a store instruction, it is also sent to the following level of the memory hierarchy (see cache entry at wikipedia). Hence, cache blocks at the WT cache are clean, that is, are coherent with their copies at the next level, and write-backs would not be necessary.
WT invalidations could be required when Direct Memory Accesses (DMA) make cache contents stale but, as far as I know, these are not manual operations; they are driven by the OS or the hardware.
Regarding manual flushing, the Intel Architecture Software Developer’s Manual (Volume 2, Instruction Set Reference) says, for example:
WBINVD — Write Back and Invalidate Cache This instruction writes back all modified cache lines in the processor’s internal cache to main memory and invalidates (flushes) the internal caches.
So I think that, in case of a WT cache, this instruction just invalidates all the cache lines.
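To illustrate the reasoning above, here is a minimal toy model in Python. It is only a sketch: the class name `WriteThroughCache`, its methods, and the `wbinvd` helper are hypothetical names chosen for illustration, and the model ignores line sizes, associativity, and real hardware behaviour. It shows that a write-through cache never holds dirty lines, so a "write back and invalidate" operation has nothing to write back.

```python
class WriteThroughCache:
    """Toy model: every store updates both the cache line and the next level."""

    def __init__(self, backing):
        self.backing = backing   # dict standing in for the next memory level
        self.lines = {}          # address -> cached value

    def load(self, addr):
        if addr not in self.lines:           # miss: fill from the next level
            self.lines[addr] = self.backing[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value
        self.backing[addr] = value           # write-through: propagate at once

    def dirty_lines(self):
        # A line is dirty if it differs from the copy at the next level.
        return [a for a, v in self.lines.items() if self.backing.get(a) != v]

    def wbinvd(self):
        """Write back dirty lines (none exist here), then invalidate all lines."""
        written_back = 0
        for addr in self.dirty_lines():
            self.backing[addr] = self.lines[addr]
            written_back += 1
        self.lines.clear()
        return written_back


memory = {0x10: 1, 0x20: 2}
cache = WriteThroughCache(memory)
cache.load(0x10)
cache.store(0x20, 99)
```

In this model `cache.dirty_lines()` is always empty, so `cache.wbinvd()` reduces to a pure invalidation, which matches the conclusion above.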
In the middle of this page (https://github.com/ultraembedded/riscv) there is a block diagram of the core. I really do not know what the TCM is doing in the same block as the Icache. Is it an optional thing to have inside the CPU?
Some embedded systems provide dedicated memory for code and/or for data. On some of these systems, Tightly-Coupled Memory serves as a replacement for the (instruction) cache, while on other such systems this memory is in addition to and alongside a cache, applying to a certain portion of the address space. This dedicated memory may be on the chip of the processor.
This memory could be some kind of ROM or other memory that is initialized somehow prior to boot. In any case, TCM typically isn't backed by main memory, so it doesn't suffer cache misses or need the associated circuitry. It usually also has high performance, like a cache when a hit occurs.
Some systems refer to this as Instruction Tightly Integrated Memory, ITIM, or Data Tightly Integrated Memory, DTIM.
When a system uses ITIM or DTIM, it performs more like a Harvard architecture than the Modified Harvard architecture of laptops and desktops.
The cache has no address space. The CPU does not ask the cache for data; it just asks for data, and the memory controller first checks whether the data is present in the cache. If it is, the data is fetched from the cache; if not, the controller goes to RAM. All the processor does is ask for data; it does not care where the data comes from. In the case of TCM, the CPU can directly write data to the TCM and read data from it, since the TCM has its own specific addresses. Think of TCM as RAM that is close to the CPU.
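The difference can be sketched with a small Python model. Everything here is an assumption made for illustration: the address window `TCM_BASE`/`TCM_SIZE`, the class `MemorySystem`, and the write-through behaviour of the toy cache do not come from the linked core. The point is only that a TCM access is selected by address, while the cache is consulted transparently for every other address.

```python
TCM_BASE, TCM_SIZE = 0x0000_0000, 0x1000   # hypothetical TCM address window


class MemorySystem:
    def __init__(self):
        self.tcm = bytearray(TCM_SIZE)  # addressable on-chip memory
        self.ram = {}                   # main memory (address -> byte)
        self.cache = {}                 # transparent cache, no addresses of its own
        self.ram_accesses = 0

    def _in_tcm(self, addr):
        return TCM_BASE <= addr < TCM_BASE + TCM_SIZE

    def read(self, addr):
        if self._in_tcm(addr):
            return self.tcm[addr - TCM_BASE]   # direct access, never a miss
        if addr not in self.cache:             # cache miss: go to RAM
            self.ram_accesses += 1
            self.cache[addr] = self.ram.get(addr, 0)
        return self.cache[addr]

    def write(self, addr, value):
        if self._in_tcm(addr):
            self.tcm[addr - TCM_BASE] = value
            return
        self.cache[addr] = value
        self.ram[addr] = value                 # toy write-through policy


system = MemorySystem()
system.write(0x10, 42)       # lands in the TCM because of its address
system.ram[0x8000] = 7       # data that lives in main memory
first = system.read(0x8000)  # miss, fetched from RAM
second = system.read(0x8000) # hit, served by the cache
```

The program never names the cache; it only names addresses, and the address alone decides whether the TCM or the cached path is used.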
Oracle has this concept of direct reads, where a session reads data from a table directly into its session memory, bypassing the buffer cache. Is something similar possible in Postgres? Does a session always get data from shared buffers?
You are mixing up two things.
the kernel cache where the kernel caches files to serve reads and writes more efficiently
the database shared memory cache (shared buffers) where the database caches table and index blocks
All databases use the latter (Oracle calls it “database buffer cache”), because without caching performance would be abysmal.
With direct I/O you avoid the kernel cache, that is, all read and write requests go directly to disk.
There is no way in PostgreSQL to use direct I/O.
However, it has been recognized that buffered I/O comes with its own set of problems (e.g., a write request may succeed, a sync request that tells the kernel to flush the data to disk may fail, but the next sync request for the same (unpersisted!) data may not return an error any more). Relevant people hold the opinion that it might be a good idea to move to direct I/O eventually to avoid having to deal with such problems, but that would be a major change, and I wouldn't hold my breath until it happens.
Does cache memory refresh itself if it doesn't encounter any instruction for a threshold amount of time?
What I mean is this: suppose I have a multi-core machine with isolated cores. Now, on one of the cores, there was no activity for, say, a few seconds. In this case, will the last instructions in the instruction cache be flushed after a certain amount of time has passed?
I understand this can be architecture dependent but I am looking for general pointers on the concept.
If a cache is power-gated in a particular idle state and if it's implemented using a volatile memory technology (such as SRAM), the cache will lose its contents. In this case, to maintain the architectural state, all dirty lines must be written to some memory structure that will retain its state (such as the next level of the memory hierarchy). Most processors support power-gating idle states. For example, on Intel processors, in the core C6 and deeper states, the core is fully power-gated including all private caches. When the core wakes up from any of these states, the caches will be cold.
It can be useful in an idle state, for the purpose of saving power, to flush a cache but not power-gate it. The ACPI specification defines such a state, called C3, in Section 8.1.4 (of version 6.3):
While in the C3 state, the processor’s caches maintain state but the
processor is not required to snoop bus master or multiprocessor CPU
accesses to memory.
Later in the same section it elaborates that C3 doesn't require preserving the state of caches, but also doesn't require flushing them. Essentially, a core in ACPI C3 doesn't guarantee cache coherence. In an implementation of ACPI C3, either the system software would be required to manually flush the cache before having a core enter C3, or the hardware would employ some mechanism to ensure coherence (flushing is not the only way). This idle state can potentially save more power than shallower states because the core doesn't have to take part in cache coherence.
To the best of my knowledge, the only processors that implement a non-power-gating version of ACPI C3 are those from Intel, starting with the Pentium II. All existing Intel x86 processors can be categorized according to how they implement ACPI C3:
Intel Core and later and Bonnell and later: The hardware state is called C3. The implementation uses multiple power-reduction mechanisms. The one relevant to the question flushes all the core caches (instruction, data, uop, paging unit), probably by executing a microcode routine on entry to the idle state. That is, all dirty lines are written back to the closest shared level of the memory hierarchy (L2 or L3) and all valid clean lines are invalidated. This is how cache coherency is maintained. The rest of the core state is retained.
Pentium II, Pentium III, Pentium 4, and Pentium M: The hardware state is called Sleep in these processors. In the Sleep state, the processor is fully clock-gated and doesn't respond to snoops (among other things). On-chip caches are not flushed and the hardware doesn't provide an alternative mechanism that protects the valid lines from becoming incoherent. Therefore, the system software is responsible for ensuring cache coherence. Otherwise, Intel specifies that if a snoop request occurs to a processor that is transitioning into or out of Sleep or already in Sleep, the resulting behavior is unpredictable.
All others don't support ACPI C3.
Note that clock-gating saves power by:
Turning off the clock generation logic, which itself consumes power.
Turning off any logic that does something on each clock cycle.
With clock-gating, dynamic power is reduced to essentially zero. But static power is still consumed to maintain state in the volatile memory structures.
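As a rough sketch of why this is so, the usual first-order CMOS power model splits power into a switching term and a leakage term. This is a textbook approximation, not something specific to any of the processors above, and the symbols are introduced here only for illustration:

$$P_{\text{total}} \approx \underbrace{\alpha \, C \, V_{dd}^{2} \, f}_{\text{dynamic}} + \underbrace{V_{dd} \, I_{\text{leak}}}_{\text{static}}$$

Here $\alpha$ is the switching activity factor, $C$ the switched capacitance, $V_{dd}$ the supply voltage, $f$ the clock frequency, and $I_{\text{leak}}$ the leakage current. Clock-gating drives the switching activity (effectively $\alpha f$) toward zero, so the first term vanishes, while the second term remains as long as $V_{dd}$ is applied. Power-gating removes $V_{dd}$ and with it the second term, at the cost of losing the contents of volatile structures such as SRAM.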
Many processors include at least one level of on-chip cache that is shared between multiple cores. The processors branded Core Solo and Core Duo (whether based on the Enhanced Pentium M or Core microarchitectures) introduced an idle state that implements ACPI C3 at the package level, where the shared cache may be gradually power-gated and restored (Intel's package-level states correspond to system-level states in the ACPI specification). This hardware state is called PC7, Enhanced Deeper Sleep State, Deep C4, or other names depending on the processor. The shared cache is much larger than the private caches, and so it would take much more time to fully flush. This can reduce the effectiveness of PC7. Therefore, it's flushed gradually (the last core of the package that enters CC7 performs this operation). In addition, when the package exits PC7, the shared cache is enabled gradually as well, which may reduce the cost of entering PC7 next time. This is the basic idea, but the details depend on the processor. In PC7, significant portions of the package are power-gated.
It depends on what you mean by "idle" - specifically whether being "idle" involves the cache being powered or not.
Caches usually consist of registers comprising cells of SRAM, which preserve the data stored in them as long as the cells are powered (in contrast to DRAM, which needs to be periodically refreshed). Peter alluded to this in his comment: if power is cut off, not even an SRAM cell can maintain its state and data is lost.
In the book "Computer Architecture" by Hennessy/Patterson, 5th ed., on page 360 they describe the MSI protocol and write something like:
If the line is in state "Exclusive" (Modified), then on receiving "Write Miss" from the bus the current CPU 1) writes back the line into the bus, and then 2) goes into "Invalid" state.
Why do we need to write-back the line, if it will be overwritten anyway by the successive write by the other CPU?
Is it connected with the fact that every CPU should see the same writes? (But I don't see why it would be a problem for some other CPU not to see this particular write.)
Here is the protocol from their book (question in green, in purple it is clear: we need to write-back in order to supply the line to requesting CPU):
Writing back the modified data to memory is not strictly necessary in an MSI protocol. The state diagrams also seem to assume a system with low-cost memory access (data is supplied by memory even when found in the shared state in another cache) and a shared bus connecting to the memory interface.
However, the modified data cannot simply be dropped as in shared state since the requesting processor might only be modifying part of the cache block (e.g., only one byte). Whatever portions of the block that are not modified by the requesting processor must still be available either in memory or at the requesting processor (the other processor has already invalidated its copy). With a shared bus and low cost memory access the cost difference of adding a write-back to memory over just communicating the data to the other processor is small.
In addition, even on a word-addressed system with word-sized cache blocks, keeping the old data available allows the write miss request to be sent speculatively (as with out-of-order execution or prefetch-for-write) without correctness issues.
(Per-byte tracking of modified [or valid as a superset of modified] state would allow some data communication to be avoided at the cost of extra state bits and a more complex communication system.)
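The partial-write argument above can be made concrete with a toy Python model. This is only a sketch under stated assumptions: the `Cache` class, the `write_miss` function, the 4-byte block, and the `write_back` switch are all hypothetical, and the requester always fills its block from memory (as the state diagrams seem to assume). It is not the protocol from the book, just an illustration of what is lost if the Modified block is simply dropped.

```python
class Cache:
    def __init__(self, name):
        self.name = name
        self.state = "I"     # one of "M", "S", "I"
        self.data = None     # contents of the single block we model


def write_miss(requester, owner, memory, offset, value, write_back=True):
    """Requester writes one byte of a block that owner may hold in Modified state."""
    if owner.state == "M":
        if write_back:
            memory[:] = owner.data           # owner writes its modified block back
        owner.state, owner.data = "I", None  # owner invalidates its copy
    requester.data = bytearray(memory)       # requester fills the block from memory
    requester.data[offset] = value           # then modifies only one byte
    requester.state = "M"
    return bytes(requester.data)


def demo(write_back):
    memory = bytearray(b"\x00\x00\x00\x00")               # stale copy in memory
    a, b = Cache("A"), Cache("B")
    a.state, a.data = "M", bytearray(b"\x01\x02\x03\x04") # A modified the block earlier
    return write_miss(b, a, memory, offset=0, value=0xFF, write_back=write_back)


with_write_back = demo(write_back=True)
without_write_back = demo(write_back=False)
```

With the write-back, B ends up with its new byte plus A's three other bytes; without it, B's block contains stale zeros in the bytes it did not write, so A's earlier stores are lost.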
Or it would be faster to re-read that data from mapped memory once again, since the OS might implement its own cache?
The nature of data is not known in advance, it is assumed that file reads are random.
I wanted to mention a few things I've read on the subject. The answer is no, you don't want to second-guess the operating system's memory manager.
The first comes from the idea that you want your program (e.g. MongoDB, SQL Server) to try to limit your memory based on a percentage of free RAM:
Don't try to allocate memory until there is only x% free
Occasionally, a customer will ask for a way to design their program so it continues consuming RAM until there is only x% free. The idea is that their program should use RAM aggressively, while still leaving enough RAM available (x%) for other use. Unless you are designing a system where you are the only program running on the computer, this is a bad idea.
(read the article for the explanation of why it's bad, including pictures)
Next come some notes from the author of Varnish, a reverse proxy:
Varnish Cache - Notes from the architect
So what happens with squids elaborate memory management is that it gets into fights with the kernels elaborate memory management, and like any civil war, that never gets anything done.
What happens is this: Squid creates a HTTP object in "RAM" and it gets used some times rapidly after creation. Then after some time it get no more hits and the kernel notices this. Then somebody tries to get memory from the kernel for something and the kernel decides to push those unused pages of memory out to swap space and use the (cache-RAM) more sensibly for some data which is actually used by a program. This however, is done without squid knowing about it. Squid still thinks that these http objects are in RAM, and they will be, the very second it tries to access them, but until then, the RAM is used for something productive.
Imagine you do cache something from a memory-mapped file. At some point in the future that memory holding that "cache" will be swapped out to disk.
the OS has written to the hard-drive something which already exists on the hard drive
Next comes a time when you want to perform a lookup from your "cache" memory, rather than the "real" memory. You attempt to access the "cache", and since it has been swapped out of RAM the hardware raises a PAGE FAULT, and cache is swapped back into RAM.
your cache memory is just as slow as the "real" memory, since both are no longer in RAM
Finally, you want to free your cache (perhaps your program is shutting down). If the "cache" has been swapped out, the OS must first swap it back in so that it can be freed. If instead you just unmapped your memory-mapped file, everything is gone (nothing needs to be swapped in).
in this case your cache makes things slower
Again from Raymond Chen: If your application is closing - close already:
When DLL_PROCESS_DETACH tells you that the process is exiting, your best bet is just to return without doing anything
I regularly use a program that doesn't follow this rule. The program
allocates a lot of memory during the course of its life, and when I
exit the program, it just sits there for several minutes, sometimes
spinning at 100% CPU, sometimes churning the hard drive (sometimes
both). When I break in with the debugger to see what's going on, I
discover that the program isn't doing anything productive. It's just
methodically freeing every last byte of memory it had allocated during
its lifetime.
If my computer wasn't under a lot of memory pressure, then most of the
memory the program had allocated during its lifetime hasn't yet been
paged out, so freeing every last drop of memory is a CPU-bound
operation. On the other hand, if I had kicked off a build or done
something else memory-intensive, then most of the memory the program
had allocated during its lifetime has been paged out, which means that
the program pages all that memory back in from the hard drive, just so
it could call free on it. Sounds kind of spiteful, actually. "Come
here so I can tell you to go away."
All this anal-rententive memory management is pointless. The process
is exiting. All that memory will be freed when the address space is
destroyed. Stop wasting time and just exit already.
The reality is that programs no longer run in "RAM", they run in memory - virtual memory.
You can make use of a cache, but you have to work with the operating system's virtual memory manager:
you want to keep your cache within as few pages as possible
you want to ensure they stay in RAM, by the virtue of them being accessed a lot (i.e. actually being a useful cache)
Accessing:
a thousand 1-byte locations around a 400GB file
is much more expensive than accessing
a single 1000-byte location in a 400GB file
In other words: you don't really need to cache data, you need a more localized data structure.
If you keep your important data confined to a single 4k page, you will play much nicer with the VMM; Windows is your cache.
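The locality argument can be checked with a little arithmetic. The sketch below assumes 4 KiB pages and evenly spread 1-byte accesses; the function `pages_touched` and the constants are names made up for this illustration. It counts how many distinct pages each access pattern forces the virtual memory manager to bring in.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def pages_touched(accesses, page_size=PAGE_SIZE):
    """Count distinct pages covered by a list of (offset, length) accesses."""
    pages = set()
    for offset, length in accesses:
        first = offset // page_size
        last = (offset + length - 1) // page_size
        pages.update(range(first, last + 1))
    return len(pages)


FILE_SIZE = 400 * 10**9                 # a 400GB file
stride = FILE_SIZE // 1000              # spread 1000 accesses evenly over the file

scattered = [(i * stride, 1) for i in range(1000)]  # a thousand 1-byte locations
contiguous = [(0, 1000)]                            # a single 1000-byte location

scattered_pages = pages_touched(scattered)
contiguous_pages = pages_touched(contiguous)
```

Under these assumptions the scattered pattern touches 1000 different pages, each a potential page fault, while the contiguous pattern touches a single page.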
When you add 64-byte quad-word aligned cache-lines, there's even more incentive to adjust your data structure layout. But then you don't want it too compact, or you'll start suffering performance penalties of cache flushes from False Sharing.
The answer is highly OS-specific. Generally speaking, there is no sense in caching this data. Both the "cached" data and the memory-mapped data can be paged out at any time.
If there is any difference, it will be specific to the OS. Unless you need that granularity, there is no sense in caching the data.
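As a minimal sketch of the "just read the mapping again" approach, the Python snippet below maps a temporary file and reads the same record twice straight from the mapping, without keeping a private copy. The helper `read_record`, the file contents, and the offsets are assumptions made for the example; how the repeated read is actually served depends on the OS.

```python
import mmap
import os
import tempfile


def read_record(mapped, offset, length):
    # Slice the mapping directly; repeated reads are left to the OS to serve.
    return mapped[offset:offset + length]


# Create a small 16 KiB file to stand in for the real data file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(range(256)) * 64)
    path = f.name

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    first = read_record(m, 4096, 4)   # may fault the page in
    again = read_record(m, 4096, 4)   # same bytes, no application-level cache

os.remove(path)
```

There is no second copy to keep consistent or to free on exit; unmapping the file is all the cleanup that is needed.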