What happens when a core writes to its L1 cache while another core holds the same line in its L1 too? - cpu-cache

What happens when a core writes to its L1 cache while another core holds the same line in its L1 too?
Let's say for an Intel Skylake CPU.
How does the cache system preserve consistency? Does it update in real time, does it stop one of the cores?
What's the performance cost of continuously writing to the same cache line with two cores?

In general, modern CPUs use some variant [1] of the MESI protocol to preserve cache coherency.
In your scenario of an L1 write, the details depend on the existing state of the cache lines: is the cache line already in the cache of the writing core? In the other core, in what state is the cache line, e.g., has it been modified?
Let's take the simple case where the line isn't already in the writing core (C1), and it is in the "exclusive" state in the other core (C2). At the point where the address for the write is known, C1 will issue an RFO (request for ownership) transaction onto the "bus" with the address of the line, and the other cores will snoop the bus and notice the transaction. The other core that has the line will then transition its line from the exclusive to the invalid state, and the value of the line will be provided to the requesting core, which will have it in the modified state, at which point the write can proceed.
Note that at this point, further writes to that line from the writing core proceed quickly, since it is in the M state which means no bus transaction needs to take place. That will be the case until the line is evicted or some other core requests access.
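To make the transition above concrete, here is a minimal sketch in Python of a single cache line tracked across cores. The names (State, Core, write) are made up for illustration, and the model deliberately ignores the L3, snoop filters, and timing; it only shows which writes need a bus transaction and which do not.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Core:
    """Holds the coherence state of the one cache line we are tracking."""

    def __init__(self, name):
        self.name = name
        self.state = State.I
        self.value = None


def write(writer, others, new_value, memory):
    """Write to the tracked line; return True if a bus transaction was needed.

    Simplification: a writer in S would really send an invalidate/upgrade
    rather than a full RFO; both are counted as "a bus transaction" here.
    """
    needs_bus = writer.state in (State.S, State.I)
    if needs_bus:
        for other in others:
            if other.state is State.M:
                # A dirty copy must be preserved before it is invalidated.
                memory["line"] = other.value
            other.state = State.I
            other.value = None
    # From E or M the write is silent: no other core holds the line.
    writer.state = State.M
    writer.value = new_value
    return needs_bus


memory = {"line": 0}
c1, c2 = Core("C1"), Core("C2")
c2.state, c2.value = State.E, 0          # C2 holds the line exclusively

first = write(c1, [c2], 1, memory)       # needs an RFO, C2 goes to Invalid
second = write(c1, [c2], 2, memory)      # C1 is already in M, no bus traffic
```

After the first write C2 is in the invalid state and C1 is in the modified state, so the second write completes without any coherence traffic.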
Now, there are a lot of additional details in actual implementations which aren't covered above or even in the Wikipedia description of the protocol.
For example, the basic model involves a single private cache per CPU, and shared main memory. In this model, core C2 would usually provide the value of the shared line onto the bus, even though it has not modified it, since that would be much faster than waiting to read the value from main memory. In all recent x86 implementations, however, there is a shared last-level L3 cache which sits between all the private L2 and L1 caches and main memory. This cache has typically been inclusive so it can provide the value directly to C1 without needing to do a cache-to-cache transfer from C2. Furthermore, having this shared cache means that each CPU may not actually need to snoop the "bus" since the L3 cache can be consulted first to determine which, if any, cores actually have the line. Only the cores that have the line will then be asked to make a state transition. Kind of a push model rather than pull.
Despite all these implementation details, the basics are the same: each cache line has some "per core" state (even though this state may be stored or duplicated in some central place like the LLC), and this state atomically undergoes logical transitions that ensure that the cache line remains consistent at all times.
Given that background, here are some specific answers to your final two sub-questions:
Does it update in real time, does it stop one of the cores?
Any modern core is going to do this in real time, and also in parallel for different cache lines. That doesn't mean it is free! For example, in the description above, the write by C1 is stalled until the cache coherence protocol is complete, which is likely dozens of cycles. Contrast that with a normal write, which takes only a couple of cycles. There are also possible bandwidth issues: the requests and responses used to implement the protocol use shared resources that may have a maximum throughput; if the rate of coherence transactions passes some limit, all requests may slow down even if they are independent.
In the past, when there was truly a shared bus, there may have been some partial "stop the world" behavior in some cases. For example, the lock prefix for x86 atomic instructions is apparently named based on the lock signal that a CPU would assert on the bus while it was doing an atomic transaction. During that entire period other CPUs are not able to fully use the bus (but presumably they could still proceed with CPU-local instructions).
What's the performance cost of continuously writing to the same cache line with two cores?
The cost is very high because the line will continuously ping-pong between the two cores as described above (at the end of the process described, just reverse the roles of C1 and C2 and restart). The exact details vary a lot by CPU and even by platform (e.g., a 2-socket configuration will change this behavior a lot), but basically you are probably looking at a penalty of 10s of cycles per write versus a not-shared throughput of 1 write per cycle.
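As a rough illustration of why the ping-pong pattern hurts, the sketch below counts how many times ownership of the line has to move for a given order of writes. The function names and the cycle costs (1 cycle for a hit, 40 cycles for a transfer) are placeholders chosen for the example, not measurements of any real CPU.

```python
def count_ownership_transfers(writers):
    """Count how often the line must change owner.

    `writers` lists core ids in the order their writes reach the coherence
    fabric. Each change of owner stands for one RFO round trip.
    """
    owner = None
    transfers = 0
    for core in writers:
        if core != owner:
            transfers += 1
            owner = core
    return transfers


def estimated_cycles(writers, hit_cycles=1, transfer_cycles=40):
    """Toy cost model: writes that hit in M are cheap, transfers are not.

    Both cycle counts are assumed placeholder values.
    """
    writers = list(writers)
    transfers = count_ownership_transfers(writers)
    hits = len(writers) - transfers
    return hits * hit_cycles + transfers * transfer_cycles


alternating = ["C1", "C2"] * 4      # two cores hammering the same line
private = ["C1"] * 8                # one core writing its own line

ping_pong_cost = estimated_cycles(alternating)
private_cost = estimated_cycles(private)
```

With perfectly alternating writers every single write pays for a transfer, whereas the single-writer case pays for only the first one.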
You can find some specific numbers in the answers to this question which covers both the "two threads on the same physical core" case and the "separate cores" case.
If you want more details about specific performance scenarios, you should probably ask a separate specific question that lays out the particular behavior you are interested in.
[1] The variations on MESI often introduce new states, such as the "Owned" state in MOESI or the "Forward" state in MESIF. The idea is usually to make certain transitions or usage patterns more efficient than the plain MESI protocol, but the basic idea is largely the same.

Related

Does Cache empty itself if idle for a long time?

Does cache memory refresh itself if it doesn't encounter any instruction for a threshold amount of time?
What I mean is this: suppose I have a multi-core machine with an isolated core on it. Now, for one of the cores, there was no activity for, say, a few seconds. In this case, will the last instructions from the instruction cache be flushed after a certain amount of time has passed?
I understand this can be architecture dependent but I am looking for general pointers on the concept.
If a cache is power-gated in a particular idle state and if it's implemented using a volatile memory technology (such as SRAM), the cache will lose its contents. In this case, to maintain the architectural state, all dirty lines must be written to some memory structure that will retain its state (such as the next level of the memory hierarchy). Most processors support power-gating idle states. For example, on Intel processors, in the core C6 and deeper states, the core is fully power-gated including all private caches. When the core wakes up from any of these states, the caches will be cold.
It can be useful in an idle state, for the purpose of saving power, to flush a cache but not power-gate it. The ACPI specification defines such a state, called C3, in Section 8.1.4 (of version 6.3):
While in the C3 state, the processor’s caches maintain state but the
processor is not required to snoop bus master or multiprocessor CPU
accesses to memory.
Later in the same section it elaborates that C3 doesn't require preserving the state of caches, but also doesn't require flushing it. Essentially, a core in ACPI C3 doesn't guarantee cache coherence. In an implementation of ACPI C3, either the system software would be required to manually flush the cache before having a core enter C3 or the hardware would employ some mechanism to ensure coherence (flushing is not the only way). This idle state can potentially save more power compared to shallower states by not having to engage in cache coherence.
To the best of my knowledge, the only processors that implement a non-power-gating version of ACPI C3 are those from Intel, starting with the Pentium II. All existing Intel x86 processors can be categorized according to how they implement ACPI C3:
Intel Core and later and Bonnell and later: The hardware state is called C3. The implementation uses multiple power-reduction mechanisms. The one relevant to the question flushes all the core caches (instruction, data, uop, paging unit), probably by executing a microcode routine on entry to the idle state. That is, all dirty lines are written back to the closest shared level of the memory hierarchy (L2 or L3) and all valid clean lines are invalidated. This is how cache coherency is maintained. The rest of the core state is retained.
Pentium II, Pentium III, Pentium 4, and Pentium M: The hardware state is called Sleep in these processors. In the Sleep state, the processor is fully clock-gated and doesn't respond to snoops (among other things). On-chip caches are not flushed and the hardware doesn't provide an alternative mechanism that protects the valid lines from becoming incoherent. Therefore, the system software is responsible for ensuring cache coherence. Otherwise, Intel specifies that if a snoop request occurs to a processor that is transitioning into or out of Sleep or already in Sleep, the resulting behavior is unpredictable.
All others don't support ACPI C3.
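The flush performed on entry to the hardware C3 state in Intel Core and later processors (dirty lines written back, clean lines invalidated) can be pictured with the following sketch. The dictionaries and the function name are invented for illustration; real hardware does this with a microcode routine or dedicated logic, not with anything resembling this code.

```python
def flush_private_cache(private_cache, shared_level):
    """Write back dirty lines, then invalidate every line.

    private_cache: dict mapping line address -> (data, dirty_flag)
    shared_level:  dict mapping line address -> data; stands in for the
                   closest shared level (L2 or L3)
    Returns the number of lines that had to be written back.
    """
    written_back = 0
    for address, (data, dirty) in private_cache.items():
        if dirty:
            shared_level[address] = data
            written_back += 1
    # Clean lines need no write-back; all lines become invalid.
    private_cache.clear()
    return written_back


l1 = {0x40: ("new", True), 0x80: ("same", False)}
l3 = {0x40: "old", 0x80: "same"}
lines_written = flush_private_cache(l1, l3)
```

Once the private cache is empty, the core no longer has to respond to snoops for any line, which is what keeps the idle state coherent.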
Note that clock-gating saves power by:
Turning off the clock generation logic, which itself consumes power.
Turning off any logic that does something on each clock cycle.
With clock-gating, dynamic power is reduced to essentially zero. But static power is still consumed to maintain state in the volatile memory structures.
Many processors include at least one level of on-chip cache that is shared between multiple cores. The processors branded Core Solo and Core Duo (whether based on the Enhanced Pentium M or Core microarchitectures) introduced an idle state that implements ACPI C3 at the package level, where the shared cache may be gradually power-gated and restored (Intel's package-level states correspond to system-level states in the ACPI specification). This hardware state is called PC7, Enhanced Deeper Sleep State, Deep C4, or other names depending on the processor. The shared cache is much larger compared to the private caches, and so it would take much more time to fully flush. This can reduce the effectiveness of PC7. Therefore, it's flushed gradually (the last core of the package that enters CC7 performs this operation). In addition, when the package exits PC7, the shared cache is enabled gradually as well, which may reduce the cost of entering PC7 next time. This is the basic idea, but the details depend on the processor. In PC7, significant portions of the package are power-gated.
It depends on what you mean by "idle" - specifically whether being "idle" involves the cache being powered or not.
Caches usually consist of registers comprising cells of SRAM, which preserve the data stored in them as long as the cells are powered (in contrast to DRAM, which needs to be periodically refreshed). Peter alluded to this in his comment: if power is cut off, not even an SRAM cell can maintain its state and data is lost.

What does a 'split' cache mean, and how is it useful (if it is)?

I was working on a Computer Architecture question in which it was mentioned that the cache is a split cache, with no hazards. What exactly does this mean?
A summary and additional discussion can be found at: L1 caches usually have split design, but L2, L3 caches have unified design, why?.
Introduction
A split cache is a cache that consists of two physically separate parts, where one part, called the instruction cache, is dedicated to holding instructions and the other, called the data cache, is dedicated to holding data (i.e., instruction memory operands). The instruction cache and the data cache together are logically considered to be a single cache, described as a split cache, because both are hardware-managed caches for the same physical address space at the same level of the memory hierarchy. Instruction fetch requests are handled only by the instruction cache and memory operand read and write requests are handled only by the data cache. A cache that is not split is called a unified cache.
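The routing rule in this definition can be sketched as follows. The class and the 64-byte line size are assumptions made for the example, and capacity limits and replacement are left out entirely; the point is only that fetches and data accesses never look in each other's structure.

```python
LINE_SIZE = 64  # bytes per cache line; an assumed value


class SplitCache:
    """Toy split L1: one structure for fetches, another for loads."""

    def __init__(self, memory):
        self.memory = memory      # line number -> line contents
        self.icache = {}          # lines brought in by instruction fetches
        self.dcache = {}          # lines brought in by data accesses

    @staticmethod
    def line_of(address):
        return address // LINE_SIZE

    def fetch_instruction(self, address):
        """Return True on an instruction cache hit; fill on a miss."""
        line = self.line_of(address)
        hit = line in self.icache
        if not hit:
            self.icache[line] = self.memory[line]
        return hit

    def load_data(self, address):
        """Return True on a data cache hit; fill on a miss."""
        line = self.line_of(address)
        hit = line in self.dcache
        if not hit:
            self.dcache[line] = self.memory[line]
        return hit


cache = SplitCache(memory={0: "line 0 contents"})
fetch_hit = cache.fetch_instruction(0)   # miss, fills the instruction cache
load_hit = cache.load_data(8)            # same line, but misses in the data cache
```

Note that after these two accesses the same line is present in both structures, because a fetch never fills the data cache and a load never fills the instruction cache.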
The Harvard vs. von Neumann architecture distinction originally applies to main memory. However, most modern computer systems implement the modified Harvard architecture whereby the L1 cache implements the Harvard architecture and the rest of the memory hierarchy implements the von Neumann architecture. Therefore, in modern systems, the Harvard vs. von Neumann distinction mostly applies to the L1 cache design. That's why the split cache design is also called the Harvard cache design and the unified cache design is also called von Neumann. The Wikipedia article on the modified Harvard architecture discusses three variants of the architecture, of which one is the split cache design.
To my knowledge, the idea of the split cache design was first proposed and evaluated by James Bell, David Casasent, and C. Gordon Bell in their paper entitled An Investigation of Alternative Cache Organizations, which was published in 1974 in the IEEE TC journal (the IEEE version is a bit clearer). The authors found using a simulator that, for almost all cache capacities considered in the study, an equal split results in the best performance (see Figure 5). From the paper:
Typically, the best performance occurs with half of the cache devoted
to instructions and half to data.
They also provided a comparison with a unified cache design of the same capacity and their initial conclusion was that the split design has no advantage over the unified design.
As shown in Fig. 6, the performance of the best dedicated cache CUXD
(half allotted to instructions and half to data) in general is quite
similar to that of a homogeneous cache (CUX); the extra complexity of
a dedicated cache control is thus not justifiable.
It's not clear to me actually whether the paper evaluated the split design or a cache that is partitioned between instructions and data. One paragraph says:
Thus far, the cache memory has been assumed to be composed of
homogeneous cells. But conceivably a functionally specialized
partitioning of the cache could give higher performance. For example,
perhaps a cache devoted exactly half to instructions and half to data
would be more effective than a homogeneous one; alternatively, one
that holds just instructions could be better than one holding just
data. To test these hypotheses, the effects of dividing the cache into
sections dedicated to specific uses were investigated.
(This paragraph was formatted automatically by https://www.textfixer.com/tools/remove-white-spaces.php.)
It seems to me that the authors are talking about both the split and partitioned designs. But it's not clear what design was implemented in the simulator and how the simulator was configured for evaluation.
Note that the paper didn't discuss why the split design may have a better or worse performance than the unified design. Also note how the authors used the terms "dedicated cache" and "homogeneous cache." The terms "split" and "unified" appeared in later works, which I believe were first used by Alan Jay Smith in Directions for memory hierarchies and their components: research and development in 1978. But I'm not sure because the way Alan used these terms gives the impression that they are already well-known. It appears to me from Alan's paper that the first processor that used the split cache design was the IBM 801 around 1975 and probably the second processor was the S-1 (around 1976). It's possible that the engineers of these processors might have come up with the split design idea independently.
Advantages of the Split Cache Design
The split cache design was then extensively studied in the next two decades. See, for example, Section 2.8 of this highly influential paper. But it was quickly recognized that the split design is useful for pipelined processors where the instruction fetch unit and the memory access unit are physically located in different parts of the chip. With the unified design, it is impossible to place the cache simultaneously close to the instruction fetch unit and the memory unit, resulting in high cache access latency from one or both units. The split design enables us to place the instruction cache close to the instruction fetch unit and the data cache close to the memory unit, thereby simultaneously reducing the latencies of both. (See what it looks like in the S-1 processor in Figure 3 of this document.) This is the primary advantage of the split design over the unified design. This is also the crucial difference between the split design and the unified design that supports cache partitioning. That's why it makes sense to have a split data cache, as proposed in several research works, such as Cache resident data locality analysis and Partitioned first-level cache design for clustered microarchitectures.
Another advantage of the split design is that it allows instruction and data accesses to occur in parallel without contention. Essentially, a split cache can have double the bandwidth of a unified cache. This improves performance in pipelined processors because instruction and data accesses can occur in the same cycle in different stages of the pipeline. Alternatively, the bandwidth of a unified cache can be doubled or improved using multiple access ports or multiple banks. In fact, using two ports provides twice the bandwidth to the whole cache (in contrast, in the split design, the bandwidth is split in half between the instruction cache and the data cache), but adding another port is more expensive in terms of area and power and may impact latency. A third alternative to improve the bandwidth is adding more wires to the same port so that more bits can be accessed in the same cycle, but this would probably be restricted to the same cache line (in contrast to the two other approaches). If the cache is off-chip, then the wires that connect it to the pipeline become pins and the impact of the number of wires on area, power, and latency becomes more significant.
In addition, processors that use a unified (L1) cache typically include arbitration logic that prioritizes data accesses over instruction accesses; this logic can be eliminated in the split design. (See the discussion on the Z80000 processor below for a unified design that avoids arbitration.) Similarly, if there is another cache level that implements the unified design, there will be a need for arbitration logic at the L2 unified cache. Simple arbitration policies may reduce performance and better policies may increase area. [TODO: Add examples of policies.]
Another potential advantage is that the split design allows us to employ different replacement policies for the instruction cache and data cache that may be more suitable for the access patterns of each cache. All Intel Itanium processors use the LRU policy for the L1I and the NRU policy for the L1D (I know for sure that this applies to the Itanium 2 and later, but I'm not sure about the first Itanium). Moreover, starting with Itanium 9500, the L1 ITLB uses NRU but the L1 DTLB uses LRU. Intel didn't disclose why they decided to use different replacement policies in these processors. In general, it seems to me that it's uncommon for the L1I and L1D to use different replacement policies. I couldn't find a single research paper on this (all papers on replacement policies focus only on data or unified caches). Even for a unified cache, it may be useful for the replacement policy to distinguish between instruction and data lines. In a split design, a cache line fetched into the data cache can never displace a line in the instruction cache. Similarly, a line filled into the instruction cache can never displace a line in the data cache. This issue, however, may occur in the unified design.
The last sub-section of the section on the differences between the modified Harvard architecture and Harvard and von Neumann in the Wikipedia article mentions that the Mark I machine uses different memory technologies for the instruction and data memories. This made me wonder whether this can constitute an advantage for the split design in modern computer systems. Here are some of the papers that show that this is indeed the case:
LASIC: Loop-Aware Sleepy Instruction Caches Based on STT-RAM Technology: The instruction cache is mostly read-only, except when there is a miss, in which case the line must be fetched and filled into the cache. This means that, when using STT-RAM (or really any other NVRAM technology), the expensive write operations occur less frequently compared to using STT-RAM for the data cache. The paper shows that by using an SRAM loop cache (like the LSD in Intel processors) and an STT-RAM instruction cache, energy consumption can be significantly reduced, especially when a loop is being executed that fits entirely in the loop cache. The non-volatile property of STT-RAM enables the authors to completely power-gate the instruction cache without losing its contents. In contrast, with an SRAM instruction cache, the static energy consumption is much larger, and power-gating it results in losing its contents. There is, however, a performance penalty with the proposed design (compared to a pure SRAM cache hierarchy).
Feasibility exploration of NVM based I-cache through MSHR enhancements: This paper also proposes using STT-RAM for the instruction cache while the data cache and the L2 cache remain based on SRAM. There is no loop cache here. This paper instead targets the high write latency issue of STT-RAM, which is incurred when a line is filled in the cache. The idea is that when a requested line is received from the L2 cache, the L1 cache first buffers the line in the MSHR allocated for its request. The MSHRs are still SRAM-based. Then the instruction cache line can be fed into the pipeline directly from the MSHR without having to potentially stall until it gets written in the STT-RAM cache. Similar to the previous work, the proposed architecture improves energy consumption at the expense of reduced performance.
System level exploration of a STT-MRAM based level 1 data-cache: Proposes using STT-RAM for the L1 data cache while keeping all other caches SRAM-based. This reduces area overhead and energy consumption, but performance is penalized.
Loop optimization in presence of STT-MRAM caches: A study of performance-energy tradeoffs: Compares the energy consumption and performance of pure (only SRAM or only STT-RAM) and hybrid (the L2 and instruction cache are STT-RAM-based) hierarchies. The hybrid cache hierarchy offers a performance-energy tradeoff that is in between the pure SRAM and pure STT-RAM hierarchies.
So I think we can say that one advantage of the split design is that we can use different memory technologies for the instruction and data caches.
There are two other advantages, which will be discussed later in this answer.
Disadvantages of the Split Cache Design
The split design has its problems, though. First, the combined space of the instruction and data caches may not be efficiently utilized. A cache line that contains both instructions and data may exist in both caches at the same time. In contrast, in a unified cache, only a single copy of the line would exist in the cache. In addition, the size of the instruction cache and/or the data cache may not be optimal for all applications or different phases of the same application. Simulations have shown that a unified cache of the same total size has a higher hit rate (see the VSC paper discussed later). This is the primary disadvantage of the split design. (If there is a placement contention on a single cache set in the split design, this contention may still occur in the unified design and it may have a worse impact on performance. In such a scenario, the split design would have a lower overall miss rate.)
Second, self-modifying code leads to consistency issues that need to be considered at the microarchitecture-level and/or software-level. (An inconsistency may be allowed between the two caches for a small number of cycles, but if the ISA does not allow such inconsistencies to be observable, they have to be detected before the instruction that got modified permanently changes the architectural state.) Maintaining instruction consistency requires more logic and has a higher performance impact in the split design than the unified one.
Third, the design and hardware complexity of a split cache compared against a single-ported unified cache, a fully dual-ported unified cache, and a dual-ported banked cache of the same overall organization parameters is an important consideration. According to the cache area model proposed in CACTI 3.0: An Integrated Cache Timing, Power, and Area Model, the fully dual-ported design has the biggest area. This holds true irrespective of the types of the two ports (exclusive-read, exclusive-write, read/write). The dual-ported banked cache has a higher area than the single-ported unified cache. How these two compare against split is less obvious to me. My understanding is that the split design has a higher area than the single-ported unified design [TODO: Explain why]. It may be important to consider the cache organization details, the lengths of the cache buses to the pipeline, and the process technology. One thing to note here is that a single-ported instruction cache has a lower area than a single-ported data cache or unified cache because the instruction cache requires only an exclusive-read port while the others require a read/write port.
Unified L1 and Split L2 Caches in Real Processors
I'm not aware of any processor designed in the last 15 years that has a unified (L1) cache. In modern processors, the unified design is mostly used for higher-numbered cache levels, which makes sense because they are not directly connected to the pipeline. An interesting example where the L2 cache follows the split design is the Intel Itanium 2 9000 processor. This processor has a 3-level cache hierarchy where both the L1 and L2 caches are split and private to each core and the L3 cache is unified and shared between all the cores. The L2D and L2I caches are 256 KB and 1 MB in size, respectively. Later Itanium processors reduced the L2I size to 512 KB. The Itanium 2 9000 manual explains why the L2 was made split:
The separate instruction and data L2 caches provide more efficient
access to the caches compared to Itanium 2 processors where
instruction requests would contend against data accesses for L2
bandwidth against data accesses and potentially impact core execution
as well as L2 throughput.
.
.
.
The L3 receives requests from both the
L2I and L2D but gives priority to the L2I request in the rare case of
a conflict. Moving the arbitration point from the L1-L2 in the Itanium
2 processor to the L2-L3 cache greatly reduces conflicts thanks to the
high hit rates of the L2.
(I think "against data accesses" was written twice by mistake.)
The second paragraph from that quote mentions an advantage that I have missed earlier. A split L2 cache moves the data-instruction conflict point from the L2 to the L3. In addition, some/many requests that miss in the L1 caches may hit in the L2, thereby making contention at the L3 less likely.
By the way, the L2I and L2D in the Itanium 2 9000 both use the NRU replacement policy.
Unified L1 Cache Partitioning
James Bell et al. mentioned in their 1974 paper the idea of partitioning a unified cache between instructions and data. The only paper that I'm aware of that proposed and evaluated such a design is Virtually Split Cache: An Efficient Mechanism to Distribute Instructions and Data, which was published in 2013. The main disadvantage of the split design is that one of the L1 caches may be underutilized while the other may be over-utilized. A split cache doesn't allow one cache to essentially take space from the other when needed. It is for this reason that the unified design has a lower L1 miss rate than the overall miss rate of the split caches (as the paper shows using simulation). However, the combined impact on performance of the higher latency and lower miss rate still makes the system with the unified L1 cache slower than the one with the split cache.
The Virtually Split Cache (VSC) design is the middle point between the split and unified designs. The VSC dynamically partitions (way-wise) the L1 cache between instructions and data depending on demand. This enables better utilization of the L1 cache, similar to the unified design. However, the VSC has an even lower miss rate because partitioning reduces potential space conflict between lines holding instructions and lines holding data. According to the experimental results (all cache designs have the same overall capacity), even if the VSC has the same latency as the unified cache, the VSC has about the same performance as the split design on a single-core system and has a higher performance on a multi-core system because the lower miss rate results in less contention on accessing the shared L2 cache. In addition, in both the single-core and multi-core system configurations, the VSC reduces energy consumption due to the lower miss rate.
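To give a feel for what way-wise partitioning means, here is a generic sketch of a single cache set whose ways are divided between instructions and data by a movable boundary. This is not the mechanism from the VSC paper (which isn't reproduced here); the class name, the LRU choice within each partition, and the rule that ways changing owner are invalidated are all assumptions made for illustration.

```python
class WayPartitionedSet:
    """One cache set whose ways are split between instructions and data."""

    def __init__(self, num_ways, instr_ways):
        if not 0 < instr_ways < num_ways:
            raise ValueError("each partition needs at least one way")
        self.num_ways = num_ways
        self.instr_ways = instr_ways        # ways [0, instr_ways) hold instructions
        self.ways = [None] * num_ways       # stored tag per way, None if empty
        self.last_use = [0] * num_ways
        self.clock = 0

    def _partition(self, is_instr):
        if is_instr:
            return range(0, self.instr_ways)
        return range(self.instr_ways, self.num_ways)

    def access(self, tag, is_instr):
        """Return True on a hit; on a miss, fill within the partition."""
        self.clock += 1
        ways = self._partition(is_instr)
        for w in ways:
            if self.ways[w] == tag:
                self.last_use[w] = self.clock
                return True
        # Prefer an empty way, otherwise evict the least recently used one.
        victim = min(ways, key=lambda w: (self.ways[w] is not None, self.last_use[w]))
        self.ways[victim] = tag
        self.last_use[victim] = self.clock
        return False

    def repartition(self, instr_ways):
        """Move the boundary; ways that change owner are invalidated."""
        if not 0 < instr_ways < self.num_ways:
            raise ValueError("each partition needs at least one way")
        lo, hi = sorted((self.instr_ways, instr_ways))
        for w in range(lo, hi):
            self.ways[w] = None
        self.instr_ways = instr_ways


s = WayPartitionedSet(num_ways=4, instr_ways=2)
s.access("i0", is_instr=True)
s.access("i1", is_instr=True)
for tag in ("d0", "d1", "d2", "d3"):     # a data stream larger than its partition
    s.access(tag, is_instr=False)
instr_survived = s.access("i0", is_instr=True)
```

In this toy run the data stream thrashes only its own two ways, so the instruction line is still present afterwards; in a plain unified set with a shared replacement policy it could have been evicted.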
A VSC could have a lower latency than a unified cache. Although both are dual-ported (to have the same bandwidth as the single-ported split cache), in the VSC design, only the interface needs to be dual-ported because no part of the cache can be accessed more than once at the same time. (The paper doesn't explicitly say so, but I think the VSC allows the same line to be present in both partitions if it holds both instructions and data, so it still has the consistency problem that exists in the split design.) Assuming that each bank of the cache represents one cache way, then each bank can be single-ported in VSC. This leads to a simpler design (see: Fast quadratic increase of multiport-storage-cell area with port number) and may allow reducing the latency. Moreover, assuming that the difference in latency between the unified design and the split design is small (because the instruction cache and data cache in the split design are physically close to each other), the VSC design can store instructions and data in banks that are physically close to where they are needed in the pipeline and support variable-latency access depending on how many banks are allocated for each. The larger the number of banks, the higher the latency, up to the latency of the unified design. This would require, however, a pipeline design that can handle such a variable-latency cache.
I think one important thing that this paper is missing is evaluating the VSC design with higher access latencies with respect to the split design (not just 2 cycles vs. 3 cycles). I think increasing the latency by even only one cycle would make VSC slower than split.
The Case of the Z80000 Unified Cache
The Zilog Z80000 processor has a scalar 6-stage pipeline with an on-chip single-ported unified cache. The cache is 16-way fully associative and sectored. Each stage of the pipeline takes at least two clock cycles (loads that miss in the cache and other complex instructions may take more cycles). Each pair of consecutive clock cycles constitutes a processor cycle. The cache design of the Z80000 has a number of unique properties that I've not seen anywhere else:
There can be up to two cache accesses in a single processor cycle, including up to one instruction fetch and up to one data access. However, the cache, despite being unified and single-ported, is designed in such a way as to have no contention between instruction fetches and data accesses. The unified cache has an access latency of a single clock cycle (which is equal to half a processor cycle). In each processor cycle, an instruction fetch is performed in the first clock cycle and a data access is performed in the second clock cycle. There is no latency benefit from splitting the cache in this case: time-multiplexing accesses to the cache provides the same bandwidth, and the downsides of the split design don't exist. The full associativity minimizes space contention between instruction and data lines. This design was made possible by the small cache size and relatively shallow pipeline with respect to the cache latency.
The System Configuration Control Longword (SCCL) offers Cache Instruction (CI) and Cache Data (CD) control bits. If CI is 1, instruction fetches that miss in the cache can be filled in the cache. If CD is 1, data loads that miss in the cache can be filled in the cache. The cache uses a write-no-allocate policy so write misses never allocate in the cache. If both CI and CD are set to 1, the cache effectively works like a unified cache. If only one of the flags is 1, the cache effectively works like a data-only or instruction-only cache. Applications can tune these flags to improve performance.
This property is not relevant to the question, but I found it interesting. SCCL also offers a Cache Replacement (CR) control bit. Setting this bit to zero disables replacement on a miss, so lines are never replaced. If all entries in a set are occupied and a load/fetch miss occurs in that set, the line is simply not filled in the cache.
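The combined effect of the CI, CD, and CR bits on a miss can be summarized as a small decision function. This is a sketch of the policy as described above, with the bits passed as plain booleans; the function name and the `free_entry_available` parameter are hypothetical, and the actual bit positions in SCCL are not modeled.

```python
def may_fill_on_miss(access, ci, cd, cr, free_entry_available):
    """Decide whether a missing line may be brought into the cache.

    access: "fetch" (instruction fetch), "load" (data read) or "store".
    ci, cd, cr: the Cache Instruction, Cache Data and Cache Replacement bits.
    free_entry_available: True if the line can be placed without evicting.
    """
    if access == "store":
        return False  # write-no-allocate: write misses never allocate
    if access == "fetch" and not ci:
        return False  # instruction fills disabled
    if access == "load" and not cd:
        return False  # data fills disabled
    # With CR = 0 no line is ever replaced, so a fill needs a free entry.
    return bool(free_entry_available or cr)
```

For example, with CI = 1 and CD = 0 the function only ever admits instruction fetches, which matches the "instruction-only cache" behavior.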
The Cases of the R3000, 80486, and Pentium P5
I came across the following complementary question on SE Retrocomputing: Why did Intel abandon unified CPU cache?. There are a number of issues with the accepted answer on that question. I'll address these issues here and explain why the 80486 and Pentium caches were designed like that based on information from Intel.
The 80386 does have an external cache controller with an external unified cache. However, just because the cache is external doesn't necessarily mean that it's likely to be unified. Consider the R3000 processor, which was released three years after the 80386 and is of the same generation as the 80486. The designers of the R3000 opted for a large external cache instead of a small on-chip cache to improve performance according to Section 1.8 of PaceMips R3000 32-Bit, 25 MHz RISC CPU with Integrated Memory Management Unit. The first section of Chapter 1 of the R3000 Software Reference Manual says that the external cache uses the split design so that it can perform an instruction fetch and a read or write data access in the same "clock phase." It's not clear to me exactly how this works though. My understanding is that the external data and address buses are shared between the two caches and with memory as well. (Also, some of the address wires are used to provide cache line tags to the on-chip cache controller for tag matching.) Both caches are direct-mapped, maybe to achieve a single-cycle access latency. A unified external cache design with the same bandwidth, associativity, and capacity would require the cache to be fully dual-ported; the VSC design could be used instead, but VSC was invented many years later. Such a unified cache would be more expensive and may have a latency larger than the single cycle required to keep the pipeline filled with instructions.
Another issue with the linked answer from Retro is that just because the 80486 evolved directly from the 80386 doesn't necessarily mean that it also had to use the unified design. According to the Intel paper titled The i486 CPU: executing instructions in one clock cycle, Intel evaluated both designs and deliberately chose to go for the unified on-chip design. Compared to the same-gen R3000, both processors have similar frequency ranges and the off-chip data width is 32 bits in both processors. However, the unified cache of the 80486 is much smaller than the total cache capacity of the R3000 (up to 16KB vs. up to 256KB+256KB). On the other hand, being on-chip made it more feasible for the 80486 to have wider cache buses. In particular, the 80486 cache has a 16-byte instruction fetch bus, a 4-byte data load bus, and a 4-byte data load/store bus. The two data buses could be used at the same time to load a single 8-byte operand (a double-precision FP operand or a segment descriptor) in one access. The R3000 caches share a single 4-byte bus. The relatively small size of the 80486 cache may have allowed making it 4-way associative with a single-cycle latency. This means that a load instruction that hits in the cache can supply the data to a dependent instruction in the next cycle without any stalls. On the R3000, if an instruction depends on an immediately preceding load instruction, it has to stall for one cycle in the best-case scenario of a cache hit.
The 80486 cache is single-ported, but the instruction prefetch buffer and the wide 16-byte instruction fetch bus help keep contention between instruction fetches and data accesses to a minimum. Intel mentions that simulation results show that the unified design provides a hit rate that is higher than that of a split cache by enough to compensate for the bandwidth contention.
Intel explained in another paper titled Design of the Intel Pentium processor why they decided to change the cache in the Pentium to split. There are two reasons: (1) The 2-wide superscalar Pentium requires the ability to perform up to two data accesses in a single cycle, and (2) Branch prediction increases cache bandwidth demand. The paper doesn't mention whether Intel considered using a triple-ported banked unified cache, but they probably did and found that it was not feasible at that time, so they went for a split cache with a dual-ported 8-banked data cache and a single-ported instruction cache. With today's fab technology, the triple-ported unified design may be better.
Wider pipelines in later microarchitectures required higher parallelism at the data cache. Now we're at 4 64-byte ports in Sunny Cove.
Answering the Second Part of the Question
it was mentioned that the cache is a split cache, and no hazard what
does this exactly means?
It's probably about the structural hazard mentioned in Paul's comment. That is, a unified single-ported cache cannot be accessed by the instruction fetch unit and the memory unit at the same time.

When CPU flush value in storebuffer to L1 Cache?

Core A writes value x to the store buffer, waits for the invalid ack, and then flushes x to the cache. Does it wait for only one ack or for all acks? And how does it know how many acks there are across all CPUs?
It isn't clear to me what you mean by "invalid ack", but let's assume you mean a snoop/invalidation originating from another core which is requesting ownership of the same line.
In this case, the stores in the store buffer are generally free to ignore such invalidations from other cores since the stores in the store buffer are not yet globally visible. Stores only become globally visible when they commit to L1 at some point after they have retired. At this point1 the cache controller will make an RFO (request for ownership) of the associated line if it isn't already in the cache. It is essentially at this point that the store becomes globally visible. The L1 cache controller doesn't need to know how many other invalidations are in flight, because they are being mediated by some higher-level components in the system as part of the MESI protocol, and when the controller gets the line in the E state, it is guaranteed to be the exclusive owner.
In short, invalidations from other cores have little effect on stores in the store buffer2, since they become globally visible at a single point based on an RFO request. It is loads that have executed that are more likely to be affected by invalidation activity on another core, especially on strongly ordered platforms such as x86, which doesn't allow visible load-load reordering. The so-called MOB on x86, for example, is responsible for tracking whether invalidations potentially break the ordering rules.
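As a rough illustration of the points above, here is a toy model of a core with a store buffer and an L1. It is a sketch under simplifying assumptions (one value per line, no RFO prefetch, commit happens whenever `commit_oldest` is called), and the class and method names are invented for the example rather than taken from any real design.

```python
from collections import deque


class Core:
    def __init__(self):
        self.store_buffer = deque()  # (line, value) pairs, oldest first
        self.l1 = {}                 # line -> [state, value], state is "M", "E" or "S"
        self.rfos_issued = 0

    def store(self, line, value):
        # The store sits in the store buffer and is not globally visible yet.
        self.store_buffer.append((line, value))

    def snoop_invalidate(self, line):
        # Another core took ownership: drop the cached copy.
        # Pending stores in the store buffer are left untouched.
        self.l1.pop(line, None)

    def commit_oldest(self):
        line, value = self.store_buffer.popleft()
        state = self.l1.get(line, ["I", None])[0]
        if state not in ("M", "E"):
            # One request goes out; the rest of the system gathers responses.
            self.rfos_issued += 1
        # The store becomes globally visible when it is written to L1.
        self.l1[line] = ["M", value]


core = Core()
core.l1[7] = ["E", 0]
core.store(7, 42)
core.snoop_invalidate(7)      # line stolen before the store commits
pending = len(core.store_buffer)
core.commit_oldest()          # needs an RFO to get the line back
```

The invalidation only costs the core an extra RFO at commit time; the buffered store itself is unaffected.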
RFO Response
Perhaps the "acks" you were talking about are the responses from other cores to the writing core's request to obtain or upgrade its ownership of the line so that it can write to it: i.e., invalidating copies of the line in the other CPUs and so on.
This is commonly known as issuing an RFO which when successful leaves the line in the E state in the requesting core.
Most CPUs are layered, with a variety of different agents working together to ensure coherency. In practice, this means that a CPU doesn't need to wait for up to N-1 "acks" from the other N-1 cores on an N CPU system, but rather just a single reply from a higher-level component which itself is in charge of sending and collecting responses from other CPUs.
One example could be a single-socket multi-core CPU with a private L1 and L2, and shared L3. A core might send its RFO down to the L3, which might send invalidate requests to all cores, wait for their responses and then acknowledge the RFO request to the requesting core. Alternatively, the L3 may store some bits which indicate which cores could possibly have a copy of the line, and then it only needs to send the requests to those cores (the role the L3 is taking in that case is sometimes referred to as a snoop filter).
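The snoop-filter variant can be sketched as follows. This is a deliberately simplified model, not a description of any particular Intel design: `SharedL3`, `read`, and `rfo` are hypothetical names, invalidations are assumed to be acknowledged instantly, and evictions are ignored.

```python
class SharedL3:
    """Toy shared L3 that tracks which cores may hold each line."""

    def __init__(self, n_cores):
        self.n_cores = n_cores
        self.sharers = {}            # line address -> set of core ids
        self.invalidations_sent = 0

    def read(self, core, addr):
        # Record that this core may now hold a copy of the line.
        self.sharers.setdefault(addr, set()).add(core)

    def rfo(self, core, addr):
        """Handle a request for ownership and return one ack to the requester."""
        # Only cores recorded as possible sharers are asked to invalidate.
        targets = self.sharers.get(addr, set()) - {core}
        self.invalidations_sent += len(targets)
        # In this model the targets respond immediately, so ownership moves now.
        self.sharers[addr] = {core}
        return "ack"


l3 = SharedL3(n_cores=8)
l3.read(1, 0x40)
l3.read(5, 0x40)
reply = l3.rfo(0, 0x40)
```

On this 8-core model the requester sees a single reply, and only two invalidations are sent instead of seven, because the L3 knows that just cores 1 and 5 could hold the line.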
Since all communication between agents passes through the L3, it is able to keep everything consistent. In the case of a multi-socket system, things get more complicated: the L3 on the local socket may again get the request and may pass it over to the other socket to do the same type of invalidation there. Again there might exist the concept of a snoop filter, or other concepts may exist and the behavior may even be configurable!
For example, in Intel's Broadwell Xeon architecture, there are four different configurable snoop modes:
Broadwell offers four different snoop modes a reintroduction of Home
Snoop with Directory and Opportunistic Snoop Broadcast (HS with DIR +
OSB) previously available on Ivy Bridge, and three snoop modes that
were available on Haswell, Early Snoop, Home Snoop, and Cluster on Die
Mode (COD). Table 5 maps the memory bandwidth and latency trade-offs
that will vary across each of the different modes. Most workloads will
find that Home Snoop with Directory and Opportunistic Snoop Broadcast
will be the best choice.
... with different performance tradeoffs:
The rest of that document goes into some detail about how the various modes work.
So I guess the short answer is "it's complicated and depends on the detailed design and possibly even user-configurable settings".
1 Or potentially at some earlier point since an optimized implementation might "look ahead" in the store buffer and issue RFOs (so-called "RFO prefetches") for upcoming stores even before they become the most senior store.
2 Invalidations may, however, complicate the RFO prefetches mentioned in the first footnote, since it means there is a window where the line can be "stolen back" by another core, making the RFO prefetch wasted work. A sophisticated implementation may have a predictor that varies the RFO prefetch aggressiveness based on monitoring whether this occurs.

MSI: Why do we need to write the line back when other CPU is going to override it?

In the book "Computer Architecture", by Hennessy/Patterson, 5th ed, on page 360 they describe MSI protocol, and write something like:
If the line is in state "Exclusive" (Modified), then on receiving "Write Miss" from the bus the current CPU 1) writes back the line into the bus, and then 2) goes into "Invalid" state.
Why do we need to write-back the line, if it will be overwritten anyway by the successive write by the other CPU?
Is it connected with the fact that every CPU should see the same writes? (but I don't see why is it a problem not see this particular write by some other CPU)
Here is the protocol from their book (question in green, in purple it is clear: we need to write-back in order to supply the line to requesting CPU):
Writing back the modified data to memory is not strictly necessary in an MSI protocol. The state diagrams also seem to assume a system with low-cost memory access (data is supplied by memory even when found in the shared state in another cache) and a shared bus connecting to the memory interface.
However, the modified data cannot simply be dropped as in shared state since the requesting processor might only be modifying part of the cache block (e.g., only one byte). Whatever portions of the block that are not modified by the requesting processor must still be available either in memory or at the requesting processor (the other processor has already invalidated its copy). With a shared bus and low cost memory access the cost difference of adding a write-back to memory over just communicating the data to the other processor is small.
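The partial-write argument can be made concrete with a few lines of Python. This is only a toy illustration with an assumed 8-byte block; `write_miss` and its parameters are names invented for the example, not part of any protocol specification.

```python
def write_miss(memory_block, owner_block, offset, new_byte, supply_dirty_data):
    """Return the block as seen by the requester after it writes one byte.

    memory_block: the (stale) copy in memory.
    owner_block: the up-to-date copy held in Modified state by another cache.
    supply_dirty_data: True if the old owner supplies its data before
    invalidating, False if the dirty data is simply dropped.
    """
    source = owner_block if supply_dirty_data else memory_block
    block = bytearray(source)
    block[offset] = new_byte  # the requester modifies only one byte
    return bytes(block)


stale_memory = bytes(8)        # memory still holds old zeros
dirty_in_owner = b"ABCDEFGH"   # the other cache holds the newest data

good = write_miss(stale_memory, dirty_in_owner, 2, ord("x"), True)
bad = write_miss(stale_memory, dirty_in_owner, 2, ord("x"), False)
```

With the data supplied, the requester ends up with `b"ABxDEFGH"`. If the dirty block is dropped, the seven bytes the requester did not write are lost and it sees stale zeros around its single new byte.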
In addition, even on a word-addressed system with word-sized cache blocks, keeping the old data available allows the write miss request to be sent speculatively (as with out-of-order execution or prefetch-for-write) without correctness issues.
(Per-byte tracking of modified [or valid as a superset of modified] state would allow some data communication to be avoided at the cost of extra state bits and a more complex communication system.)

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definitely. Read on for some discussion.
Tuning a service is an art form, or else it requires benchmarking (and the space of things you would need to benchmark is huge). I believe that it depends on factors such as the following (this is not exhaustive).
how much time does an item picked up from the ready queue take to process?
how many worker threads are there?
how many producers are there, and how often do they produce?
what type of wait concepts are you using: spin-locks or kernel waits (the latter being slower)?
So, if items are produced often, the number of threads is large, and the processing time is low, the data structure could be locked for large windows, thus causing thrashing.
Other factors may include the data structure used and how long the data structure is locked for, e.g., if you use a linked list to manage such a queue, the add and remove operations take constant time. A priority queue (heap) takes a few more operations on average when items are added.
If your system is for business processing you could take this question out of the picture by just using:
A process-based architecture: just spawn multiple producer/consumer processes and use the file system for communication,
A non-preemptive cooperative threading programming language such as Stackless Python, Lua or Erlang.
Also note: synchronization primitives cause floods of inter-processor cache-coherence traffic, which are not good, so they should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-CPU ready queue is a natural choice for the data structure. This is because most operating systems try to keep a process on the same CPU, for many reasons that you can google. What does that imply? If a thread is ready and another CPU is idling, the OS will not immediately migrate the thread to that CPU. Load balancing only kicks in over the long run.
Had the situation been different, that is, had keeping thread-CPU affinity not been a design goal and thread migration been frequent, then keeping separate per-CPU run queues would be costly.
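A minimal sketch of the per-CPU idea, assuming a single-threaded model for clarity (a real scheduler would protect each queue with its own lock, which is exactly what avoids contention on one shared lock). The class `PerCpuScheduler` and its methods are hypothetical names used only for illustration.

```python
from collections import deque


class PerCpuScheduler:
    def __init__(self, n_cpus):
        # One ready queue per CPU instead of a single shared queue.
        self.queues = [deque() for _ in range(n_cpus)]

    def enqueue(self, cpu, task):
        self.queues[cpu].append(task)

    def pick_next(self, cpu):
        # A CPU only looks at its own queue; an idle CPU does not
        # immediately take work from the others.
        queue = self.queues[cpu]
        return queue.popleft() if queue else None

    def balance(self):
        """Occasional pass: move one task from the longest to the shortest queue."""
        longest = max(self.queues, key=len)
        shortest = min(self.queues, key=len)
        if len(longest) - len(shortest) > 1:
            shortest.append(longest.pop())


sched = PerCpuScheduler(n_cpus=2)
for task in ("a", "b", "c"):
    sched.enqueue(0, task)

before = sched.pick_next(1)   # CPU 1 is idle, nothing is migrated yet
sched.balance()               # the long-run balancing pass moves one task
after = sched.pick_next(1)
```

Here CPU 1 finds nothing to run until the balancing pass moves task "c" over, which mirrors the trade-off described above: less sharing and better affinity, at the price of short-term imbalance.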