Why are CAS (Atomic) operations faster than synchronized or volatile operations

Why are CAS (Atomic) operations faster than synchronized or volatile operations - atomic

From what I understand, synchronized keyword syncs local thread cache with main memory. volatile keyword basically always reads the variable from the main memory at every access. Of course accessing main memory is much more expensive than local thread cache so these operations are expensive. However, a CAS operation use low level hardware operations but still has to access main memory. So how is a CAS operation any faster?

I believe the critical factor is as you state - the CAS mechanisms use low-level hardware instructions that allow for the minimal cache flushing and contention resolution.
The other two mechanisms (synchronization and volatile) use different architectural tricks that are more generally available across all different architectures.
CAS instructions are available in one form or another in most modern architectures but there will be a different implementation in each architecture.
Interesting quote from Brian Goetz (supposedly)
The relative speed of the operations is largely a non-issue. What is relevant is the difference in scalability between lock-based and non-blocking algorithms. And if you're running on a 1 or 2 core system, stop thinking about such things.
Non-blocking algorithms generally scale better because they have shorter "critical sections" than lock-based algorithms.

Note that a CAS does not necessarily have to access memory.
Most modern architectures implement a cache coherency protocol like MESI that allows the CPU to do shortcuts if there is only one thread accessing the data at the same time. The overhead compared to traditional, unsynchronized memory access is very low in this case.
When doing a lot of concurrent changes to the same value however, the caches are indeed quite worthless and all operations need to access main memory directly. In this case the overhead for synchronizing the different CPU caches and the serialization of memory access can lead to a significant performance drop (this is also known as cache ping-pong), which can be just as bad or even worse than what you experience with lock-based approaches.
So never simply assume that if you switch to atomics all your problems go away. The big advantage of atomics are the progress guarantees for lock-free (someone always makes progress) or wait-free (everyone finishes after a certain number of steps) implementations. However, this is often orthogonal to raw performance: A wait-free solution is likely to be significantly slower than a lock-based solution, but in some situations you are willing to accept that in order to get the progress guarantees.

Related

Strict consistency vs atomic consistency

I have read a couple of articles and I am confused about the difference between strict consistency (which is defined as "It can be better understood as though a global clock is present in which every write should be reflected in all processor caches by the end of that clock period.") and atomic consistency (or linearizability, which is defined as "sequential consistency with the real-time constraint"). Both definitions come from Wikipedia. The source of my confusion is the fact that the strict model provides that every process see a change immediately and atomic consistency is also said to work in real-time providing the same sequence of writes for every process.

The requirement for maintaining a system-wide distribution of global clock for a strict consistency case is hopefully clean enough by itself.
The atomic consistency thus needs some more warranties, in exchange to not maintaining the global clock, to still become and stay consistent system-wide.
Here comes useful the warranty from HRT-system, as it keeps the sequential consistency within its realm of deterministic, a-priori known finite time. Thus the state-change propagation planning is possible and holds throughout the whole life-cycle of the HRT-system operation.
On "sequential consistency with the real-time constraint" :
This option ought be understood as a technically less strict, yet for maintaining the system-wide consistency-goal sufficient enough ( see determinism + known deadline below ), not having a need to guarantee a system-wide distribution of a uniform clock.
For touching what the "real-time constraint" actually is useful for, let me borrow ( incl. original typos, accents added ) from a book from Giovanni Di Sirio on Real-Time Operating System (RTOS) disambiguation :
What an RTOS is
An RTOS is an operating system whose internal processes are guaranteed to be compliant with (hard or soft) realtime requirements. The fundamental qualities of an RTOS are:
- Predictable. It is the quality of being predictable in the scheduling behavior.
- Deterministic. It is the quality of being able to consistently produce the same results under the same conditions.
RTOS are often confused with “fast” operating systems. While efficiency is a positive attribute of an RTOS, efficiency alone does not qualifies an OS as RTOS but it could separate a good RTOS from a not so good one.
A deciding factor is the (un-)certainty of completing each work-unit within a(n un-)known deadline :
“A non real time system is a system where the programmed reaction to an event will certainly happen sometime in the future”.
Whereas :
Soft Real Time.A Soft Real Time (SRT) system is a system where not meeting a deadline can have undesirable but not catastrophic effects, a performance degradation for example. Such systems could be described as follow :
“A soft real time system is a system where the programmed reaction to an event is almost always completed within a known finite time”.
Hard Real Time.An Hard Real Time (HRT) system is a system where not meeting a deadline can have catastrophic effects. Hard realtime systems require a much more strict definition and could be described as follow :
“An hard real time system is a system where the programmed reaction to an event is guaranteed to be completed within a known finite time”.

What does a 'Split' cache means. And how is it useful(if it is)?

I was doing a question on Computer Architecture and in it it was mentioned that the cache is a split cache, and no hazard what does this exactly means?

A summary and additional discussion can be found at: L1 caches usually have split design, but L2, L3 caches have unified design, why?.
Introduction
A split cache is a cache that consists of two physically separate parts, where one part, called the instruction cache, is dedicated for holding instructions and the other, called the data cache, is dedicated for holding data (i.e., instruction memory operands). Both of the instruction cache and data cache are logically considered to be a single cache, described as a split cache, because both are hardware-managed caches for the same physical address space at the same level of the memory hierarchy. Instruction fetch requests are handled only by the instruction cache and memory operand read and write requests are handled only by the data cache. A cache that is not split is called a unified cache.
The Harvard vs. von Neumann architecture distinction originally applies to main memory. However, most modern computer systems implement the modified Harvard architecture whereby the L1 cache implements the Harvard architecture and the rest of the memory hierarchy implements the von Neumann architecture. Therefore, in modern systems, the Harvard vs. von Neumann distinction mostly applies to the L1 cache design. That's why the split cache design is also called the Harvard cache design and the unified cache design is also called von Neumann. The Wikipedia article on the modified Harvard architecture discusses three variants of the architecture, of which one is the split cache design.
To my knowledge, the idea of the split cache design was first proposed and evaluated by James Bell, David Casasent, and C. Cordon Bell in their paper entitled An Investigation of Alternative Cache Organizations, which was published in 1974 in the IEEE TC journal (the IEEE version is a bit clearer). The authors found using a simulator that, for almost all cache capacities considered in the study, an equal split results in the best performance (see Figure 5). From the paper:
Typically, the best performance occurs with half of the cache devoted
to instructions and half to data.
They also provided a comparison with a unified cache design of the same capacity and their initial conclusion was that the split design has no advantage over the unified design.
As shown in Fig. 6, the performance of the best dedicated cache CUXD
(half allotted to instructions and half to data) in general is quite
similar to that of a homogeneous cache (CUX); the extra complexity of
a dedicated cache control is thus not justifiable.
It's not clear to me actually whether the paper evaluated the split design or a cache that is partitioned between instructions and data. One paragraph says:
Thus far, the cache memory has been assumed to be composed of
homogeneous cells. But conceivably a functionally specialized
partitioning of the cache could give higher performance. For example,
perhaps a cache devoted exactly half to instructions and half to data
would be more effective than a homogeneous one; alternatively, one
that holds just instructions could be better than one holding just
data. To test these hypotheses, the effects of dividing the cache into
sections dedicated to specific uses were investigated.
(This paragraph was formatted automatically by https://www.textfixer.com/tools/remove-white-spaces.php.)
It seems to me that the authors are talking about both the split and partitioned designs. But it's not clear what design was implemented in the simulator and how the simulator was configured for evaluation.
Note that the paper didn't discuss why the split design may have a better or worse performance than the unified design. Also note how the authors used the terms "dedicated cache" and "homogeneous cache." The terms "split" and "unified" appeared in later works, which I believe were first used by Alan Jay Smith in Directions for memory hierarchies and their components: research and development in 1978. But I'm not sure because the way Alan used these terms gives the impression that they are already well-known. It appears to me from Alan's paper that the first processor that used the split cache design was the IBM 801 around 1975 and probably the second processor was the S-1 (around 1976). It's possible that the engineers of these processors might have came up with the split design idea independently.
Advantages of the Split Cache Design
The split cache design was then extensively studied in the next two decades. See, for example, Section 2.8 of this highly influential paper. But it was quickly recognized that the split design is useful for pipelined processors where the instruction fetch unit and the memory access unit are physically located in different parts of the chip. With the unified design, it is impossible to place the cache simultaneously close to the instruction fetch unit and the memory unit, resulting in high cache access latency from one or both units. The split design enables us to place the instruction cache close to the instruction fetch unit and the data cache close to the memory unit, thereby simultaneously reducing the latencies of both. (See what it looks like in the S-1 processor in Figure 3 of this document.) This is the primary advantage of the split design over the unified design. This is also the crucial difference between the split design and the unified design that supports cache partitioning. That's why it makes to have a split data cache, as proposed in several research works, such as Cache resident data locality analysis and Partitioned first-level cache design for clustered microarchitectures.
Another advantage of the split design is that it allows instruction and data accesses to occur in parallel without contention. Essentially, a split cache can have double the bandwidth of a unified cache. This improves performance in pipelined processors because instruction and data accesses can occur in the same cycle in different stages of the pipeline. Alternatively, the bandwidth of a unified cache can be doubled or improved using multiple access ports or multiple banks. In fact, using two ports provides twice the bandwidth to the whole cache (in contrast, in the split design, the bandwidth is split in half between the instruction cache and the data cache), but adding another port is more expensive in terms of area and power and may impact latency. A third alternative to improve the bandwidth is by adding more wires to the same port so that more bits can be accessed in the same cycle, but this would probably be restricted to the same cache line (in contrast to the two other approaches). If the cache is off-chip, then the wires that connect it to the pipeline become pins and the impact of the number of wires on area, power, and latency become more significant.
In addition, processors that use a unified (L1) cache typically included arbitration logic that prioritizes data accesses over instruction accesses; this logic can be eliminated in the split design. (See the discussion on the Z80000 processor below for a unified design that avoids arbitration.) Similarly, if there is another cache level that implements the unified design, there will be a need for an arbitration logic at the L2 unified cache. Simple arbitration policies may reduce performance and better policies may increase area. [TODO: Add examples of policies.]
Another potential advantage is that the split design allows us to employ different replacement policies for the instruction cache and data cache that may be more suitable for the access patterns of each cache. All Intel Itanium processors use the LRU policy for the L1I and the NRU policy for the L1D (I know for sure that this applies to the Itanium 2 and later, but I'm not sure about the first Itanium). Moreover, starting with Itanium 9500, the L1 ITLB uses NRU but the L1 DTLB uses LRU. Intel didn't disclose why they decided to use different replacement policies in these processors. In general, It seems to me that it's uncommon for the L1I and L1D to use different replacement policies. I couldn't find a single research paper on this (all papers on replacement policies focus only on data or unified caches). Even for a unified cache, it may be useful for the replacement policy to distinguish between instruction and data lines. In a split design, a cache line fetched into the data cache can never displace a line in the instruction cache. Similarly, a line filled into the instruction cache can never displace a line in the data cache. This issue, however, may occur in the unified design.
The last sub-section of the section on the differences between the modified Harvard architecture and Harvard and von Neumann in the Wikipedia article mentions that the Mark I machine uses different memory technologies for the instruction and data memories. This made me think whether this can constitute as an advantage for the split design in modern computer systems. Here are some of the papers that show that this indeed the case:
LASIC: Loop-Aware Sleepy Instruction Caches Based on STT-RAM Technology: The instruction cache is mostly read-only, except when there is a miss, in which case the line must be fetched and filled into the cache. This means that, when using STT-RAM (or really any other NVRAM technology), the expensive write operations occur less frequently compared to using STT-RAM for the data cache. The paper shows that by using an SRAM loop cache (like the LSD in Intel processors) and an STT-RAM instruction cache, energy consumption can be significantly reduced, especially when a loop is being executed that fits entirely in the loop cache. The non-volatile property of STT-RAM enables the authors to completely power-gate the instruction cache without losing its contents. In contrast, with an SRAM instruction cache, the static energy consumption is much larger, and power-gating it results in losing its contents. There is, however, a performance penalty with the proposed design (compared to a pure SRAM cache hierarchy).
Feasibility exploration of NVM based I-cache through MSHR enhancements: This paper also proposes using STT-RAM for the instruction cache while the data cache and the L2 cache remain based on SRAM. There is no loop cache here. This paper instead targets the high write latency issue of STT-RAM, which is incurred when a line is filled in the cache. The idea is that when a requested line is received from the L2 cache, the L1 cache first buffers the line in the MSHR allocated for its request. The MSHRs are still SRAM-based. Then the instruction cache line can be fed into the pipeline directly from the MSHR without having to potentially stall until it gets written in the STT-RAM cache. Similar to the previous work, the proposed architecture improves energy consumption at the expense of reduced performance.
System level exploration of a STT-MRAM based level 1 data-cache: Proposes using STT-RAM for the L1 data cache while keeping all other caches SRAM-based. This reduces area overhead and energy consumption, but performance is penalized.
Loop optimization in presence of STT-MRAM caches: A study of performance-energy tradeoffs: Compares the energy consumption and performance of pure (only SRAM or only STT-RAM) and hybrid (the L2 and instruction cache are STT-RAM-based) hierarchies. The hybrid cache hierarchy a performance-energy tradeoff that is in between the pure SRAM and pure STT-RAM hierarchies.
So I think we can say that one advantage of the split design is that we can use different memory technologies for the instruction and data caches.
There are two other advantages, which will be discussed later in this answer.
Disadvantages of the Split Cache Design
The split design has its problems, though. First, the combined space of the instruction and data caches may not be efficiently utilized. A cache line that contains both instructions and data may exist in both caches at the same time. In contrast, in a unified cache, only a single copy of the line would exist in the cache. In addition, the size of the instruction cache and/or the data cache may not be optimal for all applications or different phases of the same application. Simulations have shown that a unified cache of the same total size has a higher hit rate (see the VSC paper discussed later). This is the primary disadvantage of the split design. (If there is a placement contention on a single cache set in the split design, this contention may still occur in the unified design and it may have a worse impact on performance. In such a scenario, the split design would have a lower overall miss rate.)
Second, self-modifying code leads to consistency issues that need to be considered at the microarchitecture-level and/or software-level. (An inconsistency may be allowed between the two caches for a small number of cycles, but if the ISA does not allow such inconsistencies to be observable, they have to be detected before the instruction that got modified permanently changes the architectural state.) Maintaining instruction consistency requires more logic and has a higher performance impact in the split design than the unified one.
Third, the design and hardware complexity of a split cache compared against a single-ported unified cache, a fully dual-ported unified cache, and dual-ported banked cache of the same overall organization parameters is an important consideration. According to the cache area model proposed in CACTI 3.0: An Integrated Cache Timing, Power, and Area Model, the fully dual-ported design has the biggest area. This holds true irrespective of the types of the two ports (exclusive-read, exclusive-write, read/write). The dual-ported banked cache has a higher area than the single-ported unified cache. How these two compare against split is less obvious to me. My understanding is that the split design has a higher area than the single-ported unified design [TODO: Explain why]. It may be important to consider the cache organization details, the lengths of the cache buses to the pipeline, and the process technology. One thing to note here is a single-ported instruction cache has a lower are than a single-ported data cache or unified cache because the instruction cache requires only an exclusive-read port while the others require a read/write port.
Unified L1 and Split L2 Caches in Real Processors
I'm not aware of any processor designed in the last 15 years that has a unified (L1) cache. In modern processors, the unified design is mostly used for higher-numbered cache levels, which makes sense because they are not directly connected to the pipeline. An interesting example where the L2 cache follows the split design is the Intel Itanium 2 9000 processor. This processor has a 3-level cache hierarchy where both the L1 and L2 caches are split and private to each core and the L3 cache is unified and shared between all the cores. The L2D and L2I caches are 256 KB and 1 MB in size, respectively. Later Itanium processors reduced the L2I size to 512 KB. The Itanium 2 9000 manual explains why the L2 was made split:
The separate instruction and data L2 caches provide more efficient
access to the caches compared to Itanium 2 processors where
instruction requests would contend against data accesses for L2
bandwidth against data accesses and potentially impact core execution
as well as L2 throughput.
.
.
.
The L3 receives requests from both the
L2I and L2D but gives priority to the L2I request in the rare case of
a conflict. Moving the arbitration point from the L1-L2 in the Itanium
2 processor to the L2-L3 cache greatly reduces conflicts thanks to the
high hit rates of the L2.
(I think "against data accesses" was written twice by mistake.)
The second paragraph from that quote mentions an advantage that I have missed earlier. A split L2 cache moves the data-instruction conflict point from the L2 to the L3. In addition, some/many requests that miss in the L1 caches may hit in the L2, thereby making contention at the L3 less likely.
By the way, the L2I and L2D in the Itanium 2 9000 both use the NRU replacement policy.
Unified L1 Cache Partitioning
James Bell et al. mentioned in their 1974 paper the idea of partitioning a unified cache between instructions and data. The only paper that I'm aware of that proposed and evaluated such a design is Virtually Split Cache: An Efficient Mechanism to Distribute Instructions and Data, which was published in 2013. The main disadvantage of the split design is that one of the L1 caches may be underutilized while the other may be over-utilized. A split cache doesn't allow one cache to essentially take space from the other when needed. It is for reason why the unified design has a low L1 miss rate than the overall miss rate of the split caches (as the paper shows using simulation). However, the combined impact on performance of the higher latency and lower miss rate still makes the system with the unified L1 cache slower than the one with the split cache.
The Virtually Split Cache (VSC) design is the middle point between the split and unified designs. The VSC dynamically partitions (way-wise) the L1 cache between instructions and data depending on demand. This enables better utilization of the L1 cache, similar to the unified design. However, the VSC has even a low miss rate because partitioning reduces potential space conflict between lines holding and instructions and lines holding data. According to the experimental results (all cache designs have the same overall capacity), even if the VSC has the same latency as the unified cache, the VSC has about the same performance as the split design on a single-core system and has a higher performance on a multi-core system because the lower miss rate results in less contention on accessing the shared L2 cache. In addition, in both the single-core and multi-core system configurations, the VSC reduces energy consumption due to the lower miss rate.
A VSC could have a lower latency than a unified cache. Although both are dual ported (to have the same bandwdith as the single-ported split cache), in the VSC design, only the interface needs to be dual ported because no part of the cache can be accessed more than once at the same time. (The paper doesn't explicitly say so, but I think the VSC allows the same line to be present in both partitions if it holds both instructions and data, so it still has the consistency problem that exists in the split design.) Assuming that each bank of the cache represents one cache way, then each bank can be single-ported in VSC. This leads to a simpler design (see: Fast quadratic increase of multiport-storage-cell area with port number) and may allow reducing the latency. Moreover, assuming that the different in latency between the unified design and the split design is small (because the instruction cache and data cache in the split design are physically close to each other), the VSC design can store instructions and data in banks that are physically close to where they are needed in the pipeline and support variable-latency access depending on how many banks are allocated for each. The larger the number of banks, the higher the latency, up to the latency of the unified design. This would require, however, a pipeline design that can handle such variable latency cache.
I think one important thing that this paper is missing is evaluating the VSC design with higher access latencies with respect to the split design (not just 2 cycles vs. 3 cycles). I think increasing the latency by even only one cycle would make VSC slower than split.
The Case of the Z80000 Unified Cache
The Zilog Z80000 processor has a scalar 6-stage pipeline with an on-chip single-ported unified cache. The cache is 16-way fully associative and sectored. Each stage of the pipeline takes at least two clock cycles (loads that miss in the cache and other complex instructions may take more cycles). Each pair of consecutive clock cycles constitutes a processor cycle. The cache design of the Z80000 has a number of unique properties that I've not seen anywhere else:
There can be up to two cache accesses in a single processor cycle, including up to one instruction fetch and up to one data access. However, the cache, despite of being unified and single-ported, is designed in such a way as to have no contention between instruction fetches and data accesses. The unified cache has an access latency of a single clock cycle (which is equal to half a processor cycle). In each processor cycle, an instruction fetch is performed in the first clock cycle and a data access is performed in the second clock cycle. There is no latency benefit from splitting the cache in this case and time-multiplexing accesses to the cache provides the same bandwidth and also the split design downsides don't exist. The full associativity minimizes space contention between instruction and data lines. This design was made possible by the small cache size and relatively shallow pipeline with respect to the cache latency.
The System Configuration Control Longword (SCCL) offers Cache Instruction (CI) and Cache Data (CD) control bits. If CI is 1, instruction fetches that miss in the cache can be filled in the cache. If CD is 1, data loads that miss in the cache can be filled in the cache. The cache uses a write-no-allocate policy so write misses never allocate in the cache. If both CI and CD are set to 1, the cache effectively works like a unified cache. If only one of the flags is 1, the cache effectively works like a data-only or instruction-only cache. Applications can tune these flags to improve performance.
This property is not relevant to the question, but I found it interesting. SCCL also offers Cache Replacement (CR) control bit. Setting this bit to zero disables replacement on a miss, so lines are never replaced. If all entries in a set are occupied and a load/fetch miss occurs in that set, the line is simply not filled in the cache.
The Cases of the R3000, 80486, and Pentium P5
I came across the following complementary question on SE Retrocomputing: Why did Intel abandon unified CPU cache?. There are a number of issues with the accepted answer on that question. I'll address these issues here and explain why the 80486 and Pentium caches were designed like that based on information from Intel.
The 80386 does have an external cache controller with an external unified cache. However, just because the cache is external doesn't necessarily mean that it's likely to be unified. Consider the R3000 processor, which was released three years after 80386 and is of the same generation as the 80486. The designers of R3000 opted for a large external cache instead of a small on-chip cache to improve performance according to Section 1.8 of PaceMips R3000 32-Bit, 25 MHz RISC CPU with Integrated Memory Management Unit. The first section of Chapter 1 of the R3000 Software Reference Manual says that the external cache uses the split design so that it can perform an instruction fetch and a read or write data access in the same "clock phase." It's not clear to me how this exactly works though. My understanding is that the external data and address buses are shared between the two caches and with memory as well. (Also, some of the address wires are used to provide cache line tags to the on-chip cache controller for tag matching.) Both caches are direct-mapped, maybe to achieve a single-cycle access latency. A unified external cache design with the same bandwdith, associativity, and capacity requires the cache to be fully dual-ported, or the VSC design could be used but VSC was invented many years later. Such a unified cache would be more expensive and may have a latency larger than the required single cycle to keep the pipeline filled with instructions.
Another issue with the linked answer from Retro is that just because the 80486 evolved directly from the 80386 doesn't necessarily mean that it has also use the unified design. According to the Intel paper titled The i486 CPU: executing instructions in one clock cycle, Intel evaluated both designs and deliberately chose to go for the unified on-chip design. Compared to the same-gen R3000, both processors have similar frequency ranges and the off-chip data width is 32 bits in both processors. However, the unified cache of the 80486 is much smaller than total cache capacity of the R3000 (up to 16KB vs. up to 256KB+256KB). On the other hand, being on-chip made it more feasible for the 80486 to have wider cache buses. In particular, the 80486 cache has a 16-byte instruction fetch bus, a 4-byte data load bus, and a 4-byte data load/store bus. The two data buses could be used at the same time to load a single 8-byte operand (double-precision FP operand or segment desc) in one access. The R3000 caches share a single 4-byte bus. The relatively small size of the 80486 cache may have allowed making it 4-way associative with a single-cycle latency. This means that a load instruction that hits in the cache can supply the data to a dependent instruction in the next cycle without any stalls. On the R3000, if an instruction depends on an immediately preceding load instruction, it has to stall for one cycle in the best-case scenario of a cache hit.
The 80486 cache is single-ported, but the instruction prefetch buffer and the wide 16-byte instruction fetch bus helps keeping contention between instruction fetches and data accesses to minimum. Intel mentions that simulation results show that the unified design provides a hit rate that is higher than that of a split cache enough to compensate for the bandwdith contention.
Intel explained in another paper titled Design of the Intel Pentium processor why they decided to change the cache in the Pentium to split. There are two reasons: (1) The 2-wide superscalar Pentium requires the ability to perform up to two data accesses in a single cycle, and (2) Branch prediction increases cache bandwdith demand. The paper doesn't mention whether Intel considered using a triple-ported banked unified cache, but they probably did and found out that it's not feasible at that time, so they went for a split cache with a dual-ported 8-banked data cache and a single-ported instruction cache. With today's fab technology, the triple-ported unified design may be better
Wider pipelines in later microarchitectures required higher parallelism at the data cache. Now we're at 4 64-byte ports in Sunny Cove.
Answering the Second Part of the Question
it was mentioned that the cache is a split cache, and no hazard what
does this exactly means?
It's probably about the structural hazard mentioned in Paul's comment. That is, a unified single-ported cache cannot be accessed by the instruction fetch unit and the memory unit at the time.

Can Multiprocessor CPUs avoid context-switching?

Today's computer architecture are trying to maximize the number of registers. It is faster to access a register (which is an integrated memory circuit near the cpu) than to access first-level cache. The problem is, that each context switch has to save all registers into cache, because the next thread needs other register values. What a modern CPU is doing is to cycle in one second through 100 tasks and everytime it saves the registers, and fetches the old one until the task can be started.
IMHO it would be nice to use one CPU for one task, and no context switching is happening. That means we get 100 CPUs, each 1000 registers which has to be never saved. Is that possible or have I a ignored an important detail?

The only way to completely avoid context switching is by having at least as many cores as there are tasks. Generally, there is no guarantee regarding the maximum number of tasks that may run. Current GPUs and manycore processors and co-processors contain hundreds of small cores. If you put multiple of these things in the same system or in a cluster of systems, you can have thousands or more cores. Still, even if you could avoid context switching with such design, these cores are much slower than the traditional high-end CPU cores, so the net effect might be negative.
But let's take a step back here. The number of context switches is not primarily determined by the number of tasks and cores. Tasks don't just perform computations, they also need to interact with I/O devices and wait for things to happen such as results from other tasks or user input. So some tasks would be in a wait state. The overhead of context switching depends on not only the number of tasks but also the behavior of these tasks.
Both processors architects and OS developers are aware of context switching overhead and employ a variety of techniques to alleviate it. For example, x86 provides a number of instructions that are tuned to saving the context (partially) of the current task. The OS thread scheduler uses techniques such as priorities, preemption (with possibly large time slices on servers), and priority boosting. All of these help reducing the number of context switches and therefore their overall overhead. In addition, reducing the overhead of context switching is not the only thing that matters. In particular, the responsiveness of the system is very important as well, which is at odds with that overhead.

Context Switch questions: What part of the OS is involved in managing the Context Switch?

I was asked to anwer these questions about the OS context switch, the question is pretty tricky and I cannot find any answer in my textbook:
How many PCBs exist in a system at a particular time?
What are two situations that could cause a Context Switch to occur? (I think they are interrupt and termination of a process,but I am not sure )
Hardware support can make a difference in the amount of time it takes to do the switch. What are two different approaches?
What part of the OS is involved in managing the Context Switch?

There can be any number of PCBs in the system at a given moment in time. Each PCB is linked to a process.
Timer interrupts in preemptive kernels or process renouncing control of processor in cooperative kernels. And, of course, process termination and blocking at I/O operations.
I don't know the answer here, but see Marko's answer
One of the schedulers from the kernel.

3: A whole number of possible hardware optimisations
Small register sets (therefore less to save and restore on context switch)
'Dirty' flags for floating point/vector processor register set - allows the kernel to avoid saving the context if nothing has happened to it since it was switched in. FP/VP contexts are usually very large and a great many threads never use them. Some RTOSs provide an API to tell the kernel that a thread never uses FP/VP at all eliminating even more context restores and some saves - particularly when a thread handling an ISR pre-empts another, and then quickly completes, with the kernel immediately rescheduling the original thread.
Shadow register banks: Seen on small embedded CPUs with on-board singe-cycle SRAM. CPU registers are memory backed. As a result, switching bank is merely a case of switching base-address of the registers. This is usually achieved in a few instructions and is very cheap. Usually the number of context is severely limited in these systems.
Shadow interrupt registers: Shadow register banks for use in ISRs. An example is all ARM CPUs that have a shadow bank of about 6 or 7 registers for its fast interrupt handler and a slightly fewer shadowed for the regular one.
Whilst not strictly a performance increase for context switching, this can help ith the cost of context switching on the back of an ISR.
Physically rather than virtually mapped caches. A virtually mapped cache has to be flushed on context switch if the MMU is changed - which it will be in any multi-process environment with memory protection. However, a physically mapped cache means that virtual-physical address translation is a critical-path activity on load and store operations, and a lot of gates are expended on caching to improve performance. Virtually mapped caches were therefore a design choice on some CPUs designed for embedded systems.

The scheduler is the part of the operating systems that manage context switching, it perform context switching in one of the following conditions:
1.Multitasking
2.Interrupt handling
3.User and kernel mode switching
and each process have its own PCB

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?

Simply put, most definetly. Read on for some discussion.
Tuning a service is an art-form or requires benchmarking (and the space for the amount of concepts you need to benchmark is huge). I believe that it depends on factors such as the following (this is not exhaustive).
how much time an item which is picked up from the ready qeueue takes to process, and
how many worker threads are their?
how many producers are their, and how often do they produce ?
what type of wait concepts are you using ? spin-locks or kernel-waits (the latter being slower) ?
So, if items are produced often, and if the amount of threads is large, and the processing time is low: the data structure could be locked for large windows, thus causing thrashing.
Other factors may include the data structure used and how long the data structure is locked for -e.g., if you use a linked list to manage such a queue the add and remove oprations take constant time. A prio-queue (heaps) takes a few more operations on average when items are added.
If your system is for business processing you could take this question out of the picture by just using:
A process based architecure and just spawning multiple producer consumer processes and using the file system for communication,
Using a non-preemtive collaborative threading programming language such as stackless python, Lua or Erlang.
also note: synchronization primitives cause inter-processor cache-cohesion floods which are not good and therefore should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D

A per-cpu ready queue is a natural selection for the data structure. This is because, most operating systems will try to keep a process on the same CPU, for many reasons, you can google for.What does that imply? If a thread is ready and another CPU is idling, OS will not quickly migrate the thread to another CPU. load-balance kicks in long run only.
Had the situation been different, that is it was not a design goal to keep thread-cpu affinities, rather thread migration was frequent, then keeping separate per-cpu run queues would be costly.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse