which cache level does cache-misses consider? - multicore

As you know, there is an event in Perf called cache-misses. there are two or three levels of cache in multi-core systems. May i know cache which of them is considered by cache-misses event?

Related

How does one core get work in a multi-core system?

Is it the Operating System who delegates any job to core?
What is that specific algorithm or a way, on which it is decided that the next task will be assigned to which cpu core?
Correct, it is the operating system's responsibility to designate tasks for the CPU to complete, regardless of how many cores it has. It does this via a scheduling algorithm, which decides in what order tasks/processes should be executed. In a symmetric multiprocessing environment, the OS views each core as an independent, identical CPU and therefore schedules them individually. When several cores are available, there are a couple important things to keep in mind:
1. Load balancing- For maximum performance, each core should be performing roughly the same amount of work.
2. Affinity- Because of caching, it is best (in terms of performance) for processes to complete the entirety of their execution on just one processor.
These things need to be kept in mind along with the traditional scheduling considerations of priority, fairness etc. Obviously, this topic is far too large for just one post to handle, so here are some resources that go in to further detail:
https://www.tutorialspoint.com/operating_system/os_process_scheduling_algorithms.htm
https://www.geeksforgeeks.org/multiple-processor-scheduling-in-operating-system/

Can Multiprocessor CPUs avoid context-switching?

Today's computer architecture are trying to maximize the number of registers. It is faster to access a register (which is an integrated memory circuit near the cpu) than to access first-level cache. The problem is, that each context switch has to save all registers into cache, because the next thread needs other register values. What a modern CPU is doing is to cycle in one second through 100 tasks and everytime it saves the registers, and fetches the old one until the task can be started.
IMHO it would be nice to use one CPU for one task, and no context switching is happening. That means we get 100 CPUs, each 1000 registers which has to be never saved. Is that possible or have I a ignored an important detail?
The only way to completely avoid context switching is by having at least as many cores as there are tasks. Generally, there is no guarantee regarding the maximum number of tasks that may run. Current GPUs and manycore processors and co-processors contain hundreds of small cores. If you put multiple of these things in the same system or in a cluster of systems, you can have thousands or more cores. Still, even if you could avoid context switching with such design, these cores are much slower than the traditional high-end CPU cores, so the net effect might be negative.
But let's take a step back here. The number of context switches is not primarily determined by the number of tasks and cores. Tasks don't just perform computations, they also need to interact with I/O devices and wait for things to happen such as results from other tasks or user input. So some tasks would be in a wait state. The overhead of context switching depends on not only the number of tasks but also the behavior of these tasks.
Both processors architects and OS developers are aware of context switching overhead and employ a variety of techniques to alleviate it. For example, x86 provides a number of instructions that are tuned to saving the context (partially) of the current task. The OS thread scheduler uses techniques such as priorities, preemption (with possibly large time slices on servers), and priority boosting. All of these help reducing the number of context switches and therefore their overall overhead. In addition, reducing the overhead of context switching is not the only thing that matters. In particular, the responsiveness of the system is very important as well, which is at odds with that overhead.

Why are CAS (Atomic) operations faster than synchronized or volatile operations

From what I understand, synchronized keyword syncs local thread cache with main memory. volatile keyword basically always reads the variable from the main memory at every access. Of course accessing main memory is much more expensive than local thread cache so these operations are expensive. However, a CAS operation use low level hardware operations but still has to access main memory. So how is a CAS operation any faster?
I believe the critical factor is as you state - the CAS mechanisms use low-level hardware instructions that allow for the minimal cache flushing and contention resolution.
The other two mechanisms (synchronization and volatile) use different architectural tricks that are more generally available across all different architectures.
CAS instructions are available in one form or another in most modern architectures but there will be a different implementation in each architecture.
Interesting quote from Brian Goetz (supposedly)
The relative speed of the operations is largely a non-issue. What is relevant is the difference in scalability between lock-based and non-blocking algorithms. And if you're running on a 1 or 2 core system, stop thinking about such things.
Non-blocking algorithms generally scale better because they have shorter "critical sections" than lock-based algorithms.
Note that a CAS does not necessarily have to access memory.
Most modern architectures implement a cache coherency protocol like MESI that allows the CPU to do shortcuts if there is only one thread accessing the data at the same time. The overhead compared to traditional, unsynchronized memory access is very low in this case.
When doing a lot of concurrent changes to the same value however, the caches are indeed quite worthless and all operations need to access main memory directly. In this case the overhead for synchronizing the different CPU caches and the serialization of memory access can lead to a significant performance drop (this is also known as cache ping-pong), which can be just as bad or even worse than what you experience with lock-based approaches.
So never simply assume that if you switch to atomics all your problems go away. The big advantage of atomics are the progress guarantees for lock-free (someone always makes progress) or wait-free (everyone finishes after a certain number of steps) implementations. However, this is often orthogonal to raw performance: A wait-free solution is likely to be significantly slower than a lock-based solution, but in some situations you are willing to accept that in order to get the progress guarantees.

Multi-core processor for multiple data containers

I have a dual core Intel processor and would like to use one core for processing certain commands like SATA writes and another for reads, how do we do it? Can this be controlled from the application(with multiple threads) or would this require a change in the kernel to ensure the reads/writes dont get processed by the the 'wrong' core?
This will be pretty much totally up to your operating system, which you haven't specified.
Some may offer thread affinity to try and keep one thread on the same execution engine (be that a core or a CPU), but that's only for threads. If two threads both write to disk, then they may well do so on different engines.
If you want that sort of low level control, it's probably best to do it at the kernel level.
My question to you would by "Why?". A great deal of performance tuning goes into OS kernels and they would generally know better than any application how to efficiently do this low level stuff.

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definetly. Read on for some discussion.
Tuning a service is an art-form or requires benchmarking (and the space for the amount of concepts you need to benchmark is huge). I believe that it depends on factors such as the following (this is not exhaustive).
how much time an item which is picked up from the ready qeueue takes to process, and
how many worker threads are their?
how many producers are their, and how often do they produce ?
what type of wait concepts are you using ? spin-locks or kernel-waits (the latter being slower) ?
So, if items are produced often, and if the amount of threads is large, and the processing time is low: the data structure could be locked for large windows, thus causing thrashing.
Other factors may include the data structure used and how long the data structure is locked for -e.g., if you use a linked list to manage such a queue the add and remove oprations take constant time. A prio-queue (heaps) takes a few more operations on average when items are added.
If your system is for business processing you could take this question out of the picture by just using:
A process based architecure and just spawning multiple producer consumer processes and using the file system for communication,
Using a non-preemtive collaborative threading programming language such as stackless python, Lua or Erlang.
also note: synchronization primitives cause inter-processor cache-cohesion floods which are not good and therefore should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-cpu ready queue is a natural selection for the data structure. This is because, most operating systems will try to keep a process on the same CPU, for many reasons, you can google for.What does that imply? If a thread is ready and another CPU is idling, OS will not quickly migrate the thread to another CPU. load-balance kicks in long run only.
Had the situation been different, that is it was not a design goal to keep thread-cpu affinities, rather thread migration was frequent, then keeping separate per-cpu run queues would be costly.