Multi-core processor for multiple data containers - multicore

I have a dual core Intel processor and would like to use one core for processing certain commands like SATA writes and another for reads, how do we do it? Can this be controlled from the application(with multiple threads) or would this require a change in the kernel to ensure the reads/writes dont get processed by the the 'wrong' core?

This will be pretty much totally up to your operating system, which you haven't specified.
Some may offer thread affinity to try and keep one thread on the same execution engine (be that a core or a CPU), but that's only for threads. If two threads both write to disk, then they may well do so on different engines.
If you want that sort of low level control, it's probably best to do it at the kernel level.
My question to you would by "Why?". A great deal of performance tuning goes into OS kernels and they would generally know better than any application how to efficiently do this low level stuff.

Related

How does one core get work in a multi-core system?

Is it the Operating System who delegates any job to core?
What is that specific algorithm or a way, on which it is decided that the next task will be assigned to which cpu core?
Correct, it is the operating system's responsibility to designate tasks for the CPU to complete, regardless of how many cores it has. It does this via a scheduling algorithm, which decides in what order tasks/processes should be executed. In a symmetric multiprocessing environment, the OS views each core as an independent, identical CPU and therefore schedules them individually. When several cores are available, there are a couple important things to keep in mind:
1. Load balancing- For maximum performance, each core should be performing roughly the same amount of work.
2. Affinity- Because of caching, it is best (in terms of performance) for processes to complete the entirety of their execution on just one processor.
These things need to be kept in mind along with the traditional scheduling considerations of priority, fairness etc. Obviously, this topic is far too large for just one post to handle, so here are some resources that go in to further detail:
https://www.tutorialspoint.com/operating_system/os_process_scheduling_algorithms.htm
https://www.geeksforgeeks.org/multiple-processor-scheduling-in-operating-system/

Context Switch questions: What part of the OS is involved in managing the Context Switch?

I was asked to anwer these questions about the OS context switch, the question is pretty tricky and I cannot find any answer in my textbook:
How many PCBs exist in a system at a particular time?
What are two situations that could cause a Context Switch to occur? (I think they are interrupt and termination of a process,but I am not sure )
Hardware support can make a difference in the amount of time it takes to do the switch. What are two different approaches?
What part of the OS is involved in managing the Context Switch?
There can be any number of PCBs in the system at a given moment in time. Each PCB is linked to a process.
Timer interrupts in preemptive kernels or process renouncing control of processor in cooperative kernels. And, of course, process termination and blocking at I/O operations.
I don't know the answer here, but see Marko's answer
One of the schedulers from the kernel.
3: A whole number of possible hardware optimisations
Small register sets (therefore less to save and restore on context switch)
'Dirty' flags for floating point/vector processor register set - allows the kernel to avoid saving the context if nothing has happened to it since it was switched in. FP/VP contexts are usually very large and a great many threads never use them. Some RTOSs provide an API to tell the kernel that a thread never uses FP/VP at all eliminating even more context restores and some saves - particularly when a thread handling an ISR pre-empts another, and then quickly completes, with the kernel immediately rescheduling the original thread.
Shadow register banks: Seen on small embedded CPUs with on-board singe-cycle SRAM. CPU registers are memory backed. As a result, switching bank is merely a case of switching base-address of the registers. This is usually achieved in a few instructions and is very cheap. Usually the number of context is severely limited in these systems.
Shadow interrupt registers: Shadow register banks for use in ISRs. An example is all ARM CPUs that have a shadow bank of about 6 or 7 registers for its fast interrupt handler and a slightly fewer shadowed for the regular one.
Whilst not strictly a performance increase for context switching, this can help ith the cost of context switching on the back of an ISR.
Physically rather than virtually mapped caches. A virtually mapped cache has to be flushed on context switch if the MMU is changed - which it will be in any multi-process environment with memory protection. However, a physically mapped cache means that virtual-physical address translation is a critical-path activity on load and store operations, and a lot of gates are expended on caching to improve performance. Virtually mapped caches were therefore a design choice on some CPUs designed for embedded systems.
The scheduler is the part of the operating systems that manage context switching, it perform context switching in one of the following conditions:
1.Multitasking
2.Interrupt handling
3.User and kernel mode switching
and each process have its own PCB

How are the stack pointer and program status word maintained in multiprocessor architecture?

In a multi-processor architecture, how are registers organized?
For example, in a 4 cores processor, a minimum of 4 processes can run at a time.
How are stack pointer, program status registers and program counter organized?
What about other general purpose registers?
My guess is, each core will have a separate set of registers.
Imagine 4 completely separate computers, each with a single-core CPU. A 4-core computer is like that; except:
All CPUs share the same physical address space (and can all use the same RAM, PCI devices, etc)
Interrupt/IRQ controllers may be designed so the OS can tell it which CPU/s should be interrupted by the IRQ
CPUs are typically able to signal each other (e.g. "inter-processor interrupts")
Some CPUs may share some caches
Some CPUs may share some control registers (e.g. for things like power management, cache configuration, etc)
For modern CPUs, some CPUs may share some or all execution units (SMT, hyper-threading, etc)
For modern systems (where memory controller is built into the physical chip) some CPUs may share the same memory controller
Most of this is "invisible" to most software. Unless you're writing part of an OS that controls power management, you don't need to care if power management is shared between CPUs or not; unless you're writing an OSs/kernel's low level IRQ handling you don't need to care how IRQs reach device drivers, etc.
The same applies to how many CPUs actually exist. The OS/kernel normally ensures that applications only need to care about higher level abstractions (e.g. "threads"). How this higher level abstraction works depends on the OS - normally (for most OSs) the OS/kernel attempts to provide the illusion that all threads are running at the same time by switching between them "quickly enough" (where if there's only 4 CPUs a maximum of 4 threads actually do run at the exact same time), but it's usually far more complex than this (involving things like thread priorities, pre-emption rules, etc) and (even though it's relatively rare) it may be very different (e.g. for some systems the same thread may be run on multiple CPUs at the same time for fault tolerance/redundancy purposes; for some systems there might just be a queue of functions and their data, where multiple functions run at the same time; etc).
Multiprocessor means that there are at least two discrete processors on the same platform -- usually on the same motherboard
A subset is distributed multiprocessing, where two PC's for example are programmed to appear as a single system with two processors
Multicore means that the most or all of the CPU is replicated many times on single chip.
- this also means that stack, status, program counter and all generic purpose registers are replicated.
Hyperthreading is a technique, where each stage of the pipeline executes commands from different processes.
Multiprocessing means in OS level that everything a process consists of, is switched every now and then.
Multithreading is a lightweight variant of multiprocessing, where the threads e.g. share the same code segment and same data segment, same file descriptors etc. but have unique stacks (and of course unique status registers and program counters)
Also means multiprocessing in general (hardware architecture)

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definetly. Read on for some discussion.
Tuning a service is an art-form or requires benchmarking (and the space for the amount of concepts you need to benchmark is huge). I believe that it depends on factors such as the following (this is not exhaustive).
how much time an item which is picked up from the ready qeueue takes to process, and
how many worker threads are their?
how many producers are their, and how often do they produce ?
what type of wait concepts are you using ? spin-locks or kernel-waits (the latter being slower) ?
So, if items are produced often, and if the amount of threads is large, and the processing time is low: the data structure could be locked for large windows, thus causing thrashing.
Other factors may include the data structure used and how long the data structure is locked for -e.g., if you use a linked list to manage such a queue the add and remove oprations take constant time. A prio-queue (heaps) takes a few more operations on average when items are added.
If your system is for business processing you could take this question out of the picture by just using:
A process based architecure and just spawning multiple producer consumer processes and using the file system for communication,
Using a non-preemtive collaborative threading programming language such as stackless python, Lua or Erlang.
also note: synchronization primitives cause inter-processor cache-cohesion floods which are not good and therefore should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-cpu ready queue is a natural selection for the data structure. This is because, most operating systems will try to keep a process on the same CPU, for many reasons, you can google for.What does that imply? If a thread is ready and another CPU is idling, OS will not quickly migrate the thread to another CPU. load-balance kicks in long run only.
Had the situation been different, that is it was not a design goal to keep thread-cpu affinities, rather thread migration was frequent, then keeping separate per-cpu run queues would be costly.

What are the differences between multi-CPU, multi-core and hyper-thread?

Could anyone explain to me the differences between multi-CPU, multi-core, and hyper-thread? I am always confused about these differences, and about the pros/cons of each architecture in different scenarios.
Here is my current understanding after learning online and learning from others' comments.
I think hyper-thread is the most inferior technology among them, but cheap. Its main idea is duplicate registers to save context switch time;
Multi processor is better than hyper-thread, but since different CPUs are on different chips, the communication between different CPUs is of longer latency than multi-core, and using multiple chips, there is more expense and more power consumption than with multi-core;
multi-core integrates all the CPUs on a single chip, so the latency of communication between different CPUs are greatly reduced compared with multi-processor. Since it uses one single chip to contain all CPUs, it consumer less power and is less expensive than a multi processor system.
Is this correct?
Multi-CPU was the first version: You'd have one or more mainboards with one or more CPU chips on them. The main problem here was that the CPUs would have to expose some of their internal data to the other CPU so they wouldn't get in their way.
The next step was hyper-threading. One chip on the mainboard but it had some parts twice internally so it could execute two instructions at the same time.
The current development is multi-core. It's basically the original idea (several complete CPUs) but in a single chip. The advantage: Chip designers can easily put the additional wires for the sync signals into the chip (instead of having to route them out on a pin, then over the crowded mainboard and up into a second chip).
Super computers today are multi-cpu, multi-core: They have lots of mainboards with usually 2-4 CPUs on them, each CPU is multi-core and each has its own RAM.
[EDIT] You got that pretty much right. Just a few minor points:
Hyper-threading keeps track of two contexts at once in a single core, exposing more parallelism to the out-of-order CPU core. This keeps the execution units fed with work, even when one thread is stalled on a cache miss, branch mispredict, or waiting for results from high-latency instructions. It's a way to get more total throughput without replicating much hardware, but if anything it slows down each thread individually. See this Q&A for more details, and an explanation of what was wrong with the previous wording of this paragraph.
The main problem with multi-CPU is that code running on them will eventually access the RAM. There are N CPUs but only one bus to access the RAM. So you must have some hardware which makes sure that a) each CPU gets a fair amount of RAM access, b) that accesses to the same part of the RAM don't cause problems and c) most importantly, that CPU 2 will be notified when CPU 1 writes to some memory address which CPU 2 has in its internal cache. If that doesn't happen, CPU 2 will happily use the cached value, oblivious to the fact that it is outdated
Just imagine you have tasks in a list and you want to spread them to all available CPUs. So CPU 1 will fetch the first element from the list and update the pointers. CPU 2 will do the same. For efficiency reasons, both CPUs will not only copy the few bytes into the cache but a whole "cache line" (whatever that may be). The assumption is that, when you read byte X, you'll soon read X+1, too.
Now both CPUs have a copy of the memory in their cache. CPU 1 will then fetch the next item from the list. Without cache sync, it won't have noticed that CPU 2 has changed the list, too, and it will start to work on the same item as CPU 2.
This is what effectively makes multi-CPU so complicated. Side effects of this can lead to a performance which is worse than what you'd get if the whole code ran only on a single CPU. The solution was multi-core: You can easily add as many wires as you need to synchronize the caches; you could even copy data from one cache to another (updating parts of a cache line without having to flush and reload it), etc. Or the cache logic could make sure that all CPUs get the same cache line when they access the same part of real RAM, simply blocking CPU 2 for a few nanoseconds until CPU 1 has made its changes.
[EDIT2] The main reason why multi-core is simpler than multi-cpu is that on a mainboard, you simply can't run all wires between the two chips which you'd need to make sync effective. Plus a signal only travels 30cm/ns tops (speed of light; in a wire, you usually have much less). And don't forget that, on a multi-layer mainboard, signals start to influence each other (crosstalk). We like to think that 0 is 0V and 1 is 5V but in reality, "0" is something between -0.5V (overdrive when dropping a line from 1->0) and .5V and "1" is anything above 0.8V.
If you have everything inside of a single chip, signals run much faster and you can have as many as you like (well, almost :). Also, signal crosstalk is much easier to control.
You can find some interesting articles about dual CPU, multi-core and hyper-threading on Intel's website or in a short article from Yale University.
I hope you find here all the information you need.
In a nutshell: multi-CPU or multi-processor system has several processors. A multi-core system is a multi-processor system with several processors on the same die. In hyperthreading, multiple threads can run on the same processor (that is the context-switch time between these multiple threads is very small).
Multi-processors have been there for 30 years now but mostly in labs. Multi-core is the new popular multi-processor. Server processors nowadays implement hyperthreading along with multi-processors.
The wikipedia articles on these topics are quite illustrative.
Hyperthreading is a cheaper and slower alternative to having multiple-cores
The Intel Manual Volume 3 System Programming Guide - 325384-056US September 2015 8.7 "INTEL HYPER-THREADING TECHNOLOGY ARCHITECTURE" describes HT briefly. It contains the following diagram:
TODO it is slower by how much percent in average in real applications?
Hyperthreading is possible because modern single CPUs cores already execute multiple instructions at once with the instruction pipeline https://en.wikipedia.org/wiki/Instruction_pipelining
The instruction pipeline is a separation of functions inside of a single core to ensure that each part of the circuit is used at any given time: reading memory, decoding instructions, executing instructions, etc.
Hyperthreading separates functions further by using:
a single backend, which actually runs the instructions with its pipeline.
Dual core has two backends, which explains the greater cost and performance.
two front-ends, which take two streams of instructions and order them in a way to maximize pipelining usage of the single backend by avoiding hazards.
Dual core would also have 2 front-ends, one for each backend.
There are edge cases where instruction reordering produces no benefit, making hyperthreading useless. But it produces a significant improvement in average.
Two hyperthreads in a single core share further cache levels (TODO how many? L1?) than two different cores, which share only L3, see:
Multiple threads and CPU cache
How are cache memories shared in multicore Intel CPUs?
The interface that each hyperthread exposes to the operating system is similar to that of an actual core, and both can be controlled separately. Thus cat /proc/cpuinfo shows me 4 processors, even though I only have 2 cores with 2 hyperthreads each.
Operating systems can however take advantage of knowing which hyperthreads are on the same core to run multiple threads of a given program on a single core, which might improve cache usage.
This LinusTechTips video contains a light-hearted non-technical explanation: https://www.youtube.com/watch?v=wnS50lJicXc
Multi-CPU is a bit like multicore, but communication can only happen through RAM, not L3 cache
This means that if possible, you want to partition tasks that use the same memory a lot for each separate CPU.
E.g. the following SBI-7228R-T2X blade server contains 4 CPUs, 2 on each node:
Source.
We can see that there seem to be 4 sockets for the CPUs, each covered by a heat sink, with one open.
I think across the nodes, they don't even share RAM memory and must communicate through some kind of networking, thus representing one further step up on the hyperthread/multicore/multi-CPU hierarchy, TODO confirm:
https://scicomp.stackexchange.com/questions/7530/difference-between-nodes-and-cpus-when-running-software-on-a-cluster
SLURM nodes, tasks, cores, and cpus
https://www.quora.com/In-High-Performance-Computing-what-exactly-is-the-difference-between-the-terms-%E2%80%9Ccores-%E2%80%9D-%E2%80%9Cprocessors-%E2%80%9D-%E2%80%9Cnodes-%E2%80%9D-and-%E2%80%9Cclusters%E2%80%9D