meaning of processor bandwidth - real-time

My question is about something that confuses me. I have read some references about real-time scheduling policies, including the Constant Bandwidth Server (CBS) policy, which gives each task a constant bandwidth of the processor to execute in.
My question is: what is the meaning of processor bandwidth?
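From the references I have read, each CBS server is given a budget Q of execution time out of every period T, so I understand its "bandwidth" to be the fraction U = Q/T of the processor. For example, Linux's SCHED_DEADLINE policy is said to be CBS-based; a minimal sketch (struct layout taken from the sched_setattr(2) man page, values purely illustrative) that reserves 10 ms of runtime every 100 ms, i.e. 10% of one CPU, would look like this:

    /* Hedged sketch: reserve 10 ms of CPU time every 100 ms with Linux
     * SCHED_DEADLINE (a CBS-based scheduler), i.e. a processor bandwidth
     * of 10/100 = 10% for this thread. Values are illustrative only. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    struct sched_attr {                 /* layout from the sched_setattr(2) man page */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;         /* budget Q per period, in ns */
        uint64_t sched_deadline;        /* relative deadline, in ns */
        uint64_t sched_period;          /* period T, in ns */
    };

    int main(void)
    {
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  =  10 * 1000 * 1000,   /* Q = 10 ms  */
            .sched_deadline = 100 * 1000 * 1000,   /* D = 100 ms */
            .sched_period   = 100 * 1000 * 1000,   /* T = 100 ms */
        };

        /* Needs root or CAP_SYS_NICE; pid 0 = calling thread. */
        if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
            perror("sched_setattr");
            return 1;
        }

        for (;;) {
            /* The periodic real-time work would go here; a busy loop like
             * this simply gets throttled to the reserved 10% by CBS. */
        }
    }

Is this 10% figure what is meant by the processor bandwidth given to the task?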

Related

Throughput vs latency in computer architecture

I've come across articles on "throughput vs latency" in contexts like networking, e.g. https://homepage.cs.uri.edu/~thenry/resources/unix_art/ch12s04.html But in the context of computer architecture / operating systems, I'm not able to understand why there would be a trade-off between latency (response time of a program) and throughput (how many programs we're able to complete in a unit of time, say per hour). Is this solely due to the fact that we can choose to parallelize processing of multiple programs / requests, leading to overheads like context switches & sharing of caches which make the start-to-end response time per process worse? Or am I missing something here?
In terms of single instructions in a superscalar pipelined out-of-order exec CPU, throughput vs. latency is very important because the CPU is trying to extract parallelism from an instruction stream that has to be executed as if in serial program order. See Assembly - How to score a CPU instruction by latency and throughput and the bottom of my answer on latency vs throughput in intel intrinsics for example.
In terms of OS decisions that affect throughput vs. latency on a much longer timescale than a few clock cycles, that's a totally separate question.
One of the major factors there is choosing how to use the available physical RAM, and whether to page out (to a swap file) infrequently used code / data to make more room to cache disk files. (e.g. Linux's vm.swappiness is widely considered a key tunable, typically set differently on servers vs. desktops. https://unix.stackexchange.com/questions/88693/why-is-swappiness-set-to-60-by-default).
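A minimal sketch, assuming the standard procfs path, that reads the current value the same way sysctl vm.swappiness does:

    /* Minimal sketch: read the current vm.swappiness value on Linux,
     * equivalent to "sysctl vm.swappiness" or "cat /proc/sys/vm/swappiness". */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/swappiness", "r");
        if (!f) {
            perror("fopen");
            return 1;
        }

        int swappiness;
        if (fscanf(f, "%d", &swappiness) == 1)
            printf("vm.swappiness = %d\n", swappiness);  /* default is 60 on most distros */
        fclose(f);
        return 0;
    }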
If you alt-tab to a window when many pages of that process have been paged out, it will take some time before the process can redraw its window. (Multiple hard page faults can be quite slow, especially if paging on a rotational disk rather than an SSD.) So to optimize for latency, you want the kernel not to aggressively swap out pages from running processes, even if they've been idle for a few hours. Those pages, if they had been freed, could have improved throughput for other processes by acting as buffers / cache.
A related factor is I/O scheduling: trying to group IO requests together to minimize HD seek times (for higher throughput and lower average latency), but sometimes at the expense of delaying a few requests for a longer time (higher worst-case latency). Linux for example has many to choose from, including deadline, Completely Fair Queuing (CFQ), and the original elevator (just grouping requests by locality without consideration of fairness or latency). https://wiki.archlinux.org/title/improving_performance#Input/output_schedulers
CPU scheduling is also a factor: a context-switch hurts throughput, as it takes time itself and caches will likely be cold for the new task on this CPU. You also have to run the kernel's schedule() function to decide which task to run next, so that takes away some time from real work.
To minimize latency (for example between a socket message being sent to a process and it waking up when its poll or select system call returns), you want a short timeslice, like Linux HZ=1000. (Timer interrupts every 1 ms to run the scheduler). And you want to be able to pre-empt even the kernel itself, instead of waiting until the kernel is ready to return to the old user-space to consider the possibility of running a different user-space task.
But neither of these helps throughput, and in fact they hurt it (assuming the workload has enough parallelism to not bottleneck on latency). So HZ=100 was the default for "server" Linux builds, vs. 1000 on "desktop" builds tuned for interactive use. (Modern Linux can be "tickless", not using a fixed timer interrupt on every core at all, instead deciding when to schedule the next interrupt on a case-by-case basis.)
Real-time kernels take this even further, spending more time on finer-grained locking and stuff like that to enable pausing work and coming back to it later to minimize interrupt latency and other latencies between it being time to do something and actually starting to do that thing. (There are real-time patches for Linux, and there are also totally separate kernels built from the ground up for real-time operation.)
If you have an embedded system controlling a motor or something, you absolutely need hard real-time latency guarantees that it will never take longer than say 1 millisecond from an interrupt pin being asserted to the interrupt handler starting to run.
(Designing the system to make these guarantees possible often comes at the cost of throughput. e.g. obviously you have to pin some memory to make it not swappable, if we're talking about user-space, making it unavailable for cache even if it goes untouched for days.)
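To make the memory-pinning point concrete, a minimal sketch (Linux/POSIX; needs CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK) that locks a process's pages so they can never be swapped out:

    /* Minimal sketch: a real-time-ish process pinning all of its current and
     * future pages into RAM so the kernel can never swap them out. This
     * trades throughput (that RAM is unavailable as page cache) for
     * predictable latency. */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }

        /* ... latency-sensitive work here: no hard page faults possible ... */

        munlockall();   /* undo the pinning when predictability no longer matters */
        return 0;
    }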

How does Scylla handle aggressive memflush & compaction on a write workload?

How does Scylla guarantee/keep write latency low under a write workload, given that more writes would produce more memflushes and compaction? Is there throttling applied? It would be really helpful if someone could answer.
In essence, Scylla provides consistent low latency by parallelizing the problem, and then properly prioritizing user-facing vs. back-office tasks.
Parallelizing - Scylla uses a shard-per-thread architecture. Each thread is responsible for all activities for its token range. Reads, writes, compactions, repairs, etc.
Prioritizing - Each thread is scheduled according to the priorities of the tasks. High priority tasks like dealing with read (query) and write (commitlog) receive the highest priority. Back-office tasks such as memtable flushes, compaction and repair are only done when there are spare cycles. Which - given the nanosecond scale of modern CPUs - there usually are.
If there are not enough spare cycles, and RAM or Disk start to fill, Scylla will bump the priority of the back-office tasks in order to save the node. So that will, in fact, inject some latency. But that is an indication that you are probably undersized, and should add some resources.
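To illustrate the scheduling idea only (this is a toy sketch in C, not Scylla or Seastar code; all names, numbers and thresholds are made up):

    /* Toy illustration only -- not Scylla/Seastar code. Foreground
     * (read/write) work always wins, background (flush/compaction) work runs
     * in spare cycles, and background work is promoted when memory/disk
     * pressure gets too high. */
    #include <stdio.h>

    static int    fg_queue = 5;          /* pending user reads/writes      */
    static int    bg_queue = 3;          /* pending flushes/compactions    */
    static double pressure = 0.5;        /* 0.0 .. 1.0 memory/disk pressure */
    #define PRESSURE_LIMIT 0.9           /* illustrative threshold         */

    int main(void)
    {
        while (fg_queue > 0 || bg_queue > 0) {
            if (pressure > PRESSURE_LIMIT && bg_queue > 0) {
                /* Pressure too high: bump back-office work, accept some latency. */
                bg_queue--; pressure -= 0.2;
                puts("background task (promoted to save the node)");
            } else if (fg_queue > 0) {
                fg_queue--; pressure += 0.1;   /* user-facing work first */
                puts("foreground read/write");
            } else {
                bg_queue--; pressure -= 0.2;   /* spare cycles: flush/compaction */
                puts("background flush/compaction");
            }
        }
        return 0;
    }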
I would recommend starting with the Scylla Architecture whitepaper at https://go.scylladb.com/real-time-big-data-database-principles-offer.html. There are also many in-depth talks from Scylla developers at https://www.scylladb.com/resources/tech-talks/
For example, https://www.scylladb.com/2020/03/26/avi-kivity-at-core-c-2019/ talks at great depth about shard-per-core.
https://www.scylladb.com/tech-talk/oltp-or-analytics-why-not-both/ talks at great depth about task prioritization.
Memtable flush is more urgent than regular compaction: on one hand we want to flush late, in order to create fewer sstables in level 0, and on the other we want to evacuate memory from RAM. We have a memory controller which automatically determines the flush condition. Flushing is done in the background while operations continue to go to the commitlog, which gets flushed according to the configured criteria.
Compaction is more of a background operation and we have controllers for it too. Go ahead and search the blog for compaction.

What is a limitation of using DMA transfer for slow periodic data? [duplicate]

I am learning about the STM32 F4 microcontroller. I'm trying to find out about limitations for using DMA.
Per my understanding and research, I know that if the data size is small (that is, the device uses DMA to generate or consume a small amount of data), the overhead is increased because DMA transfer requires the DMA controller to perform operations, thereby unnecessarily increasing system cost.
I did some research and found the following:
Limitation of DMA
CPU puts all its lines at high impedance state so that the DMA controller can then transfer data directly between device and memory without CPU intervention. Clearly, it is more suitable for device with high data transfer rates like a disk.
Over a serial interface, data is transferred one bit at a time which makes it slow to use DMA.
Is that correct? What else do I need to know?
DMA - CPU puts all its lines at high impedance state
I do not know where you took that from, but you should not use this source any more.
The frequency of the DMA transfers does not matter unless you reach the bus throughput. You can transfer one byte per week, month, year, decade ... and it is absolutely OK.
In STM32 microcontrollers it is a very important feature, as we can transfer data from/to external devices even when the uC is in a low-power mode with the core (CPU) sleeping. The DMA controller can even wake up the core when certain conditions are met.
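For example, on an STM32F4 with the ST HAL (a hedged sketch; it assumes huart2 and its RX DMA stream were already configured by the usual CubeMX-generated init code), the core can start a DMA reception and sleep until the transfer-complete interrupt wakes it:

    /* Hedged sketch (STM32 HAL, F4 series): receive a block over UART via DMA
     * while the core sleeps. Assumes huart2 and its RX DMA stream were already
     * configured elsewhere. */
    #include "stm32f4xx_hal.h"

    extern UART_HandleTypeDef huart2;          /* assumed: set up elsewhere */

    static volatile uint8_t rx_done = 0;
    static uint8_t rx_buf[64];

    /* Called from the DMA/UART interrupt when the whole buffer has arrived. */
    void HAL_UART_RxCpltCallback(UART_HandleTypeDef *huart)
    {
        if (huart == &huart2)
            rx_done = 1;
    }

    void receive_while_sleeping(void)
    {
        rx_done = 0;
        HAL_UART_Receive_DMA(&huart2, rx_buf, sizeof rx_buf);

        while (!rx_done) {
            /* Core sleeps; the DMA controller keeps filling rx_buf, and the
             * transfer-complete interrupt wakes the CPU back up. */
            HAL_PWR_EnterSLEEPMode(PWR_MAINREGULATOR_ON, PWR_SLEEPENTRY_WFI);
        }

        /* rx_buf now holds 64 bytes transferred with no CPU involvement. */
    }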
As #Vinci and #0___________ (f.k.a. #P__J__) already pointed out,
A DMA controller works autonomously and doesn't create overhead on the CPU it supplements (at least not by itself). But:
The CPU/software must perform some instructions to configure the DMA and to trigger it or have it triggered by some peripheral. For this, it needs CPU time and program memory space (usually ROM). Besides, it usually needs some additional RAM in variables to manage the software around the DMA.
Hence, you are right, using a DMA comes with some kinds of overhead.
And furthermore,
The DMA transfers make use of the memory bus(es) that connect the involved memories/registers/peripherals to the DMA controller. That is, while the DMA controller does its own work, it may cause the CPU which it tries to offload to stall in the meantime, at least for short moments when the data words are transferred (which in turn sum up for longer transfers...).
On the other hand, a DMA doesn't only help you to reduce the CPU load (regarding total CPU time to implement some feature). If used "in a smart way", it helps you to reduce software latencies when implementing different functions, because one part of the implementation can be "hidden" behind the DMA-driven data transfer of another part (unless both rely on the same bus resources - see above...).
The information is right in that using a DMA requires some development work and some runtime to manage the DMA transfer itself (see also a related question here), which may not be worth the benefits of using DMA. That is, for small portions of data one doesn't gain as much performance (or latency) as during big transfers. On embedded systems, DMA controllers (and their channels) are limited resources, so it is important to consider which part of the function benefits from such a resource most. Therefore, one would usually prefer using DMA for the data transfers to/from disks (if it is about "payload data" such as large files or video streams) over slow serial connections.
The information is wrong, however, in claiming that DMA is not worth using on serial interfaces because those only transfer a single bit at a time. Please note that microcontrollers (such as your STM32F4) have built-in peripheral components that convert the serial bit-by-bit stream into a byte-by-byte or word-by-word stream, which can easily be transferred by DMA in a helpful way - especially if the size of the packets is known in advance and software doesn't have to analyse a non-formatted stream. Furthermore, not every serial connection is "slow" at all. If the project uses, e.g., an SPI flash chip, then the SPI serial connection is the one used for data transfer.
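As a hedged sketch of that byte/word-stream point (ST HAL for an STM32F4; hspi1 and its RX DMA stream are assumed to be configured elsewhere, and the command/chip-select handling of a real SPI flash is omitted):

    /* Hedged sketch (STM32 HAL): pull a fixed-size, known-length packet from
     * an SPI peripheral via DMA. Assumes hspi1 and its RX DMA stream are
     * already configured. */
    #include "stm32f4xx_hal.h"

    extern SPI_HandleTypeDef hspi1;            /* assumed: set up elsewhere */

    static volatile uint8_t spi_done = 0;
    static uint8_t packet[256];                /* packet size known in advance */

    void HAL_SPI_RxCpltCallback(SPI_HandleTypeDef *hspi)
    {
        if (hspi == &hspi1)
            spi_done = 1;                      /* whole packet landed in RAM */
    }

    void read_packet_dma(void)
    {
        spi_done = 0;
        HAL_SPI_Receive_DMA(&hspi1, packet, sizeof packet);

        /* The CPU is free to do unrelated work here while the SPI peripheral
         * deserializes bits into bytes and the DMA moves them into packet[]. */
        while (!spi_done) {
            /* ... other work or a sleep ... */
        }
    }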

Direct Memory Access

I have this basic doubt about DMA. When the CPU has relinquished the bus for DMA to carry on with data fetching/ storing, how does it continue processing?
I mean, even the CPU has to fetch its instructions and store results back to the memory/IOs through the bus, does it not?
CPUs have cache, so they can do a lot without any actual main-memory accesses. Even low-power systems tend to have caches, because driving signals off-chip costs enough energy that a cache pays for itself in energy saved by cache hits.
More importantly, DMA doesn't "take over" the RAM, or even necessarily saturate memory bandwidth. The CPU doesn't "relinquish the bus"; the memory controller accepts read/write requests from the CPU core(s) and other system devices. Running a memory-heavy task on the CPU will slow down (delay) DMA, and vice versa, as the memory controller or system agent arbitrates access to memory, queuing read and write requests from all sources.
DMA is great for transfers that are still much slower than memory bandwidth. For example SATA III is 6 Gbit/s, while main memory bandwidth for dual-channel DDR3-1600 is about 25 GB/s. So programmed I/O would spend most of its time waiting for data from the SATA controller, not even bottlenecked on storing to RAM.
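A quick back-of-the-envelope check (assuming the usual 8b/10b line coding for SATA):

    \frac{6\ \mathrm{Gbit/s}\times 8/10}{8\ \mathrm{bit/byte}} = 600\ \mathrm{MB/s},
    \qquad \frac{600\ \mathrm{MB/s}}{25\ \mathrm{GB/s}} \approx 2.4\%

So even a fully saturated SATA III link uses only a few percent of that memory bandwidth.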
An example of how the pieces fit together in a modern Intel x86 CPU: see this diagram of Intel Skylake's system architecture (including eDRAM as memory-side cache). Sorry, I didn't find a simpler diagram showing just the cores and system agent, but in a system without eDRAM, the only thing to the right of the system agent is the memory controller, and everything else stays the same.
The memory controller is on-die, so the only off-chip connection in this diagram is the PCIe bus.
There are two basic types of DMA usage models. The first is when the CPU is waiting for the DMA to finish - a SYNCed operation, or a blocking DMA call. The other is when the CPU makes an ASYNC (or non-blocking) DMA request. This lets the CPU continue with the regular control flow. This way it can offload work to the DMA and go do something more important.
If I understand your question correctly, and as Peter said, when a CPU has made a non-blocking DMA request and the DMA is actively doing something on the bus, the CPU can still do all the regular operations, including accessing the RAM, because the bus can carry multiplexed traffic. Or in other words, the bus can handle multiple masters at the same time.
Coherency and consistency, which make things a little more complicated, are generally maintained using the right programming paradigms based on the hardware support.
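A toy sketch of the ASYNC usage model (the "DMA engine" here is just a POSIX thread standing in for real hardware; the point is the pattern of kicking off the transfer, continuing with other work, and only then waiting):

    /* Toy illustration of the non-blocking usage model. Link with -lpthread.
     * The "DMA engine" is simulated with a thread, but the programming
     * pattern is the same with a real hardware DMA controller. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    static char src[4096] = "device data...";
    static char dst[4096];
    static pthread_t dma_thread;

    static void *dma_engine(void *arg)       /* stands in for the DMA hardware */
    {
        (void)arg;
        memcpy(dst, src, sizeof dst);        /* the transfer itself */
        return NULL;
    }

    static void dma_start(void)              /* ASYNC / non-blocking request */
    {
        pthread_create(&dma_thread, NULL, dma_engine, NULL);
    }

    static void dma_wait(void)               /* SYNC point / blocking wait */
    {
        pthread_join(dma_thread, NULL);
    }

    int main(void)
    {
        dma_start();                         /* non-blocking: transfer begins */

        long other_work = 0;                 /* CPU keeps doing useful work,  */
        for (int i = 0; i < 1000000; i++)    /* possibly touching RAM, while  */
            other_work += i;                 /* the "DMA" runs concurrently   */

        dma_wait();                          /* block only when the data is needed */
        printf("work=%ld, dst=\"%s\"\n", other_work, dst);
        return 0;
    }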

RTOS example where GPOS will most likely fail

I want to know a few application examples where one needs to use RTOS in order to ensure a working system.
I did some Google searching, and I feel that whatever examples I found could be implemented using a Windows or Linux system.
The primary difference between an RTOS and a GPOS is that an RTOS guarantees deterministic response. That is to say that the worst-case response time to an event is precisely bounded (and usually fast). A GPOS generally schedules processes on a "balanced load" basis - it assumes that all processes and events are of equal importance and will be allotted a "fair" share of processor resources. For that reason, when a process has the CPU, unless it yields "cooperatively" it will have sole use of the CPU for the duration of its time slot (assuming a single core - multi-core processors allow true concurrency, but the GPOS still allots the cores on a balanced-load basis). A time slot may be several tens of milliseconds, and the time taken to service a particular process will depend greatly on the number of processes simultaneously demanding CPU time. Outside perhaps of implementing a kernel-level driver, achieving timing constraints of a few tens of microseconds (or less) is not possible (or desirable) in a GPOS.
If your application is what Microsoft's marketing used to call "soft" real-time (i.e. not real time at all), then a GPOS may suit it. Linux can be built with "real-time" scheduling support, but that does not really make Linux suited to a large set of "hard" real-time tasks; it is still "soft" in the sense that most of the time it will meet deadlines, but in some outlier conditions it may fail. If that is your medical life-support system, you probably don't want to trust to that!
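For what it's worth, this is roughly what requesting that "real-time" scheduling looks like from user space; a minimal sketch using the standard POSIX call (it improves scheduling latency but, as noted, gives no hard guarantee):

    /* Minimal sketch: request POSIX SCHED_FIFO scheduling on Linux. This gets
     * you "soft" real-time behaviour (more predictable scheduling latency),
     * not the hard guarantees of an RTOS. Needs root, CAP_SYS_NICE, or an
     * appropriate RLIMIT_RTPRIO. */
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 50 };   /* 1..99 */

        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {  /* 0 = this process */
            perror("sched_setscheduler");
            return 1;
        }

        /* Time-critical work here: it will pre-empt all normal (SCHED_OTHER)
         * tasks, but kernel-side latencies are still not strictly bounded. */
        return 0;
    }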
As an example of an attempt to run essentially real-time tasks on a GPOS that fails, years ago when MMX instructions were added to Pentium processors (running typically at 60MHz then), someone had the bright idea of "Host Signal Processing", a method applied to reduce the cost of PSTN modems (dial-up) by performing the signal processing on the PC rather than using a dedicated processor or DSP in the modem hardware - these "modems" were not really modems at all; they were telephone interfaces and digital converters for modem software. At the time I worked for a company producing PSTN modem test equipment, and we tried one of these early HSP modems, and it worked right up until you launched Microsoft Word (or pretty much any large application), when it would instantly drop the connection. Things improved as PCs became faster, but the point is that it was not guaranteed to work - it just mostly did.
Another example I have worked on is a carton-loading machine in food packaging. The product is inserted into the carton, a glue stripe is applied, and the closure is folded. The carton is moving continuously during this process, and the timing of the glue gun is critical - for a glue stripe to be accurate to within one millimetre on a carton moving at one metre per second requires timing within one millisecond.
Another example is that of TDMA communication as used in digital telephony, for example. Such communication allocates a time slot for each station's transmission, and failure to transmit in exactly the correct time slot, or encroaching on the time slot of another station, is unacceptable. Such systems are globally synchronised to atomic-clock accuracy (typically derived from a GPS receiver). A GSM time slot, for example, is 577 microseconds; in this time the transmitter must ramp up the transmitter power, transmit the data, and ramp down again.
In short, any example that requires 100 percent deterministic timing needs an RTOS. If your timing constraints are, say, > 100 ms, and a small probability of failure to meet timing is tolerable, then a GPOS may work all or most of the time. If timing constraints are sub-millisecond, or the cost or consequences of failure are unacceptable, then an RTOS is appropriate.