What is a limitation of using DMA transfer for slow periodic data? [duplicate]

I am learning about the STM32 F4 microcontroller. I'm trying to find out about limitations for using DMA.
Per my understanding and research, I know that if the data size is small (that is, the device uses DMA to generate or consume a small amount of data), the overhead is increased because DMA transfer requires the DMA controller to perform operations, thereby unnecessarily increasing system cost.
I did some research and found the following:
Limitation of DMA
CPU puts all its lines at high impedance state so that the DMA controller can then transfer data directly between device and memory without CPU intervention. Clearly, it is more suitable for device with high data transfer rates like a disk.
Over a serial interface, data is transferred one bit at a time which makes it slow to use DMA.
Is that correct? What else do I need to know?

DMA -CPU puts all its lines at high impedance state
I do not know where you took this from, but you should not use this source any more.
The frequency of DMA transfers does not matter unless you reach the bus throughput limit. You can transfer one byte per week, month, year, or decade, and it is absolutely OK.
In the STM32 microcontrollers it is a very important feature, as we can transfer data from/to external devices even if the uC is in low power mode with the core (CPU) sleeping. The DMA controller can even wake up the core when some conditions are met.

As #Vinci and #0___________ (f.k.a. #P__J__) already pointed out,
A DMA controller works autonomously and doesn't create overhead on the CPU it supplements (at least not by itself). But:
The CPU/software must perform some instructions to configure the DMA and to trigger it or have it triggered by some peripheral. For this, it needs CPU time and program memory space (usually ROM). Besides, it usually needs some additional RAM in variables to manage the software around the DMA.
Hence, you are right, using a DMA comes with some kinds of overhead.
And furthermore,
The DMA transfers make use of the memory bus(es) that connect the involved memories/registers/peripherals to the DMA controller. That is, while the DMA controller does its own work, it may cause the CPU which it tries to offload to stall in the meantime, at least for short moments when the data words are transferred (which in turn sum up for longer transfers...).
On the other hand, a DMA doesn't only help you to reduce the CPU load (regarding total CPU time to implement some feature). If used "in a smart way", it helps you to reduce software latencies when implementing different functions, because one part of the implementation can be "hidden" behind the DMA-driven data transfer of another part (unless both rely on the same bus resources - see above...).

The information is right in that using a DMA requires some development work and some runtime to manage the DMA transfer itself (see also a related question here), which may not be worth the benefits of using DMA. That is, for small portions of data one doesn't gain as much performance (or latency) as during big transfers. On embedded systems, DMA controllers (and their channels) are limited resources so it is important to consider which part of the function benefits from such a resource most. Therefore, one would usually prefer using DMA for the data transfers to/from disks (if it is about "payload data" such as large files or video streams) over slow serial connections.
The information is wrong, however, in that DMA is not worth using on serial interfaces as those only transfer a single bit at a time. Please note that microcontrollers (such as your STM32F4) have built-in peripheral components that convert the serial bit-by-bit stream into a byte-by-byte or word-by-word stream, which can easily be transferred by DMA in a helpful way - especially if the size of the packets is known in advance and software doesn't have to analyse a non-formatted stream. Furthermore, not every serial connection is "slow". If the project uses, e.g., an SPI flash chip, then the SPI serial connection is the one used for data transfer.
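The point about setup overhead versus transfer size can be made concrete with a toy cost model. This is only a sketch: the two cycle counts below are made-up placeholders, not STM32F4 figures, and the function names are hypothetical. It compares the CPU cycles spent when the core copies every word itself against a fixed cost for configuring and servicing a DMA transfer.

```python
# Hypothetical cycle costs; real values depend on the MCU, the driver and the bus.
DMA_SETUP_CYCLES = 200   # assumed: configure the stream, start it, handle completion
CPU_CYCLES_PER_WORD = 4  # assumed: load + store + loop overhead per word copied by the CPU


def cpu_cost(words: int) -> int:
    """CPU cycles spent if the core moves every word itself."""
    return words * CPU_CYCLES_PER_WORD


def dma_cost(words: int) -> int:
    """CPU cycles spent when DMA moves the data: only the fixed setup cost."""
    return DMA_SETUP_CYCLES if words > 0 else 0


def break_even_words() -> int:
    """Smallest transfer size at which DMA costs the CPU fewer cycles."""
    words = 1
    while dma_cost(words) >= cpu_cost(words):
        words += 1
    return words


if __name__ == "__main__":
    print("break-even transfer size (words):", break_even_words())
```

With these assumed numbers, transfers shorter than the break-even size cost the CPU more with DMA than without. The model deliberately ignores bus contention and the latency benefit of letting the CPU do other work in the meantime.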

Related

How does SEND bandwidth improve when the registered memory is aligned to system page size? (In Mellanox IBD)

Operating System: RHEL Centos 7.9 Latest
Operation:
Sending 500MB chunks 21 times from one System to another connected via Mellanox Cables.
(Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6])
(The registered memory region (500MB) is reused for all the 21 iterations.)
The gain in Message Send Bandwidth when using aligned_alloc() (with system page size 4096B) instead of malloc() for the registered memory is around 35Gbps.
with malloc() : ~86Gbps
with aligned_alloc() : ~121Gbps
Since the CPU is not involved for these operations, how is this operation faster with aligned memory?
Please provide useful reference links if available that explains this.
What change does aligned memory bring to the read/write operations?
Is it the address translation within the device that gets improved?
[Very limited information is present over the internet about this, hence asking here.]

RDMA operations use either MMIO or DMA to transfer data from the main memory to the NIC via the PCI bus - DMA is used for larger transfers.
The behavior you're observing can be entirely explained by the DMA component of the transfer. DMA operates at the physical level, and a contiguous region in the Virtual Address Space is unlikely to be mapped to a contiguous region in the physical space. This fragmentation incurs costs - there's more translation needed per unit of transfer, and DMA transfers get interrupted at physical page boundaries.
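As a rough illustration of the page-boundary part of this, the sketch below counts how many 4096-byte pages a 500 MB buffer touches depending on where it starts. The addresses are invented for the example, and this only shows the geometry of the buffer: it does not model the NIC's address translation or explain the size of the measured gap.

```python
PAGE_SIZE = 4096  # bytes, the system page size quoted in the question


def pages_touched(start_addr: int, length: int, page_size: int = PAGE_SIZE) -> int:
    """Number of distinct pages covered by [start_addr, start_addr + length)."""
    first_page = start_addr // page_size
    last_page = (start_addr + length - 1) // page_size
    return last_page - first_page + 1


def first_page_offset(start_addr: int, page_size: int = PAGE_SIZE) -> int:
    """How far into its first page the buffer begins (0 means page aligned)."""
    return start_addr % page_size


if __name__ == "__main__":
    length = 500 * 1024 * 1024           # 500 MB region
    aligned = 0x7F0000000000             # hypothetical page-aligned address
    unaligned = aligned + 16             # hypothetical malloc()-style address
    for name, addr in (("aligned", aligned), ("unaligned", unaligned)):
        print(name, first_page_offset(addr), pages_touched(addr, length))
```

An aligned buffer starts exactly on a page boundary and covers a whole number of pages; an unaligned one starts mid-page and spills into one extra, partially used page, so its first and last chunks are not full pages.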
[1] https://www.kernel.org/doc/html/latest/core-api/dma-api-howto.html
[2] Memory alignment

Throughput vs latency in computer architecture

I've come across articles on "throughput vs latency" in contexts like networking, e.g. https://homepage.cs.uri.edu/~thenry/resources/unix_art/ch12s04.html But in the context of computer architecture / operating systems, I'm not able to understand why there would be a trade-off between latency (response time of a program) and throughput (how many programs we're able to complete in a unit of time, say per hour). Is this solely due to the fact that we can choose to parallelize processing of multiple programs / requests, leading to overheads like context switches & sharing of caches which make the start-to-end response time per process worse? Or am I missing something here?

In terms of single instructions in a superscalar pipelined out-of-order exec CPU, throughput vs. latency is very important because the CPU is trying to extract parallelism from an instruction stream that has to be executed as if in serial program order. See Assembly - How to score a CPU instruction by latency and throughput and the bottom of my answer on latency vs throughput in intel intrinsics for example.
In terms of OS decisions that affect throughput vs. latency on a much longer timescale than a few clock cycles, that's a totally separate question.
One of the major factors there is choosing how to use the available physical RAM, and whether to page out (to a swap file) infrequently used code / data to make more room to cache disk files. (e.g. Linux's vm.swappiness is widely considered a key tunable in terms of setting it differently between servers and desktops. https://unix.stackexchange.com/questions/88693/why-is-swappiness-set-to-60-by-default).
If you alt-tab to a window when many pages of that process have been paged out, it will take some time before the process can redraw its window. (Multiple hard page faults can be quite slow, especially if paging on a rotational disk, not SSD.) So to optimize for latency, you want the kernel to not aggressively swap out pages from running processes, even if they've been idle for a few hours. Those pages, if they'd been free, could have improved throughput for other processes by acting as buffers / cache.
A related factor is I/O scheduling: trying to group IO requests together to minimize HD seek times (for higher throughput and lower average latency), but sometimes at the expense of delaying a few requests for a longer time (higher worst-case latency). Linux for example has many to choose from, including deadline, Completely Fair Queuing (CFQ), and the original elevator (just grouping requests by locality without consideration of fairness or latency). https://wiki.archlinux.org/title/improving_performance#Input/output_schedulers
CPU scheduling is also a factor: a context-switch hurts throughput, as it takes time itself and caches will likely be cold for the new task on this CPU. You also have to run the kernel's schedule() function to decide which task to run next, so that takes away some time from real work.
To minimize latency (for example between a socket message being sent to a process and it waking up when its poll or select system call returns), you want a short timeslice, like Linux HZ=1000. (Timer interrupts every 1 ms to run the scheduler). And you want to be able to pre-empt even the kernel itself, instead of waiting until the kernel is ready to return to the old user-space to consider the possibility of running a different user-space task.
But neither of these helps throughput, and in fact both hurt it (assuming the workload has enough parallelism to not bottleneck on latency). So HZ=100 was the default for "server" Linux builds, vs. 1000 on "desktop" builds tuned for interactive use. (Modern Linux can be "tickless", not using a fixed timer interrupt on every core at all, instead deciding when to schedule the next interrupt on a case by case basis.)
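The HZ trade-off can be sketched with some simple arithmetic. This is a simplified model, not how the Linux scheduler actually accounts time: it assumes a task is switched out on every timer tick, and the 5 microsecond context-switch cost is a made-up placeholder.

```python
def tick_period_ms(hz: int) -> float:
    """Time between timer interrupts for a given HZ setting, in milliseconds."""
    return 1000.0 / hz


def overhead_fraction(switch_cost_us: float, timeslice_us: float) -> float:
    """Share of CPU time lost to switching if a switch follows every timeslice."""
    return switch_cost_us / (timeslice_us + switch_cost_us)


if __name__ == "__main__":
    ASSUMED_SWITCH_COST_US = 5.0  # hypothetical cost of one context switch
    for hz in (1000, 100):
        period_us = tick_period_ms(hz) * 1000.0
        lost = overhead_fraction(ASSUMED_SWITCH_COST_US, period_us)
        print(f"HZ={hz}: tick every {tick_period_ms(hz)} ms, overhead {lost:.4%}")
```

A shorter tick bounds how long a newly runnable task may have to wait, but the fixed switching cost is paid ten times as often, which is the throughput side of the trade-off.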
Real-time kernels take this even further, spending more time on finer-grained locking and stuff like that to enable pausing work and coming back to it later to minimize interrupt latency and other latencies between it being time to do something and actually starting to do that thing. (There are real-time patches for Linux, and there are also totally separate kernels built from the ground up for real-time operation.)
If you have an embedded system controlling a motor or something, you absolutely need hard real-time latency guarantees that it will never take longer than say 1 millisecond from an interrupt pin being asserted to the interrupt handler starting to run.
(Designing the system to make these guarantees possible often comes at the cost of throughput. e.g. obviously you have to pin some memory to make it not swappable, if we're talking about user-space, making it unavailable for cache even if it goes untouched for days.)

Direct Memory Access

I have this basic doubt about DMA. When the CPU has relinquished the bus for DMA to carry on with data fetching/ storing, how does it continue processing?
I mean, even the CPU has to fetch its instructions and store results back to memory/IO through the bus, does it not?

CPUs have cache, so they can do a lot without any actual main-memory accesses. Even low-power systems tend to have caches, because driving signals off-chip costs enough energy that a cache pays for itself in energy saved by cache hits.
More importantly, DMA doesn't "take over" the RAM, or even necessarily saturate memory bandwidth. The CPU doesn't "relinquish the bus"; the memory controller accepts read/write requests from the CPU core(s) and other system devices. Running a memory-heavy task on the CPU will slow down DMA, as well as the other way around, as the memory controller or system agent arbitrates access to memory, queuing read and write requests from all sources.
DMA is great for transfers that are still much slower than memory bandwidth. For example SATAIII is 6 Gbits/s, while main memory bandwidth for dual-channel DDR3-1600MHz is about 25 GBytes/s. So programmed-io would spend most of its time waiting for data from the SATA controller, not even bottlenecked on storing to RAM.
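To put numbers on that comparison, here is a back-of-the-envelope sketch. It assumes SATA III's 8b/10b line encoding (10 bits on the wire per payload byte) and a 64-bit wide DDR3 channel; the function names are made up for illustration, and these are theoretical peak rates, not measured ones.

```python
def sata3_payload_bytes_per_s() -> int:
    """Peak SATA III payload rate: 6 Gbit/s line rate, 8b/10b encoded."""
    line_rate_bits = 6_000_000_000
    payload_bits = line_rate_bits * 8 // 10  # 8 payload bits per 10 line bits
    return payload_bits // 8


def ddr3_1600_dual_channel_bytes_per_s() -> int:
    """Peak dual-channel DDR3-1600 rate: 1600 MT/s, 8 bytes per transfer."""
    transfers_per_s = 1_600_000_000
    bytes_per_transfer = 8  # 64-bit channel
    channels = 2
    return transfers_per_s * bytes_per_transfer * channels


if __name__ == "__main__":
    sata = sata3_payload_bytes_per_s()
    dram = ddr3_1600_dual_channel_bytes_per_s()
    print(f"SATA III: {sata / 1e6:.0f} MB/s")
    print(f"DDR3-1600 x2: {dram / 1e9:.1f} GB/s")
    print(f"ratio: {dram / sata:.1f}x")
```

Even at its peak, the SATA link can only fill a few percent of the memory bandwidth, which is why a DMA transfer from disk leaves plenty of room for the CPU's own memory traffic.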
An example of how the pieces fit together in a modern Intel x86 CPU:
this diagram of Intel Skylake's system architecture (including eDRAM as memory-side cache). Sorry I didn't find a simpler diagram showing just the cores and system agent, but in a system without eDRAM, the only thing to the right of the system agent is the memory controller, and everything else stays the same.
The memory controller is on-die, so the only off-chip connection in this diagram is the PCIe bus.

There are two basic types of DMA usage models. First is when a CPU is waiting for the DMA to finish - SYNCed operation or a blocking DMA call. The other is when the CPU makes an ASYNC (or non-blocking) DMA request. This lets CPU continue with the regular control flow. This way it can off-load work to DMA to do something more important.
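The two usage models can be imitated in plain Python, with a worker thread standing in for the DMA engine. This is only an analogy under that assumption: `dma_copy`, `blocking_transfer` and `nonblocking_transfer` are hypothetical names, and no real DMA hardware is involved.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def dma_copy(src: bytes, dst: bytearray) -> int:
    """Stand-in for the DMA engine: copy src into dst, return the byte count."""
    dst[:len(src)] = src
    return len(src)


def blocking_transfer(engine: ThreadPoolExecutor, src: bytes, dst: bytearray) -> int:
    """SYNC model: start the transfer and wait until it has finished."""
    return engine.submit(dma_copy, src, dst).result()


def nonblocking_transfer(engine: ThreadPoolExecutor, src: bytes, dst: bytearray) -> Future:
    """ASYNC model: start the transfer and return immediately with a handle."""
    return engine.submit(dma_copy, src, dst)


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=1) as engine:
        buf = bytearray(4)
        handle = nonblocking_transfer(engine, b"data", buf)
        other_work = sum(range(1000))   # the "CPU" keeps going meanwhile
        copied = handle.result()        # later: check or wait for completion
        print(copied, bytes(buf), other_work)
```

In the non-blocking case the caller must not touch the destination buffer until the handle reports completion, which mirrors the coherency and consistency concerns that come with real DMA.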
If I understand your question correctly, and as Peter said, when a CPU has made a non-blocking DMA request and the DMA is actively doing something on the bus, the CPU can still do all the regular operations, including accessing the RAM, because the bus can carry multiplexed traffic. In other words, the bus can handle multiple masters at the same time.
Coherency and consistency, which make things a little more complicated, are generally maintained using the right programming paradigms based on the hardware support.

RTOS example where GPOS will most likely fail

I want to know a few application examples where one needs to use RTOS in order to ensure a working system.
I did some google search and whatever examples I found, I feel could be implemented using a windows or linux system.

The primary difference between an RTOS and a GPOS is that an RTOS guarantees deterministic response. That is to say that the worst case response time to an event is precisely bounded (and usually fast). A GPOS schedules processes generally on a "balanced load" basis - it assumes that all processes and events are of equal importance and will be allotted a "fair" share of processor resources. For that reason when a process has the CPU, unless it yields "cooperatively" it will have sole use of the CPU for the duration of its time slot (assuming a single core - multi-core processors allow true concurrency, but the GPOS still allots the cores on a balanced load basis). A time slot may be several tens of milliseconds, and the time taken to service a particular process will depend greatly on the number of processes simultaneously demanding CPU time. Outside perhaps of implementing a kernel level driver, achieving timing constraints of a few tens of microseconds (or less) is not possible (or desirable) in a GPOS.
If your application is what Microsoft's marketing used to call "soft" real-time (i.e. not real time at all) then a GPOS may suit. Linux can be built with "real-time" scheduling support, but it does not really make Linux suited to a large set of "hard" real-time tasks, and it is still "soft" in the sense that most of the time it will meet deadlines, but in some outlier conditions it may fail. If that is your medical life-support system, you probably don't want to trust to that!
As an example of an attempt to run essentially real-time tasks on a GPOS that fails, years ago when MMX instructions were added to Pentium processors (running typically at 60MHz then), someone had the bright idea of "Host Signal Processing", a method applied to reduce the cost of PSTN modems (dial-up) by performing the signal processing on the PC rather than using a dedicated processor or DSP in the modem hardware - these "modems" were not really modems at all; they were telephone interfaces and digital converters for modem software. At the time I worked for a company producing PSTN modem test equipment, and we tried one of these early HSP modems, and it worked right up until you launched Microsoft Word (or pretty much any large application), when it would instantly drop the connection. Things improved as PCs became faster, but the point is that it was not guaranteed to work - it just mostly did.
Another example I have worked on is a carton loading machine in food packaging. The product is inserted into the carton, a glue stripe applied, and the closure folded. The carton is moving continuously during this process and the timing of the glue gun is critical - for a glue stripe to be accurate to within one millimetre on a carton moving at one metre per second requires timing within one millisecond.
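The glue-gun figure is just distance divided by speed, which a short sketch makes explicit. The function names are made up for illustration, and the 30 ms delay in the example is an assumed value in the "several tens of milliseconds" range mentioned above for a GPOS time slot.

```python
def timing_tolerance_ms(position_tolerance_mm: float, speed_m_per_s: float) -> float:
    """Allowed timing error: mm / (m/s) gives milliseconds directly."""
    return position_tolerance_mm / speed_m_per_s


def position_error_mm(speed_m_per_s: float, delay_ms: float) -> float:
    """How far the carton travels during a delay: (m/s) * ms gives millimetres."""
    return speed_m_per_s * delay_ms


if __name__ == "__main__":
    print("tolerance:", timing_tolerance_ms(1.0, 1.0), "ms")
    # Assumed scheduling delay of 30 ms on a general-purpose OS.
    print("error after 30 ms delay:", position_error_mm(1.0, 30.0), "mm")
```

A single late wake-up of a few tens of milliseconds would therefore put the stripe centimetres away from where it belongs.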
Another example is that of TDMA communication as used in digital telephony. Such communication allocates a time slot for each station's transmission, and failure to transmit in exactly the correct time slot, or encroaching on the time slot of another station, is unacceptable. Such systems are globally synchronised to atomic-clock accuracy (typically derived from a GPS receiver). A GSM time slot, for example, is 577 microseconds; in this time, the transmitter must ramp up the transmitter power, transmit the data and ramp down.
In short any example that requires 100 percent deterministic timing needs an RTOS. If your timing constraints are say > 100ms, and a small probability of failure to meet timing is tolerable, then a GPOS may work all or most of the time. If timing constraints are sub-millisecond or the cost or consequences of failure unacceptable, then an RTOS is appropriate.

PLC capability and operating principles

I come from a C/C++ background, a lot of which has been in an embedded systems context. None of those embedded systems have involved PLCs - it had never made sense to have one CPU doing all its C/C++ logic, then surrendering control of the I/O to some other device when (usually) you could just do it yourself because the I/O was directly connected to your CPU.
With the advent of EtherCAT, we are seeing advantages in moving our I/O onto EtherCAT for its flexibility, modularity, etc. However, the preferred mode of driving a lot of the EtherCAT hardware seems to be via a PLC. In the case of the Beckhoff TwinCAT PLC environment, trying to bypass the PLC seems to be either technically difficult or expensive or both.
Which makes us want to know a lot of things about PLCs... starting with:
is it best to think about them as a serial processing device, parallel processing device, or neither (does it depend on the specific device)?
are they a "Turing Complete" general purpose computing device, or do they have limitations?
do they run the entire PLC program (loops and all) every PLC cycle?
if the PLC I/O is not controlling some industrial process under the supervision of a maintenance department, and/or takes place on millisecond timescales, might those be good reasons to make full use of more modern programming techniques (structured text rather than ladder diagrams, for example), in contrast to advice in the likes of this answer?

Just to cover both interpretations of serial and parallel: PLC logic processing is sequential.
Most PLCs can be programmed via serial, USB or Ethernet connections.
As regards the devices that PLCs connect to, they are usually serial. For instance, many industrial control system networks use Profibus, which is a serial bus based communication - typically Profibus uses the RS-485 serial interface. I can’t really think of a place where I have seen parallel communication. Most are serial - MODBUS, DeviceNet etc. With parallel you have problems with the extra cost of cabling, noise, long distances etc.
Yes, PLC languages are Turing complete, but probably not as convenient as other programming languages. For example, with a Siemens PLC you have a choice of how you implement the logic - Ladder, S7 Graph (these are graphical based), Statement List (instruction based), Function Block Diagram, Structured Control Language (similar to Pascal). This is a nice article comparing PLC programming languages with guidelines for how to choose a language: http://www.automation.com/pdf_articles/IEC_Programming_Thayer_L.pdf
The PLC scan time is the time taken by a PLC to read inputs, execute the whole program and, based on the logic just processed, update the outputs accordingly. PLC scan time is not deterministic as it depends on inputs, outputs, timers, memory etc. Usually PLCs are used where speed is required - for slower processes DCSs can be used. It would be usual to see execution times of 4-6 ms. With most PLCs you can modify the default maximum cycle monitoring time. If this time expires, the CPU can be commanded to stop or an interrupt can be triggered with the required logic. Note that in many cases a scan time greater than 1 second is "undesirable"!
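For readers coming from C/C++, the scan cycle described above can be sketched in a few lines of Python. This is a conceptual model only, not how any particular PLC runtime is implemented: `run_scan`, `CycleTimeExceeded` and the motor latch program are hypothetical, and the exception stands in for the stop or interrupt that the cycle monitoring would trigger.

```python
import time
from typing import Callable, Dict

IOImage = Dict[str, bool]


class CycleTimeExceeded(Exception):
    """Raised when one scan takes longer than the configured maximum."""


def run_scan(read_inputs: Callable[[], IOImage],
             program: Callable[[IOImage], IOImage],
             write_outputs: Callable[[IOImage], None],
             max_cycle_s: float,
             clock: Callable[[], float] = time.monotonic) -> float:
    """Run one scan: read inputs, execute the whole program, update outputs."""
    start = clock()
    inputs = read_inputs()        # 1. take a snapshot of all inputs
    outputs = program(inputs)     # 2. execute the entire program on that snapshot
    write_outputs(outputs)        # 3. update the outputs from the results
    elapsed = clock() - start
    if elapsed > max_cycle_s:     # cycle monitoring time expired
        raise CycleTimeExceeded(f"scan took {elapsed:.3f} s")
    return elapsed


if __name__ == "__main__":
    motor_on = False

    def latch(inputs: IOImage) -> IOImage:
        """Start/stop latch, the classic first rung of a ladder program."""
        global motor_on
        motor_on = (inputs["start"] or motor_on) and not inputs["stop"]
        return {"motor": motor_on}

    run_scan(lambda: {"start": True, "stop": False}, latch, print, max_cycle_s=0.15)
```

A real controller repeats this scan indefinitely; state such as the latch persists between scans, while the program itself runs from top to bottom each time.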
In my experience, almost none of the PLCs I have worked on are composed of simple ladder logic networks. PLCs are not simple representations of physical relays. They are used to control intricate, often safety critical processes interacting with a multitude of different devices/equipment. Also, in the majority of cases you have a SCADA system to implement, and you may have enterprise level applications (MES, ERP) to consider. Many processes require complex scheduling and logic control algorithms - vial filling, bio pharma, electrical, oil & gas... there is a long list. As per the above link, it depends on your need, but modern processes often dictate the need for more than a simple program composed of a few ladder networks.

A more "modern" programming language (actually ST is more modern than C) often also means more complexity in the program, which is something that should be avoided in the PLC world. PLCs are real-time machines, where cycle times, maintainability, robustness and clarity are far more important than on regular PCs (which are not real-time) and in the embedded world. If PLCs were programmed the same way as most handheld devices, we would be living in a world where having the lights on would be a totally random act, since the power plant just crashed because of a programming bug.
A Murray's answer is better than I would ever write, but since I can't comment yet I wanted to underline these parts I wrote here.
