Is it safe to read from memory that DMA is writing data to?
I have an STM32F1 with an ADC set up to continuously perform conversions and transfer the data to a RAM buffer using DMA.
I know I can use an ADC interrupt to access the buffer safely, but what about accessing the buffer from a non-interrupt context? Could data be corrupted if I read from the same location that DMA is writing to?
Your data will not be corrupted - these chips have a bus arbiter that grants access to the bus (and hence to RAM) to either the DMA or the CPU (your code), so each transaction (a single access to RAM, not necessarily an access to a whole variable) is atomic.
See this info in the RM0008 Reference manual:
3.1 System architecture
...
BusMatrix
The BusMatrix manages the access arbitration between the core system bus and the DMA master bus. The arbitration uses a Round Robin algorithm. In connectivity line devices, the BusMatrix is composed of five masters (CPU DCode, System bus, Ethernet DMA, DMA1 and DMA2 bus) and three slaves (FLITF, SRAM and AHB2APB bridges). In other devices, the BusMatrix is composed of four masters (CPU DCode, System bus, DMA1 bus and DMA2 bus) and four slaves (FLITF, SRAM, FSMC and AHB2APB bridges). AHB peripherals are connected on the system bus through a BusMatrix to allow DMA access.
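To illustrate in code - a minimal sketch using the ST HAL (the ADC handle and its init code are assumed to exist elsewhere; names like adc_buf are mine):

#include "stm32f1xx_hal.h"   /* assumes a HAL-based project */

#define ADC_BUF_LEN 16
/* volatile: the DMA updates this buffer behind the compiler's back */
static volatile uint16_t adc_buf[ADC_BUF_LEN];

void start_sampling(ADC_HandleTypeDef *hadc)
{
    /* circular DMA keeps refilling the buffer with no CPU involvement */
    HAL_ADC_Start_DMA(hadc, (uint32_t *)adc_buf, ADC_BUF_LEN);
}

uint16_t read_sample(size_t i)
{
    /* a 16-bit read is a single bus transaction, so it cannot observe a
       half-written value - the BusMatrix arbitrates it against the DMA */
    return adc_buf[i];
}

Note that atomicity holds per bus transaction only: if you read a multi-word structure, the DMA may update part of it between your reads.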
My question relates to "What is disadvantage of calling sleep() inside mutex lock?"
I have a similar situation, and I don't quite understand why sleeping while holding a mutex is such a no-no. What are the consequences, in terms of deadlock, starvation, correctness, etc., of doing this?
I have an embedded system using an RTOS. That RTOS supports osDelay (sleep) and mutexes.
I have 2+ threads that may require access to an I2C bus peripheral in order to write to an EEPROM device on that bus. There may be other devices on the same I2C bus.
A mutex is taken to ensure exclusive access to the I2C peripheral (and hence the bus). While holding that mutex, I have to delay 10 ms between consecutive EEPROM writes - this is simply a requirement of the EEPROM chip; EEPROM reads don't require delays.
How can I meet my hardware constraints WITHOUT a 'delay inside mutex'?
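For reference, here is a minimal sketch of the pattern I'm asking about, using CMSIS-RTOS2 names (the mutex and the EEPROM helper are placeholders of mine):

#include "cmsis_os2.h"

extern osMutexId_t i2c_mutex;                        /* hypothetical: guards the I2C bus */
void eeprom_write_page(const uint8_t *p, size_t n);  /* hypothetical helper */

void eeprom_write_two_pages(const uint8_t *a, const uint8_t *b, size_t n)
{
    osMutexAcquire(i2c_mutex, osWaitForever);
    eeprom_write_page(a, n);
    osDelay(10);      /* the 10 ms EEPROM write-cycle wait - inside the mutex */
    eeprom_write_page(b, n);
    osMutexRelease(i2c_mutex);
}

Every other thread that wants the bus blocks for those 10 ms, which is exactly the 'sleep inside mutex' situation.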
Operating system: RHEL/CentOS 7.9 (latest)
Operation:
Sending 500 MB chunks 21 times from one system to another, connected via Mellanox cables.
(Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6])
(The registered memory region (500 MB) is reused for all 21 iterations.)
The gain in message send bandwidth when using aligned_alloc() (with the system page size of 4096 B) instead of malloc() for the registered memory is around 35 Gbps:
with malloc(): ~86 Gbps
with aligned_alloc(): ~121 Gbps
Since the CPU is not involved in these operations, how is this faster with aligned memory?
Please provide useful reference links, if available, that explain this.
What change does aligned memory bring to the read/write operations?
Is it the address translation within the device that gets improved?
[Very little information about this is available on the internet, hence asking here.]
RDMA operations use either MMIO or DMA to transfer data from main memory to the NIC via the PCIe bus - DMA is used for the larger transfers.
The behavior you're observing can be entirely explained by the DMA component of the transfer. DMA operates at the physical level, and a contiguous region in the Virtual Address Space is unlikely to be mapped to a contiguous region in the physical space. This fragmentation incurs costs - there's more translation needed per unit of transfer, and DMA transfers get interrupted at physical page boundaries.
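To make the allocation difference concrete, here is a minimal sketch (the RDMA registration itself, e.g. via ibv_reg_mr(), is omitted; only the buffer allocation differs):

#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);  /* 4096 B on this system */
    size_t len  = 500UL * 1024 * 1024;            /* 500 MB, as in the question */

    /* C11 aligned_alloc requires len to be a multiple of the alignment;
       500 MB is a multiple of 4096, so this is fine */
    void *buf = aligned_alloc(page, len);
    if (buf == NULL)
        return 1;

    /* buf starts exactly on a page boundary, so the registered region covers
       whole pages: no partially-used first/last page, and the NIC's address
       translation maps page-sized chunks with no odd head/tail fragments */
    free(buf);
    return 0;
}

With plain malloc() the buffer typically starts mid-page, so the region straddles one extra page and every DMA descriptor chain begins and ends with an unaligned fragment.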
[1] https://www.kernel.org/doc/html/latest/core-api/dma-api-howto.html
[2] Memory alignment
I am learning about the STM32F4 microcontroller and trying to find out the limitations of using DMA.
From my research so far, I understand that if the data size is small (that is, the device uses DMA to generate or consume only a small amount of data), the overhead increases, because a DMA transfer still requires the DMA controller to be set up and run, unnecessarily increasing system cost.
I also found the following:
Limitation of DMA
CPU puts all its lines at high impedance state so that the DMA controller can then transfer data directly between device and memory without CPU intervention. Clearly, it is more suitable for device with high data transfer rates like a disk.
Over a serial interface, data is transferred one bit at a time which makes it slow to use DMA.
Is that correct? What else do I need to know?
"CPU puts all its lines at high impedance state"
I do not know where you took this from, but you should not use this source any more.
The frequency of the DMA transfers does not matter unless you reach the bus throughput limit. You can transfer one byte per week, month, year, decade... and it is absolutely OK.
In STM32 microcontrollers DMA is a very important feature, as we can transfer data from/to external devices even if the uC is in a low-power mode with the core (CPU) sleeping. The DMA controller can even wake up the core when certain conditions are met.
As @Vinci and @0___________ (f.k.a. @P__J__) already pointed out,
A DMA controller works autonomously and doesn't create overhead on the CPU it supplements (at least not by itself). But:
The CPU/software must execute some instructions to configure the DMA and to trigger it, or have it triggered by some peripheral. For this, it needs CPU time and program memory space (usually ROM). Besides, it usually needs some additional RAM for variables to manage the software around the DMA.
Hence, you are right: using a DMA comes with some kinds of overhead.
And furthermore,
The DMA transfers make use of the memory bus(es) that connect the involved memories/registers/peripherals to the DMA controller. That is, while the DMA controller does its own work, it may cause the CPU it is trying to offload to stall, at least for the short moments when the data words are transferred (which in turn add up over longer transfers...).
On the other hand, a DMA doesn't only help you reduce the CPU load (in terms of total CPU time to implement some feature). If used "in a smart way", it helps you reduce software latencies, because one part of the implementation can be "hidden" behind the DMA-driven data transfer of another part (unless both rely on the same bus resources - see above...).
The information is right in that using a DMA requires some development work and some runtime to manage the DMA transfer itself (see also a related question here), which may not be worth the benefits of using DMA. That is, for small portions of data one doesn't gain as much performance (or latency) as with big transfers. On embedded systems, DMA controllers (and their channels) are limited resources, so it is important to consider which part of the function benefits most from such a resource. Therefore, one would usually prefer using DMA for data transfers to/from disks (if it is about "payload data" such as large files or video streams) over slow serial connections.
The information is wrong, however, in that DMA is not worth using on serial interfaces because those only transfer a single bit at a time. Please note that microcontrollers (such as your STM32F4) have built-in peripheral components that convert the serial bit-by-bit stream into a byte-by-byte or word-by-word stream, which can easily be transferred by DMA in a helpful way - especially if the size of the packets is known in advance and the software doesn't have to analyse a non-formatted stream. Furthermore, not every serial connection is "slow" at all. If the project uses, e.g., an SPI flash chip, then the SPI serial connection is the one used for data transfer.
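As a concrete example, here is a hedged sketch of DMA-driven UART transmission with the ST HAL (the handle huart2 and its DMA configuration are assumed to come from the usual init code):

#include "stm32f4xx_hal.h"

extern UART_HandleTypeDef huart2;   /* assumed: configured with a TX DMA stream */
static uint8_t tx_buf[256];

void send_packet(uint16_t len)
{
    /* the USART peripheral serializes the bytes bit-by-bit on the wire;
       the DMA feeds it byte-by-byte from RAM with no CPU involvement */
    HAL_UART_Transmit_DMA(&huart2, tx_buf, len);
    /* the CPU is free here; HAL_UART_TxCpltCallback() fires on completion */
}

The slow bit-by-bit serialization happens inside the peripheral, not on the CPU, so DMA remains worthwhile even on a "slow" serial link.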
I have two questions.
[image: memory region of the Cortex-M core CPU]
1. Is the memory of the STM32 microcontrollers inside the Cortex-M core or outside of it? And if it is inside the core, why is it not shown in the block diagram of the Cortex-M generic user guide? [image: block diagram of the Cortex-M core]
2. I'm trying to understand the STM32 architecture, but I'm facing an ambiguity.
[image: USART block diagram]
As you can see in the picture, the reference manual says that the USART unit has some registers (e.g. the Data Register).
But these registers also appear in the memory region of the Cortex-M core (if the answer to the first question is "inside"). Where are they really? Are there two copies of each register? Do they reside in the Cortex-M core or in the peripheral itself?
Is this related to the definition of memory-mapped I/O?
The only storage that's inside the CPU core is the registers (including general-purpose and special-purpose registers). Everything else is external, including RAM and ROM.
The peripheral control registers exist, essentially, inside the peripheral. However, they are accessed by the CPU in the same way that it accesses RAM or ROM; that's the meaning of the memory map: it shows you which addresses refer to RAM, ROM, peripheral registers, and other things. (Note that most of the memory map is unused - a 32-bit address space is capable of addressing 4 GB of memory, and no microcontroller that I know of has anything like that much storage.) The appropriate component 'responds' to read and write requests on the memory bus depending on the address.
For a basic overview the Wikipedia page on memory-mapped IO is reasonably good.
Note that none of this is specific to the Cortex-M. Almost all modern microprocessor designs use memory mapping. Note also that the actual bus architecture of the Cortex-M is reasonably complex, so any understanding you gain from the Wikipedia article will be of an abstraction of the true implementation.
Look at the image below, showing the block diagram of an STM32 Cortex-M4 processor.
I have highlighted the CPU core (top left) and the other components you can find inside the microcontroller.
The CPU "core", as its name implies, is just the "core"; the microcontroller also integrates flash memory, RAM, and a number of peripherals. Almost everything outside the core (except the debugging lines) is accessed by means of the bus matrix; this is equally true for ROM, RAM, and integrated peripherals.
Note that the main difference between a "microprocessor" and a "microcontroller" is that the latter has dedicated peripherals on board.
Peripherals on STM32 devices are accessed by the CPU via memory-mapped I/O; look at the picture below:
As you can see, the address space is linear from 0x00000000 to 0xFFFFFFFF, but it is partitioned into "segments", e.g. program memory starting at 0x00000000, SRAM at 0x20000000, and peripherals at 0x40000000. Specific peripheral registers can be read/written through pointers at specific offsets from the base address.
For this device, the USARTs are mapped to the APB1 area, thus in the address range 0x40000000-0x4000A000. Note that actual peripheral addresses can differ from device to device.
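In code, "pointers at specific offsets from the base address" looks like the sketch below (addresses taken from a typical STM32F4; always check your device's reference manual, since they vary):

#include <stdint.h>

#define USART2_BASE  0x40004400UL                                 /* on APB1 */
#define USART2_SR    (*(volatile uint32_t *)(USART2_BASE + 0x00))
#define USART2_DR    (*(volatile uint32_t *)(USART2_BASE + 0x04))

void usart2_putc(char c)
{
    while (!(USART2_SR & (1u << 7)))   /* wait for TXE (transmit register empty) */
        ;
    USART2_DR = (uint8_t)c;            /* this write lands in the peripheral, not RAM */
}

The volatile qualifier tells the compiler these are hardware registers whose values can change, and whose accesses have side effects, outside normal program flow.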
Peripherals are connected to the core via buses. The address decoder knows which address is handled by which bus.
It is not only peripherals that are connected via buses; memories are connected the same way. Buses are interconnected via bridges, and those bridges know how to direct the traffic.
From the core's point of view, a peripheral register works the same way as a memory cell.
What about the gaps? Usually, if the address decoder does not recognize an address, it will generate an exception - a hardware error (called a HardFault in ARM terminology).
The details are very complicated, and unless you are going to design your own chip, they are not needed by the register-level programmer.
I have a basic doubt about DMA. When the CPU has relinquished the bus for DMA to carry on fetching/storing data, how does it continue processing?
I mean, even the CPU has got to fetch its instructions and store results back to memory/IO through the bus, does it not?
CPUs have cache, so they can do a lot without any actual main-memory accesses. Even low-power systems tend to have caches, because driving signals off-chip costs enough energy that a cache pays for itself in energy saved by cache hits.
More importantly, DMA doesn't "take over" the RAM, or even necessarily saturate memory bandwidth. The CPU doesn't "relinquish the bus"; the memory controller accepts read/write requests from the CPU core(s) and from other system devices. Running a memory-heavy task on the CPU will delay DMA, and vice versa, as the memory controller or system agent arbitrates access to memory, queuing read and write requests from all sources.
DMA is great for transfers that are still much slower than memory bandwidth. For example, SATA III is 6 Gbit/s, while main-memory bandwidth for dual-channel DDR3-1600 is about 25 GByte/s. So programmed I/O would spend most of its time waiting for data from the SATA controller, not even bottlenecked on storing to RAM.
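To put rough numbers on that: SATA III's 6 Gbit/s line rate carries at most 600 MByte/s of payload after 8b/10b encoding, which is about 600 / 25600, i.e. roughly 2.4% of that dual-channel DDR3 bandwidth - the DMA traffic barely dents what the CPU has available.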
An example of how the pieces fit together in a modern Intel x86 CPU: see this diagram of Intel Skylake's system architecture (including eDRAM as a memory-side cache). Sorry, I didn't find a simpler diagram showing just the cores and the system agent, but in a system without eDRAM, the only thing to the right of the system agent is the memory controller, and everything else stays the same.
The memory controller is on-die, so the only off-chip connection in this diagram is the PCIe bus.
There are two basic DMA usage models. The first is when the CPU waits for the DMA to finish - a SYNCed or blocking DMA call. The other is when the CPU makes an ASYNC (non-blocking) DMA request. This lets the CPU continue with its regular control flow and offload work to the DMA while it does something more important.
If I understand your question correctly, and as Peter said, when the CPU has made a non-blocking DMA request and the DMA is actively doing something on the bus, the CPU can still do all its regular operations, including accessing the RAM, because the bus can carry multiplexed traffic. In other words, the bus can handle multiple masters at the same time.
Coherency and consistency, which make things a little more complicated, are generally maintained using the right programming paradigms, based on the available hardware support.
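A hedged sketch of the two models, using an STM32 HAL memory-to-memory DMA transfer as a stand-in (the hdma handle and do_other_work() are placeholders of mine; any platform's DMA API follows the same shape):

#include "stm32f1xx_hal.h"

extern DMA_HandleTypeDef hdma;      /* assumed: a configured mem-to-mem channel */
void do_other_work(void);           /* hypothetical useful work */

void copy_sync(uint32_t *src, uint32_t *dst, uint32_t n)
{
    /* SYNC / blocking: start the transfer, then poll until it completes */
    HAL_DMA_Start(&hdma, (uint32_t)src, (uint32_t)dst, n);
    HAL_DMA_PollForTransfer(&hdma, HAL_DMA_FULL_TRANSFER, HAL_MAX_DELAY);
}

void copy_async(uint32_t *src, uint32_t *dst, uint32_t n)
{
    /* ASYNC / non-blocking: start with interrupts enabled and keep the CPU busy;
       the transfer-complete interrupt signals when the data is ready */
    HAL_DMA_Start_IT(&hdma, (uint32_t)src, (uint32_t)dst, n);
    do_other_work();                /* runs concurrently with the DMA */
}

In the async case, the code must not touch dst until the completion interrupt (or a later poll) reports that the transfer is done - that is the consistency point mentioned above.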