Linux Streaming DMA ReMapping without Unmapping - linux-device-driver

I have noticed that the signature of pci_unmap_sg(I show dma_unmap_sg_attrs, which is called by pci_unmap_sg through two macros and has clear parameter naming) includes the direction and attributes.
static inline void dma_unmap_sg_attrs(
struct device *dev,
struct scatterlist *sg,
int nents,
enum dma_data_direction dir,
struct dma_attrs *attrs)
I wonder why it is necessary to know about the direction and attributes for unmapping. Initially i thought mapping was a bit like malloc and free. But seeing this I wonder if something like the following is legal:
dma_map_sg_attrs(..., dir=DMA_BIDIRECTIONAL,...);
...
dma_unmap_sg_attrs(..., dir=DMA_FROM_DEVICE ,...);
//continue use in TO_DEVICE direction
or
dma_map_sg_attrs(..., dir=DMA_TO_DEVICE,...);
...
dma_map_sg_attrs(..., dir=DMA_FROM_DEVICE ,...);
//start bidirectional use
Also can I do this (stream data via DMA from one device to the other if they cant directly dma each other):
dma_map_sg_attrs(dev1, ..., dir=DMA_FROM_DEVICE ,...);
dma_map_sg_attrs(dev2, ..., dir=DMA_TO_DEVICE ,...);
I tried digging into the function, but ended at get_dma_ops which gets function pointers from a global. But how to follow this code further is another question
Update
I find the synchronization api even more confusing:
pci_dma_sync_sg_for_cpu(
struct pci_dev *hwdev,
struct scatterlist *sg,
int nelems,
int direction
)
What is the reason for this api to know the direction? Is there no way for the api to remember the original mapping direction and is it thus to eleminate sync_for_cpu if we mapped as DMA_TO_DEVICE only?

The answer would be simple and robust simultaneously. It depends how deep you know the CPU architecture you are using. The problem with DMA is CPU caches. CPU cache has to be coherent for DMA. Thus, we have different directions in the sync API since we may access data from both sides DMA and CPU. And like you know already the sync API is a cause of direction parameter in the map API.
P.S. It's not anyhow the comprehensive answer. For that you have to go through specific literature and documentation. I suggest you to start with Documentation/DMA*.txt files.

The tipps from Andy were right, while I thought that most of the DMA_ATTR_* wont help me since my device does the ordering so there is no way for the kernel to handle some of them, however if you look at DMA_ATTR_SKIP_SYNC_CPU you find (from DMA-attributes.txt):
71 DMA_ATTR_SKIP_CPU_SYNC
72 ----------------------
73
74 By default dma_map_{single,page,sg} functions family transfer a given
75 buffer from CPU domain to device domain. Some advanced use cases might
76 require sharing a buffer between more than one device. This requires
77 having a mapping created separately for each device and is usually
78 performed by calling dma_map_{single,page,sg} function more than once
79 for the given buffer with device pointer to each device taking part in
80 the buffer sharing. The first call transfers a buffer from 'CPU' domain
81 to 'device' domain, what synchronizes CPU caches for the given region
82 (usually it means that the cache has been flushed or invalidated
83 depending on the dma direction). However, next calls to
84 dma_map_{single,page,sg}() for other devices will perform exactly the
85 same synchronization operation on the CPU cache. CPU cache synchronization
86 might be a time consuming operation, especially if the buffers are
87 large, so it is highly recommended to avoid it if possible.
88 DMA_ATTR_SKIP_CPU_SYNC allows platform code to skip synchronization of
89 the CPU cache for the given buffer assuming that it has been already
90 transferred to 'device' domain. This attribute can be also used for
91 dma_unmap_{single,page,sg} functions family to force buffer to stay in
92 device domain after releasing a mapping for it. Use this attribute with
93 care!

Related

ThreadX RAM issue on STM32

I'm currently starting to use ThreadX on a STM32 Nucleo-H723ZG (STM32H723ZG MCU).
I noticed that when loading the Nx_TCP_Echo_Server / Nx_TCP_Echo_Client projects from CubeMX, the RAM gets filled up pretty much to the top, which makes me wonder, how I'm supposed to add my own code and data here.
Since I'm pretty new to RAM partitioning, RTOS and similar, I don't have a perfect feeling for what is wrong or right and how to proceed (and if it is a problem at all).
Nevertheless I wonder, if maybe using a different way of partitioning the RAM or by dropping some non-necessary code-parts, the RAM could be freed-up.
Or a different way of thinking:
Since RAM_D1 got filled, but _D2, _D3 and DTCMRAM are pretty much empty, is there a way to use the free RAM for my own purposes (I would like to let SPI and ADC processing run via DMA, so this needs a place to go ....)
Hope my questions are not too confusing ;)
The system has the following amount of RAM, according to STM:
"SRAM: total 564 Kbytes all with ECC, including 128 Kbytes of data TCM RAM for critical real-time data + 432 Kbytes of system RAM (up to 256 Kbytes can remap on instruction TCM RAM for critical real time instructions) + 4 Kbytes of backup SRAM (available in the lowest-power modes)" (see STMs STM32H723ZG MCU product page)
Down below you'll find screenshots of the current RAM usage, for RAM_D1 especially .tcp_sec eats up most of the RAM.
--> Can .tcp_sec be optimized or kicked-out?
If tcp means here the tcp protocol, maybe this can be a way to optimize this, since I'm not sure whether I need a handshake etc., maybe UDP is sufficient (and faster for the ADC data streaming) ... what do you say?
Edit:
The linker-file shows, that there .tcp_sec (NOLOAD) is written ... is NOLOAD maybe a hint on a "placebo" RAM occupation (pre-allocation / reservation, but no actual usage?)
Linker-script extract:
/* User_heap_stack section, used to check that there is enough RAM left */
._user_heap_stack :
{
. = ALIGN(8);
PROVIDE ( end = . );
PROVIDE ( _end = . );
. = . + _Min_Heap_Size;
. = . + _Min_Stack_Size;
. = ALIGN(8);
} >RAM_D1
.tcp_sec (NOLOAD) : {
. = ABSOLUTE(0x24048000);
*(.RxDecripSection)
. = ABSOLUTE(0x24048060);
*(.TxDecripSection)
} >RAM_D1 AT> FLASH
For context:
I am developing a "system controller", where my plan is to have it running a RTOS, which manages the read-in of analog values, writing control messages via SPI to two other STMs of the same kind and communicating via Ethernet to my desktop application.
The desktop application is then in charge of post-processing the digitized analog values and sending control messages to the system controller. In the best case the system controller digitizes the analog signal on ADC3 with 5 MSPS (at probably 6 Bit resolution = 30 MBit/s) and sends that data hickup-free to my desktop application.
-> Is this plan possible on this MCU?
I tried to buy a higher (more RAM) version of the nucleo I've got, but due to shortages this one is the best one I was able to get.
For the RTOS I'd like to stick with ThreadX, since FreeRTOS support in STM32IDE seems to be phased out now, after ThreadX was employed as the RTOS by STM.
(I like the easy register configuration using CubeMX/STM32 IDE, hence my drive to use that SW universe ... if there are good reasons to use a different RTOS, tell me :) )
Thank you for your time!
I generated the same project on my side and took a look. I believe you should be able to implement what you want in this CPU. You will need to carefully use the available memory.
It seems there is a confusion about the section .tcp_sec. It contains DMA reception and transmission descriptors for the Ethernet controller/driver. These are constrained by the driver and hardware to be at a specific address. The descriptors are rather small, but the buffers are bigger. With some work these can be reduced. If you are using Ethernet you will need this, no mater if you use TCP or not. As I said, the name can be confusing.
The flash has still plenty of space available. In the debug configuration only about 11% is used. The rest is available for your application code.
You can locate you data in other memory regions. Depending on the toolchain you will use is how you will need to tell the compiler/linker where your data goes. You can look towards the top of the main.c file in that example to see how the DMA descriptors are assigned to a specific section for three different toolchains (IAR, ARM MDK, GCC).
In terms of how to most efficiently use and configure the microcontroller peripherals please get in touch with STMicro, they will know best.
This should get you started. Let us know if this helps!

STM32 - Can I subvert the cool context switching for interrupts?

The STM32 family has fantastic interrupt service, they stack a whole slew of extra registers for you, and load the LR with an artificial return to properly unstack while looking for opportunities for tail chaining, aborted entry, etc etc.
HOWEVER....it is too damn slow. I am finding (STM32F730Z8, 200 MHz clock, all code including handlers in ITCM, everything in GNU assembly) that it takes about 120-150 ns overhead to get into an interrupt.
I am still learning about these, used to the old ARM7 where you had to do it all yourself, however, in those chips, if you had a minimal handler you didn't need to stack much.
So -- can I "subvert" the context switching in hardware, and just have it leap to the handler at elevated priority, pausing only to fill the pipeline, and leaving me to take care of stacking what is needed? I don't think so, and haven't seen a way to do it, but I'm working on an extremely tight time-sensitive realtime code, and interrupt switching is eating all my time budget. I'm reverting to doing it all in low-code, polled, but I hate the jitter that gives me on response to pin edges. Help?
No, this is done in pure hardware and is the main defining feature of all "Cortex-M" processors, not just STM32.
150ns at 200MHz is 30 cycles. You can probably get it quite a bit faster.
One way is to mark the floating point unit as unused each time you finish with it, and to set a core flag to tell it not to save the floating point registers. See ARM application note 298 for details.
Another method that you might try is to move your vector table and interrupt handler code to internal SRAM. STM32 has a flash memory accelerator which avoids most wait states on internal flash by performing prefetch of sequential instructions, but an asynchronous interrupt will probably not benefit from this.
that it takes about 120-150 ns overhead to get into an interrupt.
It is not the truth at all
It takes 60ns as it takes 12 clock cycles.

Loading OS image from floppy disk without INT 13 of BIOS Service

How can I load OS image from floppy disk to memory without BIOS Service while booting my PC?
The only way I’ve used is calling int13h in real mode .
I got to know that I need to handle with ‘Disk controller’ .
Do I need to write kinda ‘Device driver’ in [BIT 16] real mode and is it possible?
As 0andriy has commented, you will have to communicate with the floppy controller directly, bypassing the BIOS. (Which BTW, why do you want to do such a thing? The BIOS was made specifically so you don't have to do this. Is it solely because you want to, maybe to learn how to program the FDC? I'm okay with that.)
The FDC (Floppy Disk Controller) is of the ISA (Industry Standard Architecture) era, back when I/O ports were hard coded to specific addresses. The FDC came in many variants, but most followed a standard rule. The original 756 was a common FDC, with later (still really old to today's standards) controllers following the 82077AA variant.
These controllers had twelve (12) registers using eight (8) I/O Byte addresses, Base + 00h to Base + 07h. (Please note that a single I/O address can be two registers if one is a read and one is a write.) You read and write to these registers to instruct the FDC to do things, such as start the motor for drive 1. (For fun: Did you know that the FDC was originally capable of handling four drives?)
This isn't to difficult to do, but now you have to have some way for the ISA bus to communicate with the FDC and the main memory. In comes the DMA (Direct Memory Access). Now you have to also program the DMA to make the transfers.
Here is the catch. If you don't have all of the FDC and DMA code within the first 512 bytes of the floppy, the 512 bytes the BIOS loaded for you already, there is no way to load the remaining sectors. For example, you can't have your DMA code in the second sector of your boot code expecting to call it, since you have to use that DMA to load that sector in the first place. All FDC and DMA code, at least a minimum read service, must be in the first sector of the disk. This is quite difficult to do, reliably.
I am not saying it is impossible to do, I am just saying it is improbable. For one thing, if you can do it (reliably) in 512 bytes, I would like to see it. It might be a fun experiment. Anyway, do a search for FDC, DMA, etc., things I wrote of here. There are many examples on the web. If you wish to read a book about it, I wrote such a book a while back with all the juicy details.

pycuda shared memory up to device hard limit

This is an extension of the discussion here: pycuda shared memory error "pycuda._driver.LogicError: cuLaunchKernel failed: invalid value"
Is there a method in pycuda that is equivalent to the following C++ API call?
#define SHARED_SIZE 0x18000 // 96 kbyte
cudaFuncSetAttribute(func, cudaFuncAttributeMaxDynamicSharedMemorySize, SHARED_SIZE)
Working on a recent GPU (Nvidia V100), going beyond 48 kbyte shared memory requires this function attribute be set. Without it, one gets the same launch error as in the topic above. The "hard" limit on the device is 96 kbyte shared memory (leaving 32 kbyte for L1 cache).
There's a deprecated method Fuction.set_shared_size(bytes) that sounds promising, but I can't find what it's supposed to be replaced by.
PyCUDA uses the driver API, and the corresponding function call for setting a function dynamic memory limits is cuFuncSetAttribute.
I can't find that anywhere in the current PyCUDA tree, and therefore suspect that it has not been implemented.
I'm not sure if this is what you're looking for, but this might help someone looking in this direction.
The dynamic shared memory size in PyCUDA can be set either using:
shared argument in the direct kernel call (the "unprepared call"). For example:
myFunc(arg1, arg2, shared=numBytes, block=(1,1,1), grid=(1,1))
shared_size argument in the prepared kernel call. For example:
myFunc.prepared_call(grid, block, arg1, arg2, shared_size=numBytes)
where numBytes is the amount of memory in bytes you wish to allocate at runtime.

Determining whether Renderscript is running on CPU/GPU & Number of Threads

I can't seem to find any documentation on how to check if RenderScript is actually parallelizing code. I'd like to know whether the CPU or GPU is being utilized and the number of threads dispatched.
The only thing I've found is this bug report:
http://code.google.com/p/android/issues/detail?id=28662
The author mentions that putting rsForEach in the script resulted it it being serialized by pointing to the following debug output:
01-02 00:21:59.960: D/RenderScript(1256): = 0 0x0
01-02 00:21:59.976: D/RenderScript(1256): = 1 0x1
I tried searching for a similar string in LogCat, but I haven't been able to find a match.
Any thoughts?
Update:
Actually I seem to have figured it out. It just seems that my LogCat foo isn't as good as it should be. I filtered the debug output by my application information and found a line like this:
02-26 22:30:05.657: V/RenderScript(26113): rsContextCreate dev=0x5cec0458
02-26 22:30:05.735: V/RenderScript(26113): 0x5d9f63b8 Launching thread(s), CPUs 2
This will only tell you how many CPUs could be used. This will not indicate how many threads or which processor is being used. By design RS avoid exposing this information
In general RS will use all the available CPU cores unless you call a "serial" function such as the rsg* or time functions. As to what criteria will result in a script being punted from the GPU to CPU, this will vary depending on the abilities of each vendors GPU.
The bug you referenced has been fixed in 4.1
I came across the same issue, when I was working with RS. I used Nexus 5 for my testing. I found that initial launch of RS utilized CPU instead of using GPU, this is verified using Trepn 5.0s application. Later I found that Nexus-5 GPU doesnt support double precision (Link to Adreno 330), so by default it ports it onto CPU. To overcome this I used #pragma rs_fp_relaxed top of my rs file along with header declarations.
So if you strictly want to port it onto GPU then it may be best way to find out your mobile GPU specs and try above trick and measure GPU utilization using Trepn 5.0s or equivalent application. As of now RS doesnt expose thread level details but during the implementation we can utilize the x and y arguments of our root - Kernel as thread indexes.
Debugging properties
RenderScript includes the debug.rs.default-CPU-driver and debug.rs.script debugging properties.
debug.rs.default-CPU-driver
Values = 0 or 1
Default value = 0
If set to 1, the Android Open Source Project (AOSP) implementation of RenderScript Compute
is used. This does not use any GPU features.
debug.rs.script
Values = 0 or 1
Default value = 0
If set to 1, additional diagnostic information is printed in the logcat. This information includes
the actual device a kernel is running on, either GPU or application processor.
If a kernel cannot be run on the GPU more detailed information is provided explaining why. For
example:
[RS-DIAG] No support for recursive calls on GPU
Arm® Mali™ RenderScript Best Practices_pdf