DMA memory (first 2 GB) - linux-device-driver

I would like to allocate memory for a DMA transfer between a PCI device and RAM. I am using the following function to allocate the memory:
pci_alloc_consistent
I need the memory to be below 2 GB; otherwise the PCI device fails. But this function can return any address in my RAM (4 GB). Is there a solution?

You can call pci_set_coherent_dma_mask on the struct pci_dev *, or preferably call dma_set_coherent_mask on &pcidev->dev, to set the mask. Setting the mask to DMA_BIT_MASK(31) will restrict coherent mappings to the first 2 GiB. For consistency, you may also want to restrict the streaming (non-coherent) mappings by calling pci_set_dma_mask or dma_set_mask.
A good place to call the above functions is from your PCI driver "probe" function.
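For instance, here is a minimal sketch of such a probe function, assuming a reasonably recent kernel where dma_set_mask_and_coherent() combines the streaming and coherent calls (the function name my_pci_probe is hypothetical):

```c
#include <linux/pci.h>
#include <linux/dma-mapping.h>

static int my_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int err;

    err = pci_enable_device(pdev);
    if (err)
        return err;

    /* Restrict both streaming and coherent DMA to 31-bit addresses,
     * i.e. the first 2 GiB of physical memory. */
    err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(31));
    if (err) {
        dev_err(&pdev->dev, "no suitable DMA mask available\n");
        pci_disable_device(pdev);
        return err;
    }

    /* ... allocate the coherent buffer, map BARs, request IRQs ... */
    return 0;
}
```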

Related

How does SEND bandwidth improve when the registered memory is aligned to system page size? (In Mellanox IBD)

Operating system: RHEL/CentOS 7.9 (latest)
Operation:
Sending 500 MB chunks 21 times from one system to another, connected via Mellanox cables.
(Ethernet controller: Mellanox Technologies MT28908 Family [ConnectX-6])
(The registered memory region (500 MB) is reused for all 21 iterations.)
The gain in message send bandwidth when using aligned_alloc() (with system page size 4096 B) instead of malloc() for the registered memory is around 35 Gbps:
with malloc(): ~86 Gbps
with aligned_alloc(): ~121 Gbps
Since the CPU is not involved in these operations, how is the operation faster with aligned memory?
Please provide useful reference links, if available, that explain this.
What change does aligned memory bring to the read/write operations?
Is it the address translation within the device that gets improved?
[Very limited information is present over the internet about this, hence asking here.]
RDMA operations use either MMIO or DMA to transfer data from main memory to the NIC over the PCIe bus; DMA is used for larger transfers.
The behavior you're observing can be entirely explained by the DMA component of the transfer. DMA operates at the physical level, and a contiguous region in the virtual address space is unlikely to be mapped to a contiguous region in physical memory. This fragmentation incurs costs: more address translation is needed per unit of transfer, and DMA transfers get interrupted at physical page boundaries. A page-aligned buffer starts exactly on a page boundary, so it spans the minimum number of physical pages and requires fewer translation entries in the NIC per transfer.
[1] https://www.kernel.org/doc/html/latest/core-api/dma-api-howto.html
[2] Memory alignment
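For illustration, here is a minimal sketch of the two allocation strategies in front of the usual libibverbs registration call (the helper name and access flags are just an assumed setup; error handling is trimmed):

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

#define CHUNK (500UL * 1024 * 1024)   /* 500 MB, a multiple of 4096 */

struct ibv_mr *register_chunk(struct ibv_pd *pd, int use_aligned)
{
    /* aligned_alloc starts the buffer on a page boundary, so the
     * region spans the minimum number of physical pages and the NIC
     * needs fewer translation entries. */
    void *buf = use_aligned ? aligned_alloc(4096, CHUNK) : malloc(CHUNK);
    if (!buf)
        return NULL;

    return ibv_reg_mr(pd, buf, CHUNK,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
}
```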

Why place Mem[MA] in MB and then copy from MB to IR, rather than going straight from Mem[MA] to IR?

During the fetch stage of the fetch-execute cycle, why are the contents of the cell whose address is in MA (the memory address register) placed in MB (the memory buffer) and then copied to IR (the instruction register), rather than placing the contents of the address in MA directly into the IR?
In theory it would be possible to send instruction fetch memory data directly to the IR (or to both the MB and the IR) — this would require extra hardware: wires and muxes.
You may notice that the architecture (depending on which one it is) makes use of only a few (one or two) busses, and this would effectively add another bus. So I think all we can say is that simplicity is the reason: back in the day when processors were this simple, transistor counts for integrated circuits were very limited.
Going in the direction of making things more efficient, nowadays even simple processors separate instruction memory (usually a cache) from data memory (usually a cache). This independence brings a number of improvements. Take MIPS, even the unpipelined single-cycle processor, for example:
First, the PC (program counter) register replaces the MA for the instruction fetch side of things and the IR replaces the MB (as if loading directly into that register as you're suggesting), but let's also note that the IR can be reduced from being a true register to being wires whose output is stable for the cycle and thus can be worked on by a decode unit directly.  (Stability is gained by not sharing the instruction memory hardware with the data memory hardware; whereas with only a single memory interface, data has to be copied around and stored somewhere so the interface can be shared for both code & data.)
That saves not only the cycle you're referring to, transferring data from MB to IR, but also the cycle before it, spent capturing the data into the MB register in the first place. (Generally speaking, enregistering data costs a cycle, so if you can feed wires without enregistering, that's better, all other factors being equal.)
(Also, depending on the architecture you're looking at: the PC on MIPS uses a dedicated increment unit (adder) rather than trying to share the main ALU and/or the busses for that increment, which could save another cycle or two.)
Second, the data memory can run concurrently with the instruction memory (a nice win), executing a data load or store in parallel with the fetch of the next instruction. The data side also forgoes the MB register as a temporary parking place and instead loads memory data directly into a processor register (the one specified by the load instruction).
Having two dedicated memories creates an independence that reduces the need for register capture while also allowing for parallelism, of course requiring more hardware for the design.
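To make the single-bus vs. split-memory contrast concrete, here is a toy C sketch (purely illustrative, not a hardware description; register and memory names follow the question):

```c
#include <stdint.h>

uint32_t mem[1024];     /* unified memory, shared for code and data */
uint32_t imem[1024];    /* dedicated instruction memory (MIPS-style) */

struct single_bus { uint32_t MA, MB, IR; };

/* Single-bus fetch: the word must be parked in MB first so the one
 * shared memory interface can be reused; costs two cycles. */
void fetch_single_bus(struct single_bus *c)
{
    c->MB = mem[c->MA];   /* cycle 1: MB <- Mem[MA] */
    c->IR = c->MB;        /* cycle 2: IR <- MB      */
}

/* Split-memory fetch: the instruction memory output is stable for the
 * whole cycle, so "IR" is effectively just wires into the decoder. */
uint32_t fetch_split(uint32_t pc)
{
    return imem[pc];      /* one cycle: IR <- IMem[PC] */
}
```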

What is the real use of logical addresses?

This is what I understood of logical addresses :
Logical addresses are used so that data in physical memory does not get corrupted. With logical addresses, processes can't access physical memory directly, ensuring that a process cannot overwrite physical memory locations already in use by others, thereby protecting data integrity.
My doubt is whether logical addresses are really necessary. Couldn't the integrity of the data in physical memory be preserved by some algorithm that simply prevents processes from accessing or modifying memory locations already in use by other processes?
"The integrity of the data on the physical memory could have been preserved by using an algorithm or such which do not allow processes to access or modify memory locations which were already accessed by other processes."
Short Answer: It is impossible to devise an efficient algorithm as proposed to match the same level of performance with logical address.
The issue with this algorithm is: how are you going to intercept each process's memory accesses? Without intercepting memory accesses, it is impossible to check whether a process has the privilege to access a certain memory region. There are ways to intercept memory accesses without using the logical addresses provided by the MMU (memory management unit) on modern CPUs (assume you have a CPU without an MMU), but those methods will not be as efficient as using an MMU. If your CPU does have an MMU, logical address translation is unavoidable, but you could set up a one-to-one mapping to physical memory.
One way to intercept memory accesses without an MMU is to insert a kernel trap instruction before each memory-access instruction in a program. Since we cannot trust user-level programs, this job cannot be delegated to a compiler. Instead, the OS can do it before loading a program into memory: scan the program's binary and insert a kernel trap instruction before each memory access. The kernel can then inspect whether each access should be granted. However, this approach degrades the system's performance badly: every memory access, legal or not, traps into the kernel, and trapping into the kernel involves a context switch that costs many CPU cycles.
Can we do better? What about doing a static analysis of our programs' memory accesses before loading them, so that we only insert traps before illegal accesses? The problem is that processes have no predefined execution order. Say programs A and B both try to access the same memory region: which one should get it under our static analysis? We could assign it randomly, say to B. But then how do we know when B is done with the memory so we can give it to A? Suppose B uses the region for a global variable accessed multiple times throughout its lifetime: do we wait for B to complete before giving the region to A? What if B never ends?
Furthermore, static analysis of memory accesses is impossible in the presence of dynamic memory allocation. If program A or B allocates a region whose size depends on user input, neither the OS nor our static-analysis tool can know ahead of time where the region is or how big it is, and thus cannot do the analysis at all.
Thus, we have to fall back to trapping on every memory access and deciding at runtime whether the access is legal. Sounds familiar? That is exactly the function of the MMU and logical addresses; except that with logical addresses, a trap occurs only when an illegal access actually happens, not on every access.
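As a toy sketch of that contrast (all names hypothetical): the software scheme below pays this check on every single access, standing in for a kernel trap, while a real MMU performs the equivalent check in hardware and traps only on a violation.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SHIFT 12
#define NPAGES     256

static int page_owner[NPAGES];               /* stand-in for page-table
                                                protection bits */
static uint8_t mem[NPAGES << PAGE_SHIFT];    /* 1 MiB of "physical" RAM */

uint8_t checked_load(int pid, uintptr_t addr)
{
    if (page_owner[addr >> PAGE_SHIFT] != pid) {
        fprintf(stderr, "pid %d: illegal access at %#lx\n",
                pid, (unsigned long)addr);
        abort();                             /* the "kernel trap" */
    }
    return mem[addr];                        /* legal, but still paid
                                                the check */
}
```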
The OS presents logical addresses to programs as if they were physical memory. The extra layer (logical addresses) is needed for data-integrity purposes. You can think of logical addresses as the OS's language for addresses: without this mapping, the OS could not tell which "actual" addresses a given program is allowed to use. The logical-to-physical mapping removes that ambiguity, letting the OS know which logical address maps to which physical address and whether that physical location is permitted to the program. Integrity checks are performed on logical addresses, not on physical memory, because a logical address can be checked and remapped freely, whereas manipulating physical memory directly would affect the already-running processes using it.
I would also mention that the base and limit registers are loaded by privileged instructions, privileged instructions execute only in kernel mode, and only the operating system runs in kernel mode; therefore user programs cannot modify these registers directly. I hope I helped a little :)
There are some things that you need to understand.
First of all, with an MMU in use the CPU does not access physical memory directly. To compute a physical address the CPU needs a logical address; the logical address is then translated into the physical address. That is the basic role of logical addresses in accessing physical memory: without one, you cannot get at it. This translation is necessary. If a system did not use virtual/logical addresses, it would be highly vulnerable to hackers or intruders, who could access physical memory directly and manipulate useful data at any location.
Second, when a process runs, the CPU generates logical addresses in order to reference that process in main memory. The purpose of the logical address here is memory management: the hardware registers are small compared to the actual size of a process, so memory must be relocatable to obtain optimum efficiency. This is where the MMU (memory management unit) comes into play. The physical address is computed by the MMU from the logical address: processes generate logical addresses, and the MMU accesses physical memory based on them.
This example will make it clear.
Suppose a process's data is stored at physical address 50: the base register holds 50, and the program uses logical address (offset) 0. Now the MMU relocates the data to address 100: only the base register changes, to 100, while the program still uses logical address 0, and 100 + 0 reaches the data at its new home. No matter how many times the data changes physical location, the change is absorbed by the mapping, so the same logical address keeps giving the program access to the data wherever it physically lives.
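A minimal sketch of that base+limit translation, with hypothetical structure and function names (a real MMU performs this check and addition in hardware on every access):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct mapping { uintptr_t base, limit; };

uintptr_t translate(const struct mapping *m, uintptr_t logical)
{
    if (logical >= m->limit) {       /* outside the process's region */
        fprintf(stderr, "fault at logical %#lx\n", (unsigned long)logical);
        exit(1);
    }
    return m->base + logical;        /* physical = base + offset */
}

int main(void)
{
    struct mapping m = { .base = 100, .limit = 4096 };
    /* The program always uses logical address 0; relocating the data
     * only changes m.base, never the program itself. */
    printf("logical 0 -> physical %lu\n", (unsigned long)translate(&m, 0));
    return 0;
}
```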
I hope it helps.

HSA Data copy between RAM and GPU-RAM

Reading the Wikipedia page about HSA, I found this block diagram.
I could not understand the benefit of passing a pointer over PCIe.
Does this avoid copying data from system memory to graphics memory?
As far as I understand, for the GPU to process the contents of the pointer, the data needs to be present in graphics memory.
If you have separate graphics memory, but you're doing HSA, you have to somehow unify the address spaces. The CPU can see graphics memory, mapped to physical address space. And the GPU can access main memory via DMA. You can set up the CPU and GPU with page tables that direct the same virtual addresses to the same place, which will require one of them to go (transparently) over the PCIe bus.
Where you save time and energy is that you don't have to copy everything you MIGHT want to access; the CPU and GPU access only the data they actually need to use.
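As a loose, CPU-only analogy (not the HSA API itself): two processes can share one buffer at the same virtual address, so a bare pointer passed between them stays meaningful and no copy is ever issued.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* MAP_SHARED | MAP_ANONYMOUS memory survives fork() at the same
     * virtual address in both processes. */
    char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    if (fork() == 0) {                       /* "GPU" stand-in */
        strcpy(buf, "written through the shared pointer");
        _exit(0);
    }
    wait(NULL);
    printf("parent reads: %s (at %p)\n", buf, (void *)buf);  /* "CPU" */
    return 0;
}
```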

How does CPU access BIOS instructions stored in external memory?

During the boot process, the CPU reads the address of the system BIOS from the reset vector and jumps to the location where the BIOS is stored. My questions here are:
* As the BIOS is stored in some external memory like an EEPROM (and not in main memory), how does the CPU access this external memory?
* Is this external memory already mapped to some region of the address space, so that the CPU just jumps to this mapped region to access the BIOS instructions? Or does it actually fetch the instructions from the external memory where the BIOS is stored?
First I can refer you to a detailed article:
https://resources.infosecinstitute.com/system-address-map-initialization-x86x64-architecture-part-2-pci-express-based-systems/#gref
But I will summarize here:
When CPU is "resetted", the reset vector interrupt (a specific memory address - 0xFFFFFFF0H) is executed - and the ROM content has to be there at that specific address.
Intel Reset Vector
How is the BIOS ROM mapped into address space on PC?
Who loads the BIOS and the memory map during boot-up
0xffff0 and the BIOS (hardwired address mapping is also explained/emphasized here)
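On a running Linux box you can actually peek at this mapping. Here is a sketch (requires root, and access to /dev/mem may be blocked by CONFIG_STRICT_DEVMEM):

```c
/* Peek at the legacy BIOS shadow region (0xF0000-0xFFFFF) to see that
 * the ROM really is mapped into the physical address space. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mem", O_RDONLY);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    uint8_t *bios = mmap(NULL, 0x10000, PROT_READ, MAP_SHARED,
                         fd, 0xF0000);       /* 64 KiB legacy BIOS area */
    if (bios == MAP_FAILED) { perror("mmap"); return 1; }

    for (int i = 0; i < 16; i++)             /* dump the first bytes */
        printf("%02x ", bios[i]);
    putchar('\n');

    munmap(bios, 0x10000);
    close(fd);
    return 0;
}
```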
When the BIOS executes, it also initializes hardware like the VGA controller and the DRAM. Sometimes the RAM and BIOS ranges may overlap, and usually the OS will take over and reimplement all the functionality of the BIOS (which is specific to each motherboard).
What information does BIOS load into RAM?
https://resources.infosecinstitute.com/system-address-map-initialization-in-x86x64-architecture-part-1-pci-based-systems/
The diagrams in the articles above illustrate how the motherboard designer assigns the address ranges usable by the different hardware peripherals to certain windows, and the OS then has the responsibility of allocating RAM in the regions left unused by hardware. Don't forget that each core (in 32-bit mode) can only address 4 GB of memory, while the physical memory available can be much more than that; this is where page tables come in.
Once the page tables are set up, the TLB and page tables are used to provide indirect and efficient access to RAM.
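As a toy sketch of what such a page-table walk does (field widths follow classic 32-bit x86 paging; the structures are simplified stand-ins, not real CR3/PTE layouts):

```c
#include <stdint.h>

#define PTE_PRESENT 0x1u

/* pd: page directory of 1024 pointers, each to a 1024-entry page table
 * whose entries hold a frame base plus flag bits in the low 12 bits.
 * Returns 0 to model a page fault. */
uint32_t walk(uint32_t *const *pd, uint32_t vaddr)
{
    uint32_t dir = (vaddr >> 22) & 0x3FF;    /* top 10 bits    */
    uint32_t tab = (vaddr >> 12) & 0x3FF;    /* middle 10 bits */
    uint32_t off = vaddr & 0xFFF;            /* low 12 bits    */

    const uint32_t *pt = pd[dir];
    if (!pt || !(pt[tab] & PTE_PRESENT))
        return 0;                            /* would trap to the OS */

    return (pt[tab] & ~0xFFFu) | off;        /* frame base | offset */
}
```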
Normally the CPU accesses this data by interfacing with an SPI controller, which in turn communicates with the EEPROM to fulfill the task requested or deliver the information the CPU asked for.
And no, the external memory is not mapped anywhere, and no, the CPU does not just jump to it. It communicates for whatever it or the BIOS needs through SPI or I²C, depending on the age of the machine.